Thread: [BUGS] BUG #14808: V10-beta4, backend abort

[BUGS] BUG #14808: V10-beta4, backend abort

From

phb07@apra.asso.fr

Date:

09 September 2017, 09:48:53

The following bug has been logged on the website:

Bug reference:      14808
Logged by:          Philippe BEAUDOIN
Email address:      phb07@apra.asso.fr
PostgreSQL version: 10beta4
Operating system:   Linux
Description:

Hi all,

While continuing to play with transition tables in statement level trigger,
I have encountered what looks like a backend abort.

I have been able to reproduce the case with the following simple script:

#!/bin/sh
export PGHOST=localhost
export PGPORT=5410

dropdb test
createdb test

psql test  <<*EOF*
\set ON_ERROR_STOP on

CREATE OR REPLACE FUNCTION create_tbl(grpdef_schema TEXT, grpdef_tblseq
TEXT)
RETURNS VOID LANGUAGE plpgsql SECURITY DEFINER AS
\$_create_tbl\$ DECLARE   v_fullTableName     TEXT;   v_logTableName      TEXT;   v_logFnctName       TEXT;
v_colList1         TEXT;   v_colList2          TEXT;   v_colList3          TEXT;   v_colList4          TEXT; BEGIN
 
-- build the different name for table, trigger, functions,...   v_fullTableName   = grpdef_schema || '.' ||
grpdef_tblseq;  v_logTableName    = grpdef_tblseq || '_log';   v_logFnctName     = grpdef_tblseq || '_log_idx';
 
-- build the tables's columns lists   SELECT string_agg('tbl.' || col_name, ','),          string_agg('o.' || col_name
||' AS ' || col_name_o || ', n.' || 
col_name || ' AS ' || col_name_n, ','),          string_agg('r.' || col_name_o, ','),          string_agg('r.' ||
col_name_n,',')     INTO v_colList1, v_colList2, v_colList3, v_colList4 FROM (     SELECT quote_ident(attname) AS
col_name,quote_ident('o_' || attname) 
AS col_name_o, quote_ident('n_' || attname) AS col_name_n       FROM pg_catalog.pg_attribute       WHERE attrelid =
v_fullTableName::regclass        AND attnum > 0 AND NOT attisdropped       ORDER BY attnum) AS t;
 
-- create the log table: it looks like the application table, with some
additional technical columns   EXECUTE 'DROP TABLE IF EXISTS ' || v_logTableName;   EXECUTE 'CREATE TABLE ' ||
v_logTableName       || ' (LIKE ' || v_fullTableName || ') ';   EXECUTE 'ALTER TABLE ' || v_logTableName        || '
ADDCOLUMN verb      VARCHAR(3),'        || ' ADD COLUMN tuple     VARCHAR(3)';
 
-- create the log function   EXECUTE 'CREATE OR REPLACE FUNCTION ' || v_logFnctName || '() RETURNS
TRIGGER AS \$logfnct\$'        || 'DECLARE'        || '  r          RECORD;'        || 'BEGIN'        || '  IF
(TG_OP= ''DELETE'') THEN'        || '    INSERT INTO ' || v_logTableName || ' SELECT ' || v_colList1 
|| ', ''DEL'', ''OLD'' FROM old_table tbl;'        || '  ELSIF (TG_OP = ''INSERT'') THEN'        || '    INSERT INTO '
||v_logTableName || ' SELECT ' || v_colList1 
|| ', ''INS'', ''NEW'' FROM new_table tbl;'        || '  ELSIF (TG_OP = ''UPDATE'') THEN'        || '    FOR r IN'
 || '      WITH'        || '          o AS (SELECT ' || v_colList1 || ', row_number() OVER 
() AS ord FROM old_table tbl'        || '          ),'        || '          n AS (SELECT ' || v_colList1 || ',
row_number()OVER 
() AS ord FROM new_table tbl'        || '      )'        || '      SELECT ' || v_colList2        || '      FROM o JOIN
nUSING(ord)'        || '      LOOP'        || '        INSERT INTO ' || v_logTableName         || '          SELECT '
||v_colList3 || ', ''UPD'', ''OLD'';'        || '        INSERT INTO ' || v_logTableName        || '          SELECT '
||v_colList4 || ', ''UPD'', ''NEW'';'        || '    END LOOP;'        || '  END IF;'        || '  RETURN NULL;'
||'END;'        || '\$logfnct\$ LANGUAGE plpgsql SECURITY DEFINER;';
 
-- creation of the log trigger on the application table, using the
previously created log function   EXECUTE 'CREATE TRIGGER insert_log_trg'        || '  AFTER INSERT ON ' ||
v_fullTableName|| ' REFERENCING NEW 
TABLE AS new_table'        || '  FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';   EXECUTE 'CREATE TRIGGER update_log_trg'        || '  AFTER UPDATE ON ' || v_fullTableName || ' REFERENCING OLD
TABLE AS old_table NEW TABLE AS new_table'        || '  FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';   EXECUTE 'CREATE TRIGGER delete_log_trg'        || '  AFTER DELETE ON ' || v_fullTableName || ' REFERENCING OLD
TABLE AS old_table'        || '  FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';   RETURN; END;
\$_create_tbl\$;

CREATE TABLE myTbl1 ( col11       INT      NOT NULL, col12       TEXT     , col13       TEXT     , PRIMARY KEY (col11)
);

CREATE TABLE myTbl3 ( col41       INT      NOT NULL, col44       INT      , PRIMARY KEY (col41), FOREIGN KEY (col44)
REFERENCESmyTbl1 (col11) ON DELETE CASCADE ON UPDATE 
SET NULL
);

select create_tbl('public','mytbl1');
select create_tbl('public','mytbl3');

insert into myTbl1 select i, 'ABC', 'abc' from generate_series (1,10100) as
i;
update myTbl1 set col13=E'\\034'::bytea where col11 <= 500;
delete from myTbl1 where col11 > 10000;

*EOF*

As a result, the last DELETE statement fails. I get:

CREATE FUNCTION
CREATE TABLE
CREATE TABLE
NOTICE:  table "mytbl1_log" does not exist, skippingcreate_tbl 
------------
(1 row)

NOTICE:  table "mytbl3_log" does not exist, skippingcreate_tbl 
------------
(1 row)

INSERT 0 1101
UPDATE 0
server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing
therequest.
 
connection to server was lost


The postgresql.conf file has default parameters, except:
listen_addresses = '*'
port = 5410
max_prepared_transactions 5
logging_collector = on
track_functions = all
track_commit_timestamp = on

Best regards.
Philippe Beaudoin.



--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Michael Paquier

Date:

09 September 2017, 14:27:05

On Sat, Sep 9, 2017 at 3:48 PM,  <phb07@apra.asso.fr> wrote:
> INSERT 0 1101
> UPDATE 0
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> connection to server was lost

Crash is confirmed here so I am adding an open item. I have not dug
into the issue seriously, but at short glance we have a code path
trying to access something that has already been free'd:
#0  0x0000561bfe0f0959 in tuplestore_tuple_count
(state=0x7f7f7f7f7f7f7f7f) at tuplestore.c:548
548        return state->tuples;
(gdb) bt
#0  0x0000561bfe0f0959 in tuplestore_tuple_count
(state=0x7f7f7f7f7f7f7f7f) at tuplestore.c:548
#1  0x0000561bfdd8ca22 in SPI_register_trigger_data
(tdata=0x7ffc92083860) at spi.c:2764
#2  0x00007f7075da7156 in plpgsql_exec_trigger (func=0x561bfe7bc5a8,
trigdata=0x7ffc92083860) at pl_exec.c:692
#3  0x00007f7075da08e7 in plpgsql_call_handler (fcinfo=0x7ffc920833d0)
at pl_handler.c:24
(gdb) p state
$1 = (Tuplestorestate *) 0x7f7f7f7f7f7f7f7
-- 
Michael


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

09 September 2017, 20:08:50

Michael Paquier <michael.paquier@gmail.com> writes:
> Crash is confirmed here so I am adding an open item. I have not dug
> into the issue seriously, but at short glance we have a code path
> trying to access something that has already been free'd:

I think this can be blamed on commit c46c0e52.

What is happening is that the AFTER triggers are queuing more triggers,
which have TransitionCaptureStates associated, but 
ExecEndModifyTable thinks it can DestroyTransitionCaptureState
unconditionally.  When the subsidiary triggers eventually get executed,
their ats_transition_capture pointers are pointing at garbage.

My first instinct is to get rid of DestroyTransitionCaptureState
altogether, on the grounds that the TransitionCaptureState will
go away at transaction cleanup and we can't really get rid of it
any sooner than that.

The other question that seems pretty relevant is why the subsidiary
triggers, which are the constraint triggers associated with the
tables' foreign keys, are getting queued with TransitionCaptureState
pointers in the first place.  This seems horridly expensive and
unnecessary.  It also directly contradicts the claim in
MakeTransitionCaptureState that constraint triggers cannot have
transition tables.
        regards, tom lane

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

09 September 2017, 21:59:54

I wrote:
> What is happening is that the AFTER triggers are queuing more triggers,
> which have TransitionCaptureStates associated, but 
> ExecEndModifyTable thinks it can DestroyTransitionCaptureState
> unconditionally.  When the subsidiary triggers eventually get executed,
> their ats_transition_capture pointers are pointing at garbage.

On closer inspection, the issue is specific to TCS-using AFTER triggers
that are being fired as a result of foreign key enforcement triggers.
In the example, each row deleted from myTbl1 causes firing of the
ON DELETE CASCADE enforcement trigger, which will execute a DELETE
against myTbl3.  That won't delete anything (since myTbl3 is empty)
but we nonetheless queue a firing of the TCS-using AFTER STATEMENT
trigger for myTbl3.

Now, the ExecEndModifyTable instance for the DELETE supposes that
all TCS-using triggers must have been fired during ExecutorFinish;
but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
it is because ri_PerformCheck tells SPI not to fire triggers.

I do not recall the exact details of why that is, but I believe
the intention is to fire RI-triggered triggers at the end of the
outer statement rather than exposing the individual RI action as
a statement.

I made a quick hack to not delete the TCS if EXEC_FLAG_SKIP_TRIGGERS
is set, as attached.  The example succeeds given this.  However,
note that the AFTER STATEMENT trigger will be fired once per myTbl1 row
deletion, each time seeing a transition table consisting of only the
myTbl3 rows that went away as a result of that one single row deletion.
I'm not sure that this is per the letter or spirit of the SQL spec.

If it isn't, I don't think there is much we can do about it for v10.
In v11 or later, we could think about somehow merging the transition
tables for all the RI actions triggered by one statement.  This might
well need to go along with rewriting the RI framework to use statement
triggers and TCS state itself.  I think people had already muttered
about doing that.

> The other question that seems pretty relevant is why the subsidiary
> triggers, which are the constraint triggers associated with the
> tables' foreign keys, are getting queued with TransitionCaptureState
> pointers in the first place.  This seems horridly expensive and
> unnecessary.  It also directly contradicts the claim in
> MakeTransitionCaptureState that constraint triggers cannot have
> transition tables.

The other part of the attached patch tweaks AfterTriggerSaveEvent
to not store an ats_transition_capture pointer for a deferrable
trigger event.  This doesn't have anything directly to do with
the current bug, because although the RI triggers are being stored
with such pointers, they aren't actually dereferencing them.  However,
it seems like a good idea anyway, to (1) ensure that we don't have
dangling pointers in the trigger queue, and (2) possibly allow
more merging of shared trigger states.

            regards, tom lane

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bd84778..49586a3 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ExecEndModifyTable(ModifyTableState *nod
*** 2318,2325 ****
  {
      int            i;

!     /* Free transition tables */
!     if (node->mt_transition_capture != NULL)
          DestroyTransitionCaptureState(node->mt_transition_capture);

      /*
--- 2318,2331 ----
  {
      int            i;

!     /*
!      * Free transition tables, unless this query is being run in
!      * EXEC_FLAG_SKIP_TRIGGERS mode, which means that it may have queued AFTER
!      * triggers that won't be run till later.  In that case we'll just leak
!      * the transition tables till end of (sub)transaction.
!      */
!     if (node->mt_transition_capture != NULL &&
!         !(node->ps.state->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
          DestroyTransitionCaptureState(node->mt_transition_capture);

      /*
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index da0850b..bbfbc06 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5474,5480 ****
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         new_shared.ats_transition_capture = transition_capture;

          afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth],
                               &new_event, &new_shared);
--- 5474,5482 ----
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         /* deferrable triggers cannot access transition data */
!         new_shared.ats_transition_capture =
!             trigger->tgdeferrable ? NULL : transition_capture;

          afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth],
                               &new_event, &new_shared);

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

10 September 2017, 22:02:53

I wrote:
> Now, the ExecEndModifyTable instance for the DELETE supposes that
> all TCS-using triggers must have been fired during ExecutorFinish;
> but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
> it is because ri_PerformCheck tells SPI not to fire triggers.

In view of the fact that 10rc1 wraps tomorrow, there doesn't seem
to be time to do anything better about this than my quick hack.
So I went ahead and pushed that, with a regression test case.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

11 September 2017, 00:46:52

On Mon, Sep 11, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Now, the ExecEndModifyTable instance for the DELETE supposes that
>> all TCS-using triggers must have been fired during ExecutorFinish;
>> but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
>> it is because ri_PerformCheck tells SPI not to fire triggers.
>
> In view of the fact that 10rc1 wraps tomorrow, there doesn't seem
> to be time to do anything better about this than my quick hack.
> So I went ahead and pushed that, with a regression test case.

Thank you for the testing and report Philippe, and for the analysis and fix Tom.

@@ -2318,8 +2318,14 @@ ExecEndModifyTable(ModifyTableState *node){       int                     i;

-       /* Free transition tables */
-       if (node->mt_transition_capture != NULL)
+       /*
+        * Free transition tables, unless this query is being run in
+        * EXEC_FLAG_SKIP_TRIGGERS mode, which means that it may have
queued AFTER
+        * triggers that won't be run till later.  In that case we'll just leak
+        * the transition tables till end of (sub)transaction.
+        */
+       if (node->mt_transition_capture != NULL &&
+               !(node->ps.state->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
DestroyTransitionCaptureState(node->mt_transition_capture);

As an idea for a v11 patch, I wonder if it would be better to count
references instead of leaking the TCS as soon as fk-on-cascade
triggers enter the picture.  The ModifyTable node would release its
reference in ExecEndModifyTable(), and the queued events' references
would be counted too.  I briefly considered a scheme like that before
proposing 501ed02c, but concluded, as it turns out incorrectly, that
there was no way for a trigger referencing node->mt_transition_capture
to fire after that point.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

11 September 2017, 16:08:44

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> As an idea for a v11 patch, I wonder if it would be better to count
> references instead of leaking the TCS as soon as fk-on-cascade
> triggers enter the picture.

I thought of reference counting as well, but I think it's not really
necessary if we organize things correctly. The key problem here IMO
is that the transition tables need to be attached to a trigger execution
context, rather than supposing that they can be completely managed by
ModifyTable plan nodes.

I dug around in the SQL spec and convinced myself that indeed it does
require all RI updates triggered by a single statement to be presented
in a single transition table. Concretely, it says
A trigger execution context consists of a set of state changes. Within a trigger execution context, each
statechange is uniquely identified by a trigger event, a subject table, and a column list. The trigger
eventcan be DELETE, INSERT, or UPDATE. A state change SC contains a set of transitions, a set of statement-level
triggers considered as executed for SC, and a set of row-level triggers, each paired with the set of rows
inSC for which it is considered as executed.

Note the "uniquely identified" bit --- you aren't allowed to have multiple
SCs for the same table and same kind of event, at least up to the bit
about column lists. I've not fully wrapped my head around the column list
part of it, but certainly all effects of a particular RI constraint will
have the same column list.

Now, the lifespan of a trigger execution context is one SQL-statement,
but that's one user-executed SQL-statement --- the queries that we gin
up for RI enforcement are an implementation detail that don't get their
own context. (The part of the spec that defines RI actions seems pretty
clear that the actions insert state changes into the active statement's
trigger execution context, not create their own context.)

It's also interesting in this context to re-read commit 9cb840976,
which is what created the "skip trigger" business. That exhibits a
practical problem that you hit if you don't do it like this.

So ISTM that basically we need trigger.c to own the transition tables.
ModifyTable, instead of just creating a TCS, needs to ask trigger.c for a
TCS relevant to its target table and command type (and column list if we
decide we need to bother with that). trigger.c would discard TCSes during
AfterTriggerEndQuery, where it closes out a given query_depth level.

This actually seems like it might not be that big a change, other than
the issue of what do the column lists mean exactly. Maybe we should try
to get it done for v10, rather than shipping known-non-spec-compliant
behavior.
regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Alvaro Herrera

Date:

11 September 2017, 16:18:41

Tom Lane wrote:

> This actually seems like it might not be that big a change, other than
> the issue of what do the column lists mean exactly.  Maybe we should try
> to get it done for v10, rather than shipping known-non-spec-compliant
> behavior.

I think this means we need to abort RC1 today and get another beta out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

11 September 2017, 20:03:51

I wrote:
> Note the "uniquely identified" bit --- you aren't allowed to have multiple
> SCs for the same table and same kind of event, at least up to the bit
> about column lists. I've not fully wrapped my head around the column list
> part of it, but certainly all effects of a particular RI constraint will
> have the same column list.

After further study of the standard, it seems that the column list
business is meant to allow identification of which UPDATE triggers
with column lists ("AFTER UPDATE OF columnList" syntax) are supposed
to be triggered. The spec is confusing because they describe this in
a way that would be impossibly inefficient if implemented verbatim.
They say that, given a row UPDATE affecting a certain set of columns,
you're supposed to create state changes labeled with every possibly
relevant trigger column set:

— Let OC be the set of column names identifying the columns [being updated]
— Let PSC be the set consisting of the empty set and every subset of the set of column names of [the target table] that
hasat least one column that is in OC

- [ create a state change for each element of PSC ]

Then an update trigger is triggered by a particular state change if its
column list exactly matches that state change's list. This seems like a
remarkably stupid way to go about it; you'd end up with many state changes
that correspond to no existing trigger and are never of any use.

However, if I'm reading it right, there is a property of this that is
very significant in the context of RI updates. For an ordinary UPDATE
SQL command, all the row updates have the same set of target columns,
but that's not necessarily true for all the RI updates that a single SQL
command could trigger. If there's more than one RI constraint between
two tables then an update on the referenced table could trigger sets of
updates that affect overlapping, but not identical, sets of rows in the
referencing tables --- and those sets would have different sets of target
columns. So a given column-specific trigger might be interested in some
or all of the RI updates. And if it is interested, and is a statement
trigger, it is supposed to be fired just once with a transition table
containing all the rows it is interested in.

In other words, UPDATE triggers with different column lists potentially
need to see different transition tables, and any given row that was
updated might need to appear in some subset of those tables.

This seems like rather a mess to implement. I wonder whether I'm
reading it right, and whether other DBMSes actually implement it
like that.

I think that what might be a good short-term solution is to refuse
creation of column-specific triggers with transition tables (ie,
if you ask for a transition table then you can only say AFTER UPDATE
not AFTER UPDATE OF columnList). Then, all TT-using triggers are
interested in all modified rows and we don't have to distinguish
different column lists for the purposes of transition tables.
Then the problem reduces to one TCS per target table and event type,
which doesn't seem too hard to do.
regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Robert Haas

Date:

11 September 2017, 22:17:44

On Sat, Sep 9, 2017 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My first instinct is to get rid of DestroyTransitionCaptureState
> altogether, on the grounds that the TransitionCaptureState will
> go away at transaction cleanup and we can't really get rid of it
> any sooner than that.

End of transaction, or end of query?  I'm not sure what happens when
triggers are deferred, but I think there are a lot of cases when we
want to throw away the tuplestore immediately, not hold on to it for
the rest of the transaction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

11 September 2017, 22:24:40

Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Sep 9, 2017 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My first instinct is to get rid of DestroyTransitionCaptureState
>> altogether, on the grounds that the TransitionCaptureState will
>> go away at transaction cleanup and we can't really get rid of it
>> any sooner than that.

> End of transaction, or end of query?  I'm not sure what happens when
> triggers are deferred, but I think there are a lot of cases when we
> want to throw away the tuplestore immediately, not hold on to it for
> the rest of the transaction.

As things stand, it's end of subtransaction, because the TCSes
are allocated in CurTransactionContext.  See the argument in
MakeTransitionCaptureState.

And yes, this is inefficient.  The quick-hack patch I committed yesterday
only pays the price if you have RI triggers cascading changes into a table
that also has triggers-with-transition-tables, but I can certainly believe
that it could get expensive in such a case.

The fix proposal discussed downthread should fix the inefficiency as well
as the spec compliance problem.  But personally I'm far more worried about
the latter.
        regards, tom lane

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

12 September 2017, 03:38:28

On Tue, Sep 12, 2017 at 5:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Note the "uniquely identified" bit --- you aren't allowed to have multiple
>> SCs for the same table and same kind of event, at least up to the bit
>> about column lists.  I've not fully wrapped my head around the column list
>> part of it, but certainly all effects of a particular RI constraint will
>> have the same column list.

Aside from the RI case, the other user visible change in behaviour
will be for statements that update the same table via multiple
ModifyTable nodes (wCTEs).  Our regression test has:

with wcte as (insert into table1 values (42)) insert into table2 values ('hello world');

... which demonstrates the fix for the original complaint that table1
and table2 earlier tried to use the same transition table (and
crashed).  A new variant inserting into table1 twice would show the
difference.  Today we get:

postgres=# with wcte as (insert into table1 values (42))            insert into table1 values (43);
NOTICE:  trigger = table1_trig, new table = (43,)
NOTICE:  trigger = table1_trig, new table = (42,)
INSERT 0 1

Presumably with your change there will be a single transition table
for inserts into table, holding both (42,) and (43,).  But will we
fire the trigger once or twice?  There is something fishy about making
it fire twice but show the same tuples to both invocations (for
example, it might break Kevin's proposed counting algorithm which this
feature is intended to support), but firing only once requires some
new inter-node co-ordination.

> In other words, UPDATE triggers with different column lists potentially
> need to see different transition tables, and any given row that was
> updated might need to appear in some subset of those tables.
>
> This seems like rather a mess to implement.  I wonder whether I'm
> reading it right, and whether other DBMSes actually implement it
> like that.

I guess the alternative is storing extra per-tuple metadata,
transferring more work to the reader.

The DB2 documentation has this example[1]:
CREATE TRIGGER REORDER    AFTER UPDATE OF ON_HAND, MAX_STOCKED ON PARTS    REFERENCING NEW_TABLE AS NTABLE    FOR EACH
STATEMENTMODE DB2SQL      BEGIN ATOMIC        SELECT ISSUE_SHIP_REQUEST(MAX_STOCKED - ON_HAND, PARTNO)          FROM
NTABLE       WHERE (ON_HAND < 0.10 * MAX_STOCKED);    END

I can't find any explicit discussion of whether this trigger could
ever see a transition row that results from an update that didn't name
ON_HAND or MAX_STOCKED.  I don't have DB2 access and I'm not sure how
I'd test that...  maybe with a self-referencing fk declared ON UPDATE
CASCADE?

Thanks to prodding from Peter Geoghegan we tackled the question of
whether the <trigger action time> clause controls just when the
trigger fires or also which transition tuples it sees.  By looking at
some wording relating to MERGE we concluded it must be both,
culminating in commit 8c55244a which separates the UPDATEs and INSERTs
resulting from INSERT ... ON CONFLICT DO UPDATE.  That had the
annoying result that we had to ban the use of (non-standard) "OR" in
<trigger action time> when transition tables are in play.  This FOR
UPDATE OF ... transition business seems sort of similar, but would
create arbitrarily many transition tables and require the executor to
write into all of them, or perhaps store extra meta data along with
captured rows for later filtering during scan.

> I think that what might be a good short-term solution is to refuse
> creation of column-specific triggers with transition tables (ie,
> if you ask for a transition table then you can only say AFTER UPDATE
> not AFTER UPDATE OF columnList).  Then, all TT-using triggers are
> interested in all modified rows and we don't have to distinguish
> different column lists for the purposes of transition tables.
> Then the problem reduces to one TCS per target table and event type,
> which doesn't seem too hard to do.

+1

Presumably would-be authors of triggers-with-transition-tables that
would fire only AFTER UPDATE OF foo already have to deal with the
possibility that you updated foo to the same value.  So I don't think
too much is lost, except perhaps some efficiency.

Thinking a bit harder about whether you might have semantic (rather
than performance) reasons to use AFTER UPDATE OF ... with subset TTs,
it occurs to me that there may be implications for transition tables
with inheritance.  We decided to disallow transition tables on
non-partition inheritance children anyway (see 501ed02c), but DB2
supports the equivalent.  It has a system of inheritance ("typed
tables", possibly conforming to SQL:1999 though I've never looked into
that) but has different rules about when row triggers and statement
triggers fire when you run DML statements on a table hierarchy.  Don't
quote me but it's something like our rules plus (1) row triggers of
all supertables of an affected table also fire (unless created with
CREATE TRIGGER ... ONLY), and (2) statement triggers of affected
subtables also fire.  Implementing that for our transition tables
would probably require more tuplestores and/or dynamic tuple
conversion and filtering during later scan.  Perhaps AFTER UPDATE OF
column_that_only_this_child_and_its_children_have would fire for
direct and subtable updates but not via-the-supertable updates.  That
is currently completely irrelevant due to our set of supported
features: different firing rules, and prohibition on children with
transition tables.  Some related topics might return soon when people
get more experience with partitions and start wanting to declare row
triggers on partitioned tables (perhaps including foreign key checks)
or implement Kevin's clever batch-mode foreign key check concept.

[1] https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtrigger.html

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

12 September 2017, 03:53:00

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Aside from the RI case, the other user visible change in behaviour
> will be for statements that update the same table via multiple
> ModifyTable nodes (wCTEs).  Our regression test has:

> with wcte as (insert into table1 values (42))
>   insert into table2 values ('hello world');

> ... which demonstrates the fix for the original complaint that table1
> and table2 earlier tried to use the same transition table (and
> crashed).  A new variant inserting into table1 twice would show the
> difference.  Today we get:

> postgres=# with wcte as (insert into table1 values (42))
>              insert into table1 values (43);
> NOTICE:  trigger = table1_trig, new table = (43,)
> NOTICE:  trigger = table1_trig, new table = (42,)
> INSERT 0 1

> Presumably with your change there will be a single transition table
> for inserts into table, holding both (42,) and (43,).  But will we
> fire the trigger once or twice?

Not necessarily.  That would be true only if we allow the WCTE to share
trigger context with the outer query, which I think it does not today.
I've not checked the code, but presumably if we fire the trigger twice
right now, that means there are separate trigger contexts, ie somebody
is calling AfterTriggerBeginQuery/AfterTriggerEndQuery around the WCTE.
If not, maybe we could make it do so.  OTOH, looking at the text of
the spec, I think it's darn hard to justify the behavior shown above.

The reason that the RI case would share trigger context with the outer
query is that we'd *not* call AfterTriggerBeginQuery/AfterTriggerEndQuery
around the RI query, which would be driven by the same skip_trigger
logic that exists today.

>> This seems like rather a mess to implement.  I wonder whether I'm
>> reading it right, and whether other DBMSes actually implement it
>> like that.

> I guess the alternative is storing extra per-tuple metadata,
> transferring more work to the reader.

I really don't want any more per-tuple state.  Adding the TCS link
was costly enough in terms of how big the tuple queue storage is.
        regards, tom lane

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

12 September 2017, 04:22:36

On Tue, Sep 12, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OTOH, looking at the text of
> the spec, I think it's darn hard to justify the behavior shown above.

Yeah.  I assume we always fired statement triggers for each separate
instance of the same table mentioned in a wCTE since they were
invented.  I just confirmed that that is the case in 9.6.  That may
not have been in the spirit of the spec, but it's hard to say because
the spec doesn't have wCTEs IIUC and it mattered less because they
didn't receive any data.

Now that they can optionally see data resulting from modifications, it
seems pretty hard to use this feature to build anything that consumes
the transition data and has to be reliable (matview state,
replication-like systems etc) if we make any choice other than (1)
each instance of a given table fires a statement trigger separately
and sees only the rows it touched, or (2) the statement trigger is
fired once for all instances of a table, and sees all the transition
tuples.  Based on the SQL spec excerpts you've highlighted, I suppose
it seems likely that if the spec had wCTEs it would expect them to
work like (2).

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

14 September 2017, 01:44:02

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Aside from the RI case, the other user visible change in behaviour
> will be for statements that update the same table via multiple
> ModifyTable nodes (wCTEs).  Our regression test has:

> with wcte as (insert into table1 values (42))
>   insert into table2 values ('hello world');

> ... which demonstrates the fix for the original complaint that table1
> and table2 earlier tried to use the same transition table (and
> crashed).

BTW, as I'm digging around in trigger.c, I can't help noticing that
it provides a single "fdw_tuplestore" per trigger query level (a/k/a
trigger execution context).  I've not tried to test this, but it
sure looks like a wCTE like your example above, directed at two
separate foreign tables with triggers, would fail for exactly the
same reason.  That'd be a bug of pretty long standing.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

14 September 2017, 02:00:40

On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Aside from the RI case, the other user visible change in behaviour
>> will be for statements that update the same table via multiple
>> ModifyTable nodes (wCTEs).  Our regression test has:
>
>> with wcte as (insert into table1 values (42))
>>   insert into table2 values ('hello world');
>
>> ... which demonstrates the fix for the original complaint that table1
>> and table2 earlier tried to use the same transition table (and
>> crashed).
>
> BTW, as I'm digging around in trigger.c, I can't help noticing that
> it provides a single "fdw_tuplestore" per trigger query level (a/k/a
> trigger execution context).  I've not tried to test this, but it
> sure looks like a wCTE like your example above, directed at two
> separate foreign tables with triggers, would fail for exactly the
> same reason.  That'd be a bug of pretty long standing.

I had the impression that that fdw_tuplestore was doing something a
bit sneaky that actually works out OK: tuples get enqueued and later
dequeued in exactly the same sequence as the after row trigger events
that need them, so even though it seems to violate at least the POLA
if not the spirit of tuplestores by storing tuples of potentially
different types in one tuplestore, nothing bad should happen.  I
suppose it was by copying that coding that Kevin finished up with the
initial bug that wCTEs mix stuff from different wCTEs and it all blows
up, because it has no similar sequencing trick: triggers with
transition tables were seeing all of them, and they weren't even
guaranteed to be of the right type.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

14 September 2017, 02:40:59

On Thu, Sep 14, 2017 at 11:00 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, as I'm digging around in trigger.c, I can't help noticing that
>> it provides a single "fdw_tuplestore" per trigger query level (a/k/a
>> trigger execution context).  I've not tried to test this, but it
>> sure looks like a wCTE like your example above, directed at two
>> separate foreign tables with triggers, would fail for exactly the
>> same reason.  That'd be a bug of pretty long standing.
>
> I had the impression that that fdw_tuplestore was doing something a
> bit sneaky that actually works out OK: tuples get enqueued and later
> dequeued in exactly the same sequence as the after row trigger events
> that need them, so even though it seems to violate at least the POLA
> if not the spirit of tuplestores by storing tuples of potentially
> different types in one tuplestore, nothing bad should happen.  I
> suppose it was by copying that coding that Kevin finished up with the
> initial bug that wCTEs mix stuff from different wCTEs and it all blows
> up, because it has no similar sequencing trick: triggers with
> transition tables were seeing all of them, and they weren't even
> guaranteed to be of the right type.

Incidentally, understanding that made me wonder why we don't have a
binary chunk-oriented in-memory-up-to-some-size-then-spill-to-disk
spooling mechanism that could be used for the trigger queue itself
(which currently doesn't know how to spill to disk and therefore can
take your server out), including holding these tuple images directly
(instead of spilling just the tuples in synchronised order with the
in-memory trigger queue).

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

14 September 2017, 02:45:46

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, as I'm digging around in trigger.c, I can't help noticing that
>> it provides a single "fdw_tuplestore" per trigger query level (a/k/a
>> trigger execution context).  I've not tried to test this, but it
>> sure looks like a wCTE like your example above, directed at two
>> separate foreign tables with triggers, would fail for exactly the
>> same reason.  That'd be a bug of pretty long standing.

> I had the impression that that fdw_tuplestore was doing something a
> bit sneaky that actually works out OK: tuples get enqueued and later
> dequeued in exactly the same sequence as the after row trigger events
> that need them, so even though it seems to violate at least the POLA
> if not the spirit of tuplestores by storing tuples of potentially
> different types in one tuplestore, nothing bad should happen.

Oh?  Now my fear level is up to 11, because it is completely trivial to
cause triggers to fire in a different order than they were enqueued.
All you need is a mix of deferrable and nondeferrable triggers.

In fact, it also seems entirely broken that a per-query-level tuplestore
is being used at all, because deferrable triggers might not get fired
until some outer query level.

[ Pokes around... ]  Hm, looks like we get around that by forbidding
constraint triggers on foreign tables, but I don't see anything in the
CREATE TRIGGER man page saying that there's such a prohibition.  And
there's certainly no comments in the source code explaining this rickety
set of requirements :-(
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

14 September 2017, 02:54:48

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Incidentally, understanding that made me wonder why we don't have a
> binary chunk-oriented in-memory-up-to-some-size-then-spill-to-disk
> spooling mechanism that could be used for the trigger queue itself
> (which currently doesn't know how to spill to disk and therefore can
> take your server out), including holding these tuple images directly
> (instead of spilling just the tuples in synchronised order with the
> in-memory trigger queue).

The past discussions about spilling the trigger queue have generally
concluded that by the time your event list was long enough to cause
serious pain, you already had a query that was never gonna complete.
That may be getting less true as time goes on, but I'm not sure ---
seems like RAM capacity is growing faster than CPU speed.  Anyway,
that's why it never got done.

Given the addition of transition tables, I suspect there will be
even less motivation to fix it: the right thing to do with mass
updates will be to use a TT with an after-statement trigger, and
that fixes it by putting the bulk data into a spillable tuplestore.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

14 September 2017, 23:43:25

On Tue, Sep 12, 2017 at 1:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Sep 12, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> OTOH, looking at the text of
>> the spec, I think it's darn hard to justify the behavior shown above.
>
> Yeah.  I assume we always fired statement triggers for each separate
> instance of the same table mentioned in a wCTE since they were
> invented.  I just confirmed that that is the case in 9.6.  That may
> not have been in the spirit of the spec, but it's hard to say because
> the spec doesn't have wCTEs IIUC and it mattered less because they
> didn't receive any data.
>
> Now that they can optionally see data resulting from modifications, it
> seems pretty hard to use this feature to build anything that consumes
> the transition data and has to be reliable (matview state,
> replication-like systems etc) if we make any choice other than (1)
> each instance of a given table fires a statement trigger separately
> and sees only the rows it touched, or (2) the statement trigger is
> fired once for all instances of a table, and sees all the transition
> tuples.  Based on the SQL spec excerpts you've highlighted, I suppose
> it seems likely that if the spec had wCTEs it would expect them to
> work like (2).

So I guess there are about 3 parts to this puzzle:

1.  Merging the transition tables when there are multiple wCTEs
referencing the same table.  Here's one idea:  Rename
MakeTransitionCaptureState() to GetTransitionCaptureState() and use a
hash table keyed by table OID in
afterTriggers.transition_capture_states[afterTriggers.query_depth] to
find the TCS for the given TriggerDesc or create it if not found, so
that all wCTEs find the same TransitionCaptureState object.  The all
existing callers continue to do what they're doing now, but they'll be
sharing TCSs appropriately with other things in the plan.  Note that
TransitionCaptureState already holds tuplestores for each operation
(INSERT, UPDATE, DELETE) so the OID of the table alone is a suitable
key for the hash table (assuming we are ignoring the column-list part
of the spec as you suggested).

2.  Hiding the fact that we implement fk CASCADE using another level
of queries.   Perhaps we could arrange for
afterTriggers.transition_capture_states[afterTriggers.query_depth] to
point to the same hash table as query_depth - 1, so that the effects
of statements at this implementation-internal level appear to the user
as part of the the level below?

3.  Merging the invocation after statement firing so that if you
updated the same table directly and also via a wCTE and also
indirectly via fk ON DELETE/UPDATE trigger then you still only get one
invocation of the after statement trigger.  Not sure exactly how...
perhaps using a flag in the TransitionCaptureState to prevent multiple
firings.  As I argued in the above-quoted email, if we've merged the
transition tables then we'll need to merge the trigger firing too or
it won't be possible to make coherent integrity, summary, replication
etc systems using TT triggers (even though that's a user-visible
change in after statement firing behaviour for wCTEs compared to
earlier releases).

Does this make any sense?

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

15 September 2017, 00:10:25

On Fri, Sep 15, 2017 at 8:43 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Note that
> TransitionCaptureState already holds tuplestores for each operation
> (INSERT, UPDATE, DELETE)

Erm, that's not quite true -- it only separates INSERT and UPDATE for
now.  It would need to be true, so it would need to gain one more to
have the full set.

> ... perhaps using a flag in the TransitionCaptureState to prevent multiple
> firings.

Of course that would need to be per-trigger, not just one flag per
TCS.  Also the change in firing rules for multiply-referenced tables
would apply also when there are no TTs involved, so perhaps TCS is not
a good place for that state.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

15 September 2017, 00:45:55

Attached is a draft patch for this.

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> 1.  Merging the transition tables when there are multiple wCTEs
> referencing the same table.  Here's one idea:  Rename
> MakeTransitionCaptureState() to GetTransitionCaptureState() and use a
> hash table keyed by table OID in
> afterTriggers.transition_capture_states[afterTriggers.query_depth] to
> find the TCS for the given TriggerDesc or create it if not found, so
> that all wCTEs find the same TransitionCaptureState object.  The all
> existing callers continue to do what they're doing now, but they'll be
> sharing TCSs appropriately with other things in the plan.  Note that
> TransitionCaptureState already holds tuplestores for each operation
> (INSERT, UPDATE, DELETE) so the OID of the table alone is a suitable
> key for the hash table (assuming we are ignoring the column-list part
> of the spec as you suggested).

It seems unsafe to merge the TCS objects themselves, because the callers
assume that they can munge the tcs_map and tcs_original_insert_tuple
fields freely without regard for any other callers.  So as I have it,
we still have a TCS for each caller, but the TCSes point at tuplestores
that can be shared across multiple callers for the same event type.
The tuplestores themselves are managed by the AfterTrigger data
structures.  Also, because the TCS structs are just throwaway per-caller
data, it's uncool to reference them in the trigger event lists.
So I replaced ats_transition_capture with two pointers to the actual
tuplestores.  That bloats AfterTriggerSharedData a bit but I think it's
okay; we don't expect a lot of those structs in a normal query.

I chose to make the persistent state (AfterTriggersTableData) independent
for each operation type.  We could have done that differently perhaps, but
it seemed more complicated and less likely to match the spec's semantics.

The INSERT ON CONFLICT UPDATE mess is handled by creating two separate
TCSes with two different underlying AfterTriggersTableData structs.
The insertion tuplestore sees only the inserted tuples, the update
tuplestores see only the updated-pre-existing tuples.  That adds a little
code to nodeModifyTable but it seems conceptually much cleaner.

> 2.  Hiding the fact that we implement fk CASCADE using another level
> of queries.   Perhaps we could arrange for
> afterTriggers.transition_capture_states[afterTriggers.query_depth] to
> point to the same hash table as query_depth - 1, so that the effects
> of statements at this implementation-internal level appear to the user
> as part of the the level below?

That already happens, because query_depth doesn't increment for an FK
enforcement query --- we never call AfterTriggerBegin/EndQuery for it.

> 3.  Merging the invocation after statement firing so that if you
> updated the same table directly and also via a wCTE and also
> indirectly via fk ON DELETE/UPDATE trigger then you still only get one
> invocation of the after statement trigger.  Not sure exactly how...

What I did here was to use the AfterTriggersTableData structs to hold
a flag saying we'd already queued statement triggers for this rel and
cmdType.  There's probably more than one way to do that, but this seemed
convenient.

One thing I don't like too much about that is that it means there are
cases where the single statement trigger firing would occur before some
AFTER ROW trigger firings.  Not sure if we promise anything about the
ordering in the docs.  It looks quite expensive/complicated to try to
make it always happen afterwards, though, and it might well be totally
impossible if triggers cause more table updates to occur.

Because MakeTransitionCaptureState now depends on the trigger query
level being active, I had to relocate the AfterTriggerBeginQuery calls
to occur before it.

In passing, I refactored the AfterTriggers data structures a bit so
that we don't need to do as many palloc calls to manage them.  Instead
of several independent arrays there's now one array of structs.

            regards, tom lane

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..c6fa445 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** CopyFrom(CopyState cstate)
*** 2432,2443 ****
      /* Triggers might need a slot as well */
      estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);

      /*
       * If there are any triggers with transition tables on the named relation,
       * we need to be prepared to capture transition tuples.
       */
      cstate->transition_capture =
!         MakeTransitionCaptureState(cstate->rel->trigdesc);

      /*
       * If the named relation is a partitioned table, initialize state for
--- 2432,2448 ----
      /* Triggers might need a slot as well */
      estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);

+     /* Prepare to catch AFTER triggers. */
+     AfterTriggerBeginQuery();
+
      /*
       * If there are any triggers with transition tables on the named relation,
       * we need to be prepared to capture transition tuples.
       */
      cstate->transition_capture =
!         MakeTransitionCaptureState(cstate->rel->trigdesc,
!                                    RelationGetRelid(cstate->rel),
!                                    CMD_INSERT);

      /*
       * If the named relation is a partitioned table, initialize state for
*************** CopyFrom(CopyState cstate)
*** 2513,2521 ****
          bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
      }

-     /* Prepare to catch AFTER triggers. */
-     AfterTriggerBeginQuery();
-
      /*
       * Check BEFORE STATEMENT insertion triggers. It's debatable whether we
       * should do this for COPY, since it's not really an "INSERT" statement as
--- 2518,2523 ----
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 269c9e1..ee2ff04 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** CreateTrigger(CreateTrigStmt *stmt, cons
*** 234,239 ****
--- 234,244 ----
                              RelationGetRelationName(rel)),
                       errdetail("Foreign tables cannot have TRUNCATE triggers.")));

+         /*
+          * We disallow constraint triggers to protect the assumption that
+          * triggers on FKs can't be deferred.  See notes with AfterTriggers
+          * data structures, below.
+          */
          if (stmt->isconstraint)
              ereport(ERROR,
                      (errcode(ERRCODE_WRONG_OBJECT_TYPE),
*************** CreateTrigger(CreateTrigStmt *stmt, cons
*** 418,423 ****
--- 423,448 ----
                          (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                           errmsg("transition tables cannot be specified for triggers with more than one event")));

+             /*
+              * We currently don't allow column-specific triggers with
+              * transition tables.  Per spec, that seems to require
+              * accumulating separate transition tables for each combination of
+              * columns, which is a lot of work for a rather marginal feature.
+              */
+             if (stmt->columns != NIL)
+                 ereport(ERROR,
+                         (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                          errmsg("transition tables cannot be specified for triggers with column lists")));
+
+             /*
+              * We disallow constraint triggers with transition tables, to
+              * protect the assumption that such triggers can't be deferred.
+              * See notes with AfterTriggers data structures, below.
+              *
+              * Currently this is enforced by the grammar, so just Assert here.
+              */
+             Assert(!stmt->isconstraint);
+
              if (tt->isNew)
              {
                  if (!(TRIGGER_FOR_INSERT(tgtype) ||
*************** FindTriggerIncompatibleWithInheritance(T
*** 2086,2181 ****
  }

  /*
-  * Make a TransitionCaptureState object from a given TriggerDesc.  The
-  * resulting object holds the flags which control whether transition tuples
-  * are collected when tables are modified, and the tuplestores themselves.
-  * Note that we copy the flags from a parent table into this struct (rather
-  * than using each relation's TriggerDesc directly) so that we can use it to
-  * control the collection of transition tuples from child tables.
-  *
-  * If there are no triggers with transition tables configured for 'trigdesc',
-  * then return NULL.
-  *
-  * The resulting object can be passed to the ExecAR* functions.  The caller
-  * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
-  * with child tables.
-  */
- TransitionCaptureState *
- MakeTransitionCaptureState(TriggerDesc *trigdesc)
- {
-     TransitionCaptureState *state = NULL;
-
-     if (trigdesc != NULL &&
-         (trigdesc->trig_delete_old_table || trigdesc->trig_update_old_table ||
-          trigdesc->trig_update_new_table || trigdesc->trig_insert_new_table))
-     {
-         MemoryContext oldcxt;
-         ResourceOwner saveResourceOwner;
-
-         /*
-          * Normally DestroyTransitionCaptureState should be called after
-          * executing all AFTER triggers for the current statement.
-          *
-          * To handle error cleanup, TransitionCaptureState and the tuplestores
-          * it contains will live in the current [sub]transaction's memory
-          * context.  Likewise for the current resource owner, because we also
-          * want to clean up temporary files spilled to disk by the tuplestore
-          * in that scenario.  This scope is sufficient, because AFTER triggers
-          * with transition tables cannot be deferred (only constraint triggers
-          * can be deferred, and constraint triggers cannot have transition
-          * tables).  The AFTER trigger queue may contain pointers to this
-          * TransitionCaptureState, but any such entries will be processed or
-          * discarded before the end of the current [sub]transaction.
-          *
-          * If a future release allows deferred triggers with transition
-          * tables, we'll need to reconsider the scope of the
-          * TransitionCaptureState object.
-          */
-         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
-         saveResourceOwner = CurrentResourceOwner;
-
-         state = (TransitionCaptureState *)
-             palloc0(sizeof(TransitionCaptureState));
-         state->tcs_delete_old_table = trigdesc->trig_delete_old_table;
-         state->tcs_update_old_table = trigdesc->trig_update_old_table;
-         state->tcs_update_new_table = trigdesc->trig_update_new_table;
-         state->tcs_insert_new_table = trigdesc->trig_insert_new_table;
-         PG_TRY();
-         {
-             CurrentResourceOwner = CurTransactionResourceOwner;
-             if (trigdesc->trig_delete_old_table || trigdesc->trig_update_old_table)
-                 state->tcs_old_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-             if (trigdesc->trig_insert_new_table)
-                 state->tcs_insert_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-             if (trigdesc->trig_update_new_table)
-                 state->tcs_update_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-         }
-         PG_CATCH();
-         {
-             CurrentResourceOwner = saveResourceOwner;
-             PG_RE_THROW();
-         }
-         PG_END_TRY();
-         CurrentResourceOwner = saveResourceOwner;
-         MemoryContextSwitchTo(oldcxt);
-     }
-
-     return state;
- }
-
- void
- DestroyTransitionCaptureState(TransitionCaptureState *tcs)
- {
-     if (tcs->tcs_insert_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_insert_tuplestore);
-     if (tcs->tcs_update_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_update_tuplestore);
-     if (tcs->tcs_old_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_old_tuplestore);
-     pfree(tcs);
- }
-
- /*
   * Call a trigger function.
   *
   *        trigdata: trigger descriptor.
--- 2111,2116 ----
*************** TriggerEnabled(EState *estate, ResultRel
*** 3338,3346 ****
   * during the current transaction tree.  (BEFORE triggers are fired
   * immediately so we don't need any persistent state about them.)  The struct
   * and most of its subsidiary data are kept in TopTransactionContext; however
!  * the individual event records are kept in a separate sub-context.  This is
!  * done mainly so that it's easy to tell from a memory context dump how much
!  * space is being eaten by trigger events.
   *
   * Because the list of pending events can grow large, we go to some
   * considerable effort to minimize per-event memory consumption.  The event
--- 3273,3283 ----
   * during the current transaction tree.  (BEFORE triggers are fired
   * immediately so we don't need any persistent state about them.)  The struct
   * and most of its subsidiary data are kept in TopTransactionContext; however
!  * some data that can be discarded sooner appears in the CurTransactionContext
!  * of the relevant subtransaction.  Also, the individual event records are
!  * kept in a separate sub-context of TopTransactionContext.  This is done
!  * mainly so that it's easy to tell from a memory context dump how much space
!  * is being eaten by trigger events.
   *
   * Because the list of pending events can grow large, we go to some
   * considerable effort to minimize per-event memory consumption.  The event
*************** typedef SetConstraintStateData *SetConst
*** 3400,3405 ****
--- 3337,3349 ----
   * tuple(s).  This permits storing tuples once regardless of the number of
   * row-level triggers on a foreign table.
   *
+  * Note that we need triggers on foreign tables to be fired in exactly the
+  * order they were queued, so that the tuples come out of the tuplestore in
+  * the right order.  To ensure that, we forbid deferrable (constraint)
+  * triggers on foreign tables.  This also ensures that such triggers do not
+  * get deferred into outer trigger query levels, meaning that it's okay to
+  * destroy the tuplestore at the end of the query level.
+  *
   * Statement-level triggers always bear AFTER_TRIGGER_1CTID, though they
   * require no ctid field.  We lack the flag bit space to neatly represent that
   * distinct case, and it seems unlikely to be worth much trouble.
*************** typedef struct AfterTriggerSharedData
*** 3433,3439 ****
      Oid            ats_tgoid;        /* the trigger's ID */
      Oid            ats_relid;        /* the relation it's on */
      CommandId    ats_firing_id;    /* ID for firing cycle */
!     TransitionCaptureState *ats_transition_capture;
  } AfterTriggerSharedData;

  typedef struct AfterTriggerEventData *AfterTriggerEvent;
--- 3377,3384 ----
      Oid            ats_tgoid;        /* the trigger's ID */
      Oid            ats_relid;        /* the relation it's on */
      CommandId    ats_firing_id;    /* ID for firing cycle */
!     Tuplestorestate *ats_old_tuplestore;    /* possible OLD transition table */
!     Tuplestorestate *ats_new_tuplestore;    /* possible NEW transition table */
  } AfterTriggerSharedData;

  typedef struct AfterTriggerEventData *AfterTriggerEvent;
*************** typedef struct AfterTriggerEventList
*** 3529,3588 ****
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
   *
!  * query_stack[query_depth] is a list of AFTER trigger events queued by the
!  * current query (and the query_stack entries below it are lists of trigger
!  * events queued by calling queries).  None of these are valid until the
!  * matching AfterTriggerEndQuery call occurs.  At that point we fire
!  * immediate-mode triggers, and append any deferred events to the main events
!  * list.
   *
!  * fdw_tuplestores[query_depth] is a tuplestore containing the foreign tuples
!  * needed for the current query.
   *
!  * maxquerydepth is just the allocated length of query_stack and the
!  * tuplestores.
   *
!  * state_stack is a stack of pointers to saved copies of the SET CONSTRAINTS
!  * state data; each subtransaction level that modifies that state first
   * saves a copy, which we use to restore the state if we abort.
   *
!  * events_stack is a stack of copies of the events head/tail pointers,
   * which we use to restore those values during subtransaction abort.
   *
!  * depth_stack is a stack of copies of subtransaction-start-time query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * firing_stack is a stack of copies of subtransaction-start-time
!  * firing_counter.  We use this to recognize which deferred triggers were
!  * fired (or marked for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
!  * index in these stacks.  maxtransdepth is the number of allocated entries in
!  * each stack.  (By not keeping our own stack pointer, we can avoid trouble
   * in cases where errors during subxact abort cause multiple invocations
   * of AfterTriggerEndSubXact() at the same nesting depth.)
   */
  typedef struct AfterTriggersData
  {
      CommandId    firing_counter; /* next firing ID to assign */
      SetConstraintState state;    /* the active S C state */
      AfterTriggerEventList events;    /* deferred-event list */
-     int            query_depth;    /* current query list index */
-     AfterTriggerEventList *query_stack; /* events pending from each query */
-     Tuplestorestate **fdw_tuplestores;    /* foreign tuples for one row from
-                                          * each query */
-     int            maxquerydepth;    /* allocated len of above array */
      MemoryContext event_cxt;    /* memory context for events, if any */

!     /* these fields are just for resetting at subtrans abort: */

!     SetConstraintState *state_stack;    /* stacked S C states */
!     AfterTriggerEventList *events_stack;    /* stacked list pointers */
!     int           *depth_stack;    /* stacked query_depths */
!     CommandId  *firing_stack;    /* stacked firing_counters */
!     int            maxtransdepth;    /* allocated len of above arrays */
  } AfterTriggersData;

  static AfterTriggersData afterTriggers;

  static void AfterTriggerExecute(AfterTriggerEvent event,
--- 3474,3579 ----
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
   *
!  * query_stack[query_depth] is the per-query-level data, including these fields:
   *
!  * events is a list of AFTER trigger events queued by the current query.
!  * None of these are valid until the matching AfterTriggerEndQuery call
!  * occurs.  At that point we fire immediate-mode triggers, and append any
!  * deferred events to the main events list.
   *
!  * fdw_tuplestore is a tuplestore containing the foreign-table tuples
!  * needed by events queued by the current query.  (Note: we use just one
!  * tuplestore even though more than one foreign table might be involved.
!  * This is okay because tuplestores don't really care what's in the tuples
!  * they store; but it's possible that someday it'd break.)
   *
!  * tables is a List of AfterTriggersTableData structs for target tables
!  * of the current query (see below).
!  *
!  * maxquerydepth is just the allocated length of query_stack.
!  *
!  * trans_stack holds per-subtransaction data, including these fields:
!  *
!  * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS
!  * state data.  Each subtransaction level that modifies that state first
   * saves a copy, which we use to restore the state if we abort.
   *
!  * events is a copy of the events head/tail pointers,
   * which we use to restore those values during subtransaction abort.
   *
!  * query_depth is the subtransaction-start-time value of query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * firing_counter is the subtransaction-start-time value of firing_counter.
!  * We use this to recognize which deferred triggers were fired (or marked
!  * for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
!  * index in trans_stack.  maxtransdepth is the number of allocated entries in
!  * trans_stack.  (By not keeping our own stack pointer, we can avoid trouble
   * in cases where errors during subxact abort cause multiple invocations
   * of AfterTriggerEndSubXact() at the same nesting depth.)
+  *
+  * We create an AfterTriggersTableData struct for each target table of the
+  * current query, and each operation mode (INSERT/UPDATE/DELETE), that has
+  * either transition tables or AFTER STATEMENT triggers.  This is used to
+  * hold the relevant transition tables, as well as a flag showing whether
+  * we already queued the AFTER STATEMENT triggers.  We need the flag so
+  * that cases like multiple FK enforcement sub-queries targeting the same
+  * table don't fire such triggers more than once.  These structs, along with
+  * the transition table tuplestores, live in the (sub)transaction's
+  * CurTransactionContext.  That's sufficient lifespan because we don't allow
+  * transition tables to be used by deferrable triggers, so they only need
+  * to survive until AfterTriggerEndQuery.
   */
+ typedef struct AfterTriggersQueryData AfterTriggersQueryData;
+ typedef struct AfterTriggersTransData AfterTriggersTransData;
+ typedef struct AfterTriggersTableData AfterTriggersTableData;
+
  typedef struct AfterTriggersData
  {
      CommandId    firing_counter; /* next firing ID to assign */
      SetConstraintState state;    /* the active S C state */
      AfterTriggerEventList events;    /* deferred-event list */
      MemoryContext event_cxt;    /* memory context for events, if any */

!     /* per-query-level data: */
!     AfterTriggersQueryData *query_stack;    /* array of structs shown above */
!     int            query_depth;    /* current index in above array */
!     int            maxquerydepth;    /* allocated len of above array */

!     /* per-subtransaction-level data: */
!     AfterTriggersTransData *trans_stack;    /* array of structs shown above */
!     int            maxtransdepth;    /* allocated len of above array */
  } AfterTriggersData;

+ struct AfterTriggersQueryData
+ {
+     AfterTriggerEventList events;    /* events pending from this query */
+     Tuplestorestate *fdw_tuplestore;    /* foreign tuples for said events */
+     List       *tables;            /* list of AfterTriggersTableData */
+ };
+
+ struct AfterTriggersTransData
+ {
+     /* these fields are just for resetting at subtrans abort: */
+     SetConstraintState state;    /* saved S C state, or NULL if not yet saved */
+     AfterTriggerEventList events;    /* saved list pointer */
+     int            query_depth;    /* saved query_depth */
+     CommandId    firing_counter; /* saved firing_counter */
+ };
+
+ struct AfterTriggersTableData
+ {
+     /* relid + cmdType form the lookup key for these structs: */
+     Oid            relid;            /* target table's OID */
+     CmdType        cmdType;        /* event type, CMD_INSERT/UPDATE/DELETE */
+     bool        closed;            /* true when no longer OK to add tuples */
+     bool        stmt_trig_done; /* did we already queue stmt-level triggers? */
+     Tuplestorestate *old_tuplestore;    /* "old" transition table, if any */
+     Tuplestorestate *new_tuplestore;    /* "new" transition table, if any */
+ };
+
  static AfterTriggersData afterTriggers;

  static void AfterTriggerExecute(AfterTriggerEvent event,
*************** static void AfterTriggerExecute(AfterTri
*** 3591,3598 ****
                      Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2,
!                     TransitionCaptureState *transition_capture);
  static SetConstraintState SetConstraintStateCreate(int numalloc);
  static SetConstraintState SetConstraintStateCopy(SetConstraintState state);
  static SetConstraintState SetConstraintStateAddItem(SetConstraintState state,
--- 3582,3591 ----
                      Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2);
! static AfterTriggersTableData *GetAfterTriggersTableData(Oid relid,
!                           CmdType cmdType);
! static void AfterTriggerFreeQuery(AfterTriggersQueryData *qs);
  static SetConstraintState SetConstraintStateCreate(int numalloc);
  static SetConstraintState SetConstraintStateCopy(SetConstraintState state);
  static SetConstraintState SetConstraintStateAddItem(SetConstraintState state,
*************** static SetConstraintState SetConstraintS
*** 3600,3628 ****


  /*
!  * Gets a current query transition tuplestore and initializes it if necessary.
   */
  static Tuplestorestate *
! GetTriggerTransitionTuplestore(Tuplestorestate **tss)
  {
      Tuplestorestate *ret;

!     ret = tss[afterTriggers.query_depth];
      if (ret == NULL)
      {
          MemoryContext oldcxt;
          ResourceOwner saveResourceOwner;

          /*
!          * Make the tuplestore valid until end of transaction.  This is the
!          * allocation lifespan of the associated events list, but we really
           * only need it until AfterTriggerEndQuery().
           */
!         oldcxt = MemoryContextSwitchTo(TopTransactionContext);
          saveResourceOwner = CurrentResourceOwner;
          PG_TRY();
          {
!             CurrentResourceOwner = TopTransactionResourceOwner;
              ret = tuplestore_begin_heap(false, false, work_mem);
          }
          PG_CATCH();
--- 3593,3621 ----


  /*
!  * Get the FDW tuplestore for the current trigger query level, creating it
!  * if necessary.
   */
  static Tuplestorestate *
! GetCurrentFDWTuplestore(void)
  {
      Tuplestorestate *ret;

!     ret = afterTriggers.query_stack[afterTriggers.query_depth].fdw_tuplestore;
      if (ret == NULL)
      {
          MemoryContext oldcxt;
          ResourceOwner saveResourceOwner;

          /*
!          * Make the tuplestore valid until end of subtransaction.  We really
           * only need it until AfterTriggerEndQuery().
           */
!         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
          saveResourceOwner = CurrentResourceOwner;
          PG_TRY();
          {
!             CurrentResourceOwner = CurTransactionResourceOwner;
              ret = tuplestore_begin_heap(false, false, work_mem);
          }
          PG_CATCH();
*************** GetTriggerTransitionTuplestore(Tuplestor
*** 3634,3640 ****
          CurrentResourceOwner = saveResourceOwner;
          MemoryContextSwitchTo(oldcxt);

!         tss[afterTriggers.query_depth] = ret;
      }

      return ret;
--- 3627,3633 ----
          CurrentResourceOwner = saveResourceOwner;
          MemoryContextSwitchTo(oldcxt);

!         afterTriggers.query_stack[afterTriggers.query_depth].fdw_tuplestore = ret;
      }

      return ret;
*************** afterTriggerAddEvent(AfterTriggerEventLi
*** 3780,3786 ****
          if (newshared->ats_tgoid == evtshared->ats_tgoid &&
              newshared->ats_relid == evtshared->ats_relid &&
              newshared->ats_event == evtshared->ats_event &&
!             newshared->ats_transition_capture == evtshared->ats_transition_capture &&
              newshared->ats_firing_id == 0)
              break;
      }
--- 3773,3780 ----
          if (newshared->ats_tgoid == evtshared->ats_tgoid &&
              newshared->ats_relid == evtshared->ats_relid &&
              newshared->ats_event == evtshared->ats_event &&
!             newshared->ats_old_tuplestore == evtshared->ats_old_tuplestore &&
!             newshared->ats_new_tuplestore == evtshared->ats_new_tuplestore &&
              newshared->ats_firing_id == 0)
              break;
      }
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 3892,3899 ****
                      FmgrInfo *finfo, Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2,
!                     TransitionCaptureState *transition_capture)
  {
      AfterTriggerShared evtshared = GetTriggerSharedData(event);
      Oid            tgoid = evtshared->ats_tgoid;
--- 3886,3892 ----
                      FmgrInfo *finfo, Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2)
  {
      AfterTriggerShared evtshared = GetTriggerSharedData(event);
      Oid            tgoid = evtshared->ats_tgoid;
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 3934,3942 ****
      {
          case AFTER_TRIGGER_FDW_FETCH:
              {
!                 Tuplestorestate *fdw_tuplestore =
!                 GetTriggerTransitionTuplestore
!                 (afterTriggers.fdw_tuplestores);

                  if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
                                               trig_tuple_slot1))
--- 3927,3933 ----
      {
          case AFTER_TRIGGER_FDW_FETCH:
              {
!                 Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();

                  if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
                                               trig_tuple_slot1))
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 4008,4043 ****
      /*
       * Set up the tuplestore information.
       */
!     LocTriggerData.tg_oldtable = LocTriggerData.tg_newtable = NULL;
!     if (transition_capture != NULL)
!     {
!         if (LocTriggerData.tg_trigger->tgoldtable)
!             LocTriggerData.tg_oldtable = transition_capture->tcs_old_tuplestore;
!         if (LocTriggerData.tg_trigger->tgnewtable)
!         {
!             /*
!              * Currently a trigger with transition tables may only be defined
!              * for a single event type (here AFTER INSERT or AFTER UPDATE, but
!              * not AFTER INSERT OR ...).
!              */
!             Assert((TRIGGER_FOR_INSERT(LocTriggerData.tg_trigger->tgtype) != 0) ^
!                    (TRIGGER_FOR_UPDATE(LocTriggerData.tg_trigger->tgtype) != 0));

!             /*
!              * Show either the insert or update new tuple images, depending on
!              * which event type the trigger was registered for.  A single
!              * statement may have produced both in the case of INSERT ... ON
!              * CONFLICT ... DO UPDATE, and in that case the event determines
!              * which tuplestore the trigger sees as the NEW TABLE.
!              */
!             if (TRIGGER_FOR_INSERT(LocTriggerData.tg_trigger->tgtype))
!                 LocTriggerData.tg_newtable =
!                     transition_capture->tcs_insert_tuplestore;
!             else
!                 LocTriggerData.tg_newtable =
!                     transition_capture->tcs_update_tuplestore;
!         }
!     }

      /*
       * Setup the remaining trigger information
--- 3999,4013 ----
      /*
       * Set up the tuplestore information.
       */
!     if (LocTriggerData.tg_trigger->tgoldtable)
!         LocTriggerData.tg_oldtable = evtshared->ats_old_tuplestore;
!     else
!         LocTriggerData.tg_oldtable = NULL;

!     if (LocTriggerData.tg_trigger->tgnewtable)
!         LocTriggerData.tg_newtable = evtshared->ats_new_tuplestore;
!     else
!         LocTriggerData.tg_newtable = NULL;

      /*
       * Setup the remaining trigger information
*************** afterTriggerInvokeEvents(AfterTriggerEve
*** 4245,4252 ****
                   * won't try to re-fire it.
                   */
                  AfterTriggerExecute(event, rel, trigdesc, finfo, instr,
!                                     per_tuple_context, slot1, slot2,
!                                     evtshared->ats_transition_capture);

                  /*
                   * Mark the event as done.
--- 4215,4221 ----
                   * won't try to re-fire it.
                   */
                  AfterTriggerExecute(event, rel, trigdesc, finfo, instr,
!                                     per_tuple_context, slot1, slot2);

                  /*
                   * Mark the event as done.
*************** afterTriggerInvokeEvents(AfterTriggerEve
*** 4296,4301 ****
--- 4265,4433 ----
  }


+ /*
+  * GetAfterTriggersTableData
+  *
+  * Find or create an AfterTriggersTableData struct for the specified
+  * trigger event (relation + operation type).  Ignore existing structs
+  * marked "closed"; we don't want to put any additional tuples into them,
+  * nor change their triggers-fired flags.
+  *
+  * Note: the AfterTriggersTableData list is allocated in the current
+  * (sub)transaction's CurTransactionContext.  This is OK because
+  * we don't need it to live past AfterTriggerEndQuery.
+  */
+ static AfterTriggersTableData *
+ GetAfterTriggersTableData(Oid relid, CmdType cmdType)
+ {
+     AfterTriggersTableData *table;
+     AfterTriggersQueryData *qs;
+     MemoryContext oldcxt;
+     ListCell   *lc;
+
+     /* Caller should have ensured query_depth is OK. */
+     Assert(afterTriggers.query_depth >= 0 &&
+            afterTriggers.query_depth < afterTriggers.maxquerydepth);
+     qs = &afterTriggers.query_stack[afterTriggers.query_depth];
+
+     foreach(lc, qs->tables)
+     {
+         table = (AfterTriggersTableData *) lfirst(lc);
+         if (table->relid == relid && table->cmdType == cmdType &&
+             !table->closed)
+             return table;
+     }
+
+     oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+
+     table = (AfterTriggersTableData *) palloc0(sizeof(AfterTriggersTableData));
+     table->relid = relid;
+     table->cmdType = cmdType;
+     qs->tables = lappend(qs->tables, table);
+
+     MemoryContextSwitchTo(oldcxt);
+
+     return table;
+ }
+
+
+ /*
+  * MakeTransitionCaptureState
+  *
+  * Make a TransitionCaptureState object for the given TriggerDesc, target
+  * relation, and operation type.  The TCS object holds all the state needed
+  * to decide whether to capture tuples in transition tables.
+  *
+  * If there are no triggers in 'trigdesc' that request relevant transition
+  * tables, then return NULL.
+  *
+  * The resulting object can be passed to the ExecAR* functions.  The caller
+  * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
+  * with child tables.
+  *
+  * Note that we copy the flags from a parent table into this struct (rather
+  * than subsequently using the relation's TriggerDesc directly) so that we can
+  * use it to control collection of transition tuples from child tables.
+  *
+  * Per SQL spec, all operations of the same kind (INSERT/UPDATE/DELETE)
+  * on the same table during one query should share one transition table.
+  * Therefore, the Tuplestores are owned by an AfterTriggersTableData struct
+  * looked up using the table OID + CmdType, and are merely referenced by
+  * the TransitionCaptureState objects we hand out to callers.
+  */
+ TransitionCaptureState *
+ MakeTransitionCaptureState(TriggerDesc *trigdesc, Oid relid, CmdType cmdType)
+ {
+     TransitionCaptureState *state;
+     bool        need_old,
+                 need_new;
+     AfterTriggersTableData *table;
+     MemoryContext oldcxt;
+     ResourceOwner saveResourceOwner;
+
+     if (trigdesc == NULL)
+         return NULL;
+
+     /* Detect which table(s) we need. */
+     switch (cmdType)
+     {
+         case CMD_INSERT:
+             need_old = false;
+             need_new = trigdesc->trig_insert_new_table;
+             break;
+         case CMD_UPDATE:
+             need_old = trigdesc->trig_update_old_table;
+             need_new = trigdesc->trig_update_new_table;
+             break;
+         case CMD_DELETE:
+             need_old = trigdesc->trig_delete_old_table;
+             need_new = false;
+             break;
+         default:
+             elog(ERROR, "unexpected CmdType: %d", (int) cmdType);
+             need_old = need_new = false;    /* keep compiler quiet */
+             break;
+     }
+     if (!need_old && !need_new)
+         return NULL;
+
+     /* Check state, like AfterTriggerSaveEvent. */
+     if (afterTriggers.query_depth < 0)
+         elog(ERROR, "MakeTransitionCaptureState() called outside of query");
+
+     /* Be sure we have enough space to record events at this query depth. */
+     if (afterTriggers.query_depth >= afterTriggers.maxquerydepth)
+         AfterTriggerEnlargeQueryState();
+
+     /*
+      * Find or create an AfterTriggersTableData struct to hold the
+      * tuplestore(s).  If there's a matching struct but it's marked closed,
+      * ignore it; we need a newer one.
+      *
+      * Note: the AfterTriggersTableData list, as well as the tuplestores, are
+      * allocated in the current (sub)transaction's CurTransactionContext, and
+      * the tuplestores are managed by the (sub)transaction's resource owner.
+      * This is sufficient lifespan because we do not allow triggers using
+      * transition tables to be deferrable; they will be fired during
+      * AfterTriggerEndQuery, after which it's okay to delete the data.
+      */
+     table = GetAfterTriggersTableData(relid, cmdType);
+
+     /* Now create required tuplestore(s), if we don't have them already. */
+     oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+     saveResourceOwner = CurrentResourceOwner;
+     PG_TRY();
+     {
+         CurrentResourceOwner = CurTransactionResourceOwner;
+         if (need_old && table->old_tuplestore == NULL)
+             table->old_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+         if (need_new && table->new_tuplestore == NULL)
+             table->new_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+     }
+     PG_CATCH();
+     {
+         CurrentResourceOwner = saveResourceOwner;
+         PG_RE_THROW();
+     }
+     PG_END_TRY();
+     CurrentResourceOwner = saveResourceOwner;
+     MemoryContextSwitchTo(oldcxt);
+
+     /* Now build the TransitionCaptureState struct, in caller's context */
+     state = (TransitionCaptureState *) palloc0(sizeof(TransitionCaptureState));
+     state->tcs_delete_old_table = trigdesc->trig_delete_old_table;
+     state->tcs_update_old_table = trigdesc->trig_update_old_table;
+     state->tcs_update_new_table = trigdesc->trig_update_new_table;
+     state->tcs_insert_new_table = trigdesc->trig_insert_new_table;
+     if (need_old)
+         state->tcs_old_tuplestore = table->old_tuplestore;
+     if (need_new)
+         state->tcs_new_tuplestore = table->new_tuplestore;
+
+     return state;
+ }
+
+
  /* ----------
   * AfterTriggerBeginXact()
   *
*************** AfterTriggerBeginXact(void)
*** 4319,4332 ****
       */
      Assert(afterTriggers.state == NULL);
      Assert(afterTriggers.query_stack == NULL);
-     Assert(afterTriggers.fdw_tuplestores == NULL);
      Assert(afterTriggers.maxquerydepth == 0);
      Assert(afterTriggers.event_cxt == NULL);
      Assert(afterTriggers.events.head == NULL);
!     Assert(afterTriggers.state_stack == NULL);
!     Assert(afterTriggers.events_stack == NULL);
!     Assert(afterTriggers.depth_stack == NULL);
!     Assert(afterTriggers.firing_stack == NULL);
      Assert(afterTriggers.maxtransdepth == 0);
  }

--- 4451,4460 ----
       */
      Assert(afterTriggers.state == NULL);
      Assert(afterTriggers.query_stack == NULL);
      Assert(afterTriggers.maxquerydepth == 0);
      Assert(afterTriggers.event_cxt == NULL);
      Assert(afterTriggers.events.head == NULL);
!     Assert(afterTriggers.trans_stack == NULL);
      Assert(afterTriggers.maxtransdepth == 0);
  }

*************** AfterTriggerBeginQuery(void)
*** 4362,4370 ****
  void
  AfterTriggerEndQuery(EState *estate)
  {
-     AfterTriggerEventList *events;
-     Tuplestorestate *fdw_tuplestore;
-
      /* Must be inside a query, too */
      Assert(afterTriggers.query_depth >= 0);

--- 4490,4495 ----
*************** AfterTriggerEndQuery(EState *estate)
*** 4393,4430 ****
       * will instead fire any triggers in a dedicated query level.  Foreign key
       * enforcement triggers do add to the current query level, thanks to their
       * passing fire_triggers = false to SPI_execute_snapshot().  Other
!      * C-language triggers might do likewise.  Be careful here: firing a
!      * trigger could result in query_stack being repalloc'd, so we can't save
!      * its address across afterTriggerInvokeEvents calls.
       *
       * If we find no firable events, we don't have to increment
       * firing_counter.
       */
      for (;;)
      {
!         events = &afterTriggers.query_stack[afterTriggers.query_depth];
!         if (afterTriggerMarkEvents(events, &afterTriggers.events, true))
          {
              CommandId    firing_id = afterTriggers.firing_counter++;

              /* OK to delete the immediate events after processing them */
!             if (afterTriggerInvokeEvents(events, firing_id, estate, true))
                  break;            /* all fired */
          }
          else
              break;
      }

!     /* Release query-local storage for events, including tuplestore if any */
!     fdw_tuplestore = afterTriggers.fdw_tuplestores[afterTriggers.query_depth];
!     if (fdw_tuplestore)
      {
!         tuplestore_end(fdw_tuplestore);
!         afterTriggers.fdw_tuplestores[afterTriggers.query_depth] = NULL;
      }
-     afterTriggerFreeEventList(&afterTriggers.query_stack[afterTriggers.query_depth]);

!     afterTriggers.query_depth--;
  }


--- 4518,4616 ----
       * will instead fire any triggers in a dedicated query level.  Foreign key
       * enforcement triggers do add to the current query level, thanks to their
       * passing fire_triggers = false to SPI_execute_snapshot().  Other
!      * C-language triggers might do likewise.
       *
       * If we find no firable events, we don't have to increment
       * firing_counter.
       */
      for (;;)
      {
!         AfterTriggersQueryData *qs;
!         ListCell   *lc;
!
!         /*
!          * Firing a trigger could result in query_stack being repalloc'd, so
!          * we must recalculate qs after each afterTriggerInvokeEvents call.
!          */
!         qs = &afterTriggers.query_stack[afterTriggers.query_depth];
!
!         /*
!          * Before each firing cycle, mark all existing transition tables
!          * "closed", so that their contents can't change anymore.  If someone
!          * causes more updates, they'll go into new transition tables.
!          */
!         foreach(lc, qs->tables)
!         {
!             AfterTriggersTableData *table = (AfterTriggersTableData *) lfirst(lc);
!
!             table->closed = true;
!         }
!
!         if (afterTriggerMarkEvents(&qs->events, &afterTriggers.events, true))
          {
              CommandId    firing_id = afterTriggers.firing_counter++;

              /* OK to delete the immediate events after processing them */
!             if (afterTriggerInvokeEvents(&qs->events, firing_id, estate, true))
                  break;            /* all fired */
          }
          else
              break;
      }

!     /* Release query-level-local storage, including tuplestores if any */
!     AfterTriggerFreeQuery(&afterTriggers.query_stack[afterTriggers.query_depth]);
!
!     afterTriggers.query_depth--;
! }
!
!
! /*
!  * AfterTriggerFreeQuery
!  *    Release subsidiary storage for a trigger query level.
!  *    This includes closing down tuplestores.
!  *    Note: it's important for this to be safe if interrupted by an error
!  *    and then called again for the same query level.
!  */
! static void
! AfterTriggerFreeQuery(AfterTriggersQueryData *qs)
! {
!     Tuplestorestate *ts;
!     List       *tables;
!     ListCell   *lc;
!
!     /* Drop the trigger events */
!     afterTriggerFreeEventList(&qs->events);
!
!     /* Drop FDW tuplestore if any */
!     ts = qs->fdw_tuplestore;
!     qs->fdw_tuplestore = NULL;
!     if (ts)
!         tuplestore_end(ts);
!
!     /* Release per-table subsidiary storage */
!     tables = qs->tables;
!     foreach(lc, tables)
      {
!         AfterTriggersTableData *table = (AfterTriggersTableData *) lfirst(lc);
!
!         ts = table->old_tuplestore;
!         table->old_tuplestore = NULL;
!         if (ts)
!             tuplestore_end(ts);
!         ts = table->new_tuplestore;
!         table->new_tuplestore = NULL;
!         if (ts)
!             tuplestore_end(ts);
      }

!     /*
!      * Now free the AfterTriggersTableData structs and list cells.  Reset list
!      * pointer first; if list_free_deep somehow gets an error, better to leak
!      * that storage than have an infinite loop.
!      */
!     qs->tables = NIL;
!     list_free_deep(tables);
  }


*************** AfterTriggerEndXact(bool isCommit)
*** 4521,4530 ****
       * large, we let the eventual reset of TopTransactionContext free the
       * memory instead of doing it here.
       */
!     afterTriggers.state_stack = NULL;
!     afterTriggers.events_stack = NULL;
!     afterTriggers.depth_stack = NULL;
!     afterTriggers.firing_stack = NULL;
      afterTriggers.maxtransdepth = 0;


--- 4707,4713 ----
       * large, we let the eventual reset of TopTransactionContext free the
       * memory instead of doing it here.
       */
!     afterTriggers.trans_stack = NULL;
      afterTriggers.maxtransdepth = 0;


*************** AfterTriggerEndXact(bool isCommit)
*** 4534,4540 ****
       * memory here.
       */
      afterTriggers.query_stack = NULL;
-     afterTriggers.fdw_tuplestores = NULL;
      afterTriggers.maxquerydepth = 0;
      afterTriggers.state = NULL;

--- 4717,4722 ----
*************** AfterTriggerBeginSubXact(void)
*** 4553,4600 ****
      int            my_level = GetCurrentTransactionNestLevel();

      /*
!      * Allocate more space in the stacks if needed.  (Note: because the
       * minimum nest level of a subtransaction is 2, we waste the first couple
!      * entries of each array; not worth the notational effort to avoid it.)
       */
      while (my_level >= afterTriggers.maxtransdepth)
      {
          if (afterTriggers.maxtransdepth == 0)
          {
!             MemoryContext old_cxt;
!
!             old_cxt = MemoryContextSwitchTo(TopTransactionContext);
!
! #define DEFTRIG_INITALLOC 8
!             afterTriggers.state_stack = (SetConstraintState *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(SetConstraintState));
!             afterTriggers.events_stack = (AfterTriggerEventList *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(AfterTriggerEventList));
!             afterTriggers.depth_stack = (int *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(int));
!             afterTriggers.firing_stack = (CommandId *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(CommandId));
!             afterTriggers.maxtransdepth = DEFTRIG_INITALLOC;
!
!             MemoryContextSwitchTo(old_cxt);
          }
          else
          {
!             /* repalloc will keep the stacks in the same context */
              int            new_alloc = afterTriggers.maxtransdepth * 2;

!             afterTriggers.state_stack = (SetConstraintState *)
!                 repalloc(afterTriggers.state_stack,
!                          new_alloc * sizeof(SetConstraintState));
!             afterTriggers.events_stack = (AfterTriggerEventList *)
!                 repalloc(afterTriggers.events_stack,
!                          new_alloc * sizeof(AfterTriggerEventList));
!             afterTriggers.depth_stack = (int *)
!                 repalloc(afterTriggers.depth_stack,
!                          new_alloc * sizeof(int));
!             afterTriggers.firing_stack = (CommandId *)
!                 repalloc(afterTriggers.firing_stack,
!                          new_alloc * sizeof(CommandId));
              afterTriggers.maxtransdepth = new_alloc;
          }
      }
--- 4735,4762 ----
      int            my_level = GetCurrentTransactionNestLevel();

      /*
!      * Allocate more space in the trans_stack if needed.  (Note: because the
       * minimum nest level of a subtransaction is 2, we waste the first couple
!      * entries of the array; not worth the notational effort to avoid it.)
       */
      while (my_level >= afterTriggers.maxtransdepth)
      {
          if (afterTriggers.maxtransdepth == 0)
          {
!             /* Arbitrarily initialize for max of 8 subtransaction levels */
!             afterTriggers.trans_stack = (AfterTriggersTransData *)
!                 MemoryContextAlloc(TopTransactionContext,
!                                    8 * sizeof(AfterTriggersTransData));
!             afterTriggers.maxtransdepth = 8;
          }
          else
          {
!             /* repalloc will keep the stack in the same context */
              int            new_alloc = afterTriggers.maxtransdepth * 2;

!             afterTriggers.trans_stack = (AfterTriggersTransData *)
!                 repalloc(afterTriggers.trans_stack,
!                          new_alloc * sizeof(AfterTriggersTransData));
              afterTriggers.maxtransdepth = new_alloc;
          }
      }
*************** AfterTriggerBeginSubXact(void)
*** 4604,4613 ****
       * is not saved until/unless changed.  Likewise, we don't make a
       * per-subtransaction event context until needed.
       */
!     afterTriggers.state_stack[my_level] = NULL;
!     afterTriggers.events_stack[my_level] = afterTriggers.events;
!     afterTriggers.depth_stack[my_level] = afterTriggers.query_depth;
!     afterTriggers.firing_stack[my_level] = afterTriggers.firing_counter;
  }

  /*
--- 4766,4775 ----
       * is not saved until/unless changed.  Likewise, we don't make a
       * per-subtransaction event context until needed.
       */
!     afterTriggers.trans_stack[my_level].state = NULL;
!     afterTriggers.trans_stack[my_level].events = afterTriggers.events;
!     afterTriggers.trans_stack[my_level].query_depth = afterTriggers.query_depth;
!     afterTriggers.trans_stack[my_level].firing_counter = afterTriggers.firing_counter;
  }

  /*
*************** AfterTriggerEndSubXact(bool isCommit)
*** 4631,4700 ****
      {
          Assert(my_level < afterTriggers.maxtransdepth);
          /* If we saved a prior state, we don't need it anymore */
!         state = afterTriggers.state_stack[my_level];
          if (state != NULL)
              pfree(state);
          /* this avoids double pfree if error later: */
!         afterTriggers.state_stack[my_level] = NULL;
          Assert(afterTriggers.query_depth ==
!                afterTriggers.depth_stack[my_level]);
      }
      else
      {
          /*
           * Aborting.  It is possible subxact start failed before calling
           * AfterTriggerBeginSubXact, in which case we mustn't risk touching
!          * stack levels that aren't there.
           */
          if (my_level >= afterTriggers.maxtransdepth)
              return;

          /*
!          * Release any event lists from queries being aborted, and restore
           * query_depth to its pre-subxact value.  This assumes that a
           * subtransaction will not add events to query levels started in a
           * earlier transaction state.
           */
!         while (afterTriggers.query_depth > afterTriggers.depth_stack[my_level])
          {
              if (afterTriggers.query_depth < afterTriggers.maxquerydepth)
!             {
!                 Tuplestorestate *ts;
!
!                 ts = afterTriggers.fdw_tuplestores[afterTriggers.query_depth];
!                 if (ts)
!                 {
!                     tuplestore_end(ts);
!                     afterTriggers.fdw_tuplestores[afterTriggers.query_depth] = NULL;
!                 }
!
!                 afterTriggerFreeEventList(&afterTriggers.query_stack[afterTriggers.query_depth]);
!             }
!
              afterTriggers.query_depth--;
          }
          Assert(afterTriggers.query_depth ==
!                afterTriggers.depth_stack[my_level]);

          /*
           * Restore the global deferred-event list to its former length,
           * discarding any events queued by the subxact.
           */
          afterTriggerRestoreEventList(&afterTriggers.events,
!                                      &afterTriggers.events_stack[my_level]);

          /*
           * Restore the trigger state.  If the saved state is NULL, then this
           * subxact didn't save it, so it doesn't need restoring.
           */
!         state = afterTriggers.state_stack[my_level];
          if (state != NULL)
          {
              pfree(afterTriggers.state);
              afterTriggers.state = state;
          }
          /* this avoids double pfree if error later: */
!         afterTriggers.state_stack[my_level] = NULL;

          /*
           * Scan for any remaining deferred events that were marked DONE or IN
--- 4793,4850 ----
      {
          Assert(my_level < afterTriggers.maxtransdepth);
          /* If we saved a prior state, we don't need it anymore */
!         state = afterTriggers.trans_stack[my_level].state;
          if (state != NULL)
              pfree(state);
          /* this avoids double pfree if error later: */
!         afterTriggers.trans_stack[my_level].state = NULL;
          Assert(afterTriggers.query_depth ==
!                afterTriggers.trans_stack[my_level].query_depth);
      }
      else
      {
          /*
           * Aborting.  It is possible subxact start failed before calling
           * AfterTriggerBeginSubXact, in which case we mustn't risk touching
!          * trans_stack levels that aren't there.
           */
          if (my_level >= afterTriggers.maxtransdepth)
              return;

          /*
!          * Release query-level storage for queries being aborted, and restore
           * query_depth to its pre-subxact value.  This assumes that a
           * subtransaction will not add events to query levels started in a
           * earlier transaction state.
           */
!         while (afterTriggers.query_depth > afterTriggers.trans_stack[my_level].query_depth)
          {
              if (afterTriggers.query_depth < afterTriggers.maxquerydepth)
!                 AfterTriggerFreeQuery(&afterTriggers.query_stack[afterTriggers.query_depth]);
              afterTriggers.query_depth--;
          }
          Assert(afterTriggers.query_depth ==
!                afterTriggers.trans_stack[my_level].query_depth);

          /*
           * Restore the global deferred-event list to its former length,
           * discarding any events queued by the subxact.
           */
          afterTriggerRestoreEventList(&afterTriggers.events,
!                                      &afterTriggers.trans_stack[my_level].events);

          /*
           * Restore the trigger state.  If the saved state is NULL, then this
           * subxact didn't save it, so it doesn't need restoring.
           */
!         state = afterTriggers.trans_stack[my_level].state;
          if (state != NULL)
          {
              pfree(afterTriggers.state);
              afterTriggers.state = state;
          }
          /* this avoids double pfree if error later: */
!         afterTriggers.trans_stack[my_level].state = NULL;

          /*
           * Scan for any remaining deferred events that were marked DONE or IN
*************** AfterTriggerEndSubXact(bool isCommit)
*** 4704,4710 ****
           * (This essentially assumes that the current subxact includes all
           * subxacts started after it.)
           */
!         subxact_firing_id = afterTriggers.firing_stack[my_level];
          for_each_event_chunk(event, chunk, afterTriggers.events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);
--- 4854,4860 ----
           * (This essentially assumes that the current subxact includes all
           * subxacts started after it.)
           */
!         subxact_firing_id = afterTriggers.trans_stack[my_level].firing_counter;
          for_each_event_chunk(event, chunk, afterTriggers.events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);
*************** AfterTriggerEnlargeQueryState(void)
*** 4740,4751 ****
      {
          int            new_alloc = Max(afterTriggers.query_depth + 1, 8);

!         afterTriggers.query_stack = (AfterTriggerEventList *)
              MemoryContextAlloc(TopTransactionContext,
!                                new_alloc * sizeof(AfterTriggerEventList));
!         afterTriggers.fdw_tuplestores = (Tuplestorestate **)
!             MemoryContextAllocZero(TopTransactionContext,
!                                    new_alloc * sizeof(Tuplestorestate *));
          afterTriggers.maxquerydepth = new_alloc;
      }
      else
--- 4890,4898 ----
      {
          int            new_alloc = Max(afterTriggers.query_depth + 1, 8);

!         afterTriggers.query_stack = (AfterTriggersQueryData *)
              MemoryContextAlloc(TopTransactionContext,
!                                new_alloc * sizeof(AfterTriggersQueryData));
          afterTriggers.maxquerydepth = new_alloc;
      }
      else
*************** AfterTriggerEnlargeQueryState(void)
*** 4755,4781 ****
          int            new_alloc = Max(afterTriggers.query_depth + 1,
                                      old_alloc * 2);

!         afterTriggers.query_stack = (AfterTriggerEventList *)
              repalloc(afterTriggers.query_stack,
!                      new_alloc * sizeof(AfterTriggerEventList));
!         afterTriggers.fdw_tuplestores = (Tuplestorestate **)
!             repalloc(afterTriggers.fdw_tuplestores,
!                      new_alloc * sizeof(Tuplestorestate *));
!         /* Clear newly-allocated slots for subsequent lazy initialization. */
!         memset(afterTriggers.fdw_tuplestores + old_alloc,
!                0, (new_alloc - old_alloc) * sizeof(Tuplestorestate *));
          afterTriggers.maxquerydepth = new_alloc;
      }

!     /* Initialize new query lists to empty */
      while (init_depth < afterTriggers.maxquerydepth)
      {
!         AfterTriggerEventList *events;

!         events = &afterTriggers.query_stack[init_depth];
!         events->head = NULL;
!         events->tail = NULL;
!         events->tailfree = NULL;

          ++init_depth;
      }
--- 4902,4923 ----
          int            new_alloc = Max(afterTriggers.query_depth + 1,
                                      old_alloc * 2);

!         afterTriggers.query_stack = (AfterTriggersQueryData *)
              repalloc(afterTriggers.query_stack,
!                      new_alloc * sizeof(AfterTriggersQueryData));
          afterTriggers.maxquerydepth = new_alloc;
      }

!     /* Initialize new array entries to empty */
      while (init_depth < afterTriggers.maxquerydepth)
      {
!         AfterTriggersQueryData *qs = &afterTriggers.query_stack[init_depth];

!         qs->events.head = NULL;
!         qs->events.tail = NULL;
!         qs->events.tailfree = NULL;
!         qs->fdw_tuplestore = NULL;
!         qs->tables = NIL;

          ++init_depth;
      }
*************** AfterTriggerSetState(ConstraintsSetStmt
*** 4873,4881 ****
       * save it so it can be restored if the subtransaction aborts.
       */
      if (my_level > 1 &&
!         afterTriggers.state_stack[my_level] == NULL)
      {
!         afterTriggers.state_stack[my_level] =
              SetConstraintStateCopy(afterTriggers.state);
      }

--- 5015,5023 ----
       * save it so it can be restored if the subtransaction aborts.
       */
      if (my_level > 1 &&
!         afterTriggers.trans_stack[my_level].state == NULL)
      {
!         afterTriggers.trans_stack[my_level].state =
              SetConstraintStateCopy(afterTriggers.state);
      }

*************** AfterTriggerPendingOnRel(Oid relid)
*** 5184,5190 ****
       */
      for (depth = 0; depth <= afterTriggers.query_depth && depth < afterTriggers.maxquerydepth; depth++)
      {
!         for_each_event_chunk(event, chunk, afterTriggers.query_stack[depth])
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);

--- 5326,5332 ----
       */
      for (depth = 0; depth <= afterTriggers.query_depth && depth < afterTriggers.maxquerydepth; depth++)
      {
!         for_each_event_chunk(event, chunk, afterTriggers.query_stack[depth].events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);

*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5229,5235 ****
      TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
      AfterTriggerEventData new_event;
      AfterTriggerSharedData new_shared;
!     char        relkind = relinfo->ri_RelationDesc->rd_rel->relkind;
      int            tgtype_event;
      int            tgtype_level;
      int            i;
--- 5371,5378 ----
      TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
      AfterTriggerEventData new_event;
      AfterTriggerSharedData new_shared;
!     char        relkind = rel->rd_rel->relkind;
!     AfterTriggersTableData *table;
      int            tgtype_event;
      int            tgtype_level;
      int            i;
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5284,5293 ****
              Tuplestorestate *new_tuplestore;

              Assert(newtup != NULL);
!             if (event == TRIGGER_EVENT_INSERT)
!                 new_tuplestore = transition_capture->tcs_insert_tuplestore;
!             else
!                 new_tuplestore = transition_capture->tcs_update_tuplestore;

              if (original_insert_tuple != NULL)
                  tuplestore_puttuple(new_tuplestore, original_insert_tuple);
--- 5427,5433 ----
              Tuplestorestate *new_tuplestore;

              Assert(newtup != NULL);
!             new_tuplestore = transition_capture->tcs_new_tuplestore;

              if (original_insert_tuple != NULL)
                  tuplestore_puttuple(new_tuplestore, original_insert_tuple);
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5316,5321 ****
--- 5456,5464 ----
       * The event code will be used both as a bitmask and an array offset, so
       * validation is important to make sure we don't walk off the edge of our
       * arrays.
+      *
+      * Also, if we're considering statement-level triggers, check whether we
+      * already queued them for this event set, and return if so.
       */
      switch (event)
      {
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5334,5339 ****
--- 5477,5487 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 table = GetAfterTriggersTableData(RelationGetRelid(rel),
+                                                   CMD_INSERT);
+                 if (table->stmt_trig_done)
+                     return;
+                 table->stmt_trig_done = true;
              }
              break;
          case TRIGGER_EVENT_DELETE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5351,5356 ****
--- 5499,5509 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 table = GetAfterTriggersTableData(RelationGetRelid(rel),
+                                                   CMD_DELETE);
+                 if (table->stmt_trig_done)
+                     return;
+                 table->stmt_trig_done = true;
              }
              break;
          case TRIGGER_EVENT_UPDATE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5368,5373 ****
--- 5521,5531 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 table = GetAfterTriggersTableData(RelationGetRelid(rel),
+                                                   CMD_UPDATE);
+                 if (table->stmt_trig_done)
+                     return;
+                 table->stmt_trig_done = true;
              }
              break;
          case TRIGGER_EVENT_TRUNCATE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5407,5415 ****
          {
              if (fdw_tuplestore == NULL)
              {
!                 fdw_tuplestore =
!                     GetTriggerTransitionTuplestore
!                     (afterTriggers.fdw_tuplestores);
                  new_event.ate_flags = AFTER_TRIGGER_FDW_FETCH;
              }
              else
--- 5565,5571 ----
          {
              if (fdw_tuplestore == NULL)
              {
!                 fdw_tuplestore = GetCurrentFDWTuplestore();
                  new_event.ate_flags = AFTER_TRIGGER_FDW_FETCH;
              }
              else
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5474,5484 ****
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         /* deferrable triggers cannot access transition data */
!         new_shared.ats_transition_capture =
!             trigger->tgdeferrable ? NULL : transition_capture;

!         afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth],
                               &new_event, &new_shared);
      }

--- 5630,5648 ----
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         /* deferrable triggers cannot access transition tables */
!         if (trigger->tgdeferrable || transition_capture == NULL)
!         {
!             new_shared.ats_old_tuplestore = NULL;
!             new_shared.ats_new_tuplestore = NULL;
!         }
!         else
!         {
!             new_shared.ats_old_tuplestore = transition_capture->tcs_old_tuplestore;
!             new_shared.ats_new_tuplestore = transition_capture->tcs_new_tuplestore;
!         }

!         afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
                               &new_event, &new_shared);
      }

diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..b3e74aa 100644
*** a/src/backend/executor/README
--- b/src/backend/executor/README
*************** This is a sketch of control flow for ful
*** 241,251 ****
          CreateExecutorState
              creates per-query context
          switch to per-query context to run ExecInitNode
          ExecInitNode --- recursively scans plan tree
              CreateExprContext
                  creates per-tuple context
              ExecInitExpr
-         AfterTriggerBeginQuery

      ExecutorRun
          ExecProcNode --- recursively called in per-query context
--- 241,251 ----
          CreateExecutorState
              creates per-query context
          switch to per-query context to run ExecInitNode
+         AfterTriggerBeginQuery
          ExecInitNode --- recursively scans plan tree
              CreateExprContext
                  creates per-tuple context
              ExecInitExpr

      ExecutorRun
          ExecProcNode --- recursively called in per-query context
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d4..6255cc6 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** standard_ExecutorStart(QueryDesc *queryD
*** 252,268 ****
      estate->es_instrument = queryDesc->instrument_options;

      /*
-      * Initialize the plan state tree
-      */
-     InitPlan(queryDesc, eflags);
-
-     /*
       * Set up an AFTER-trigger statement context, unless told not to, or
       * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
       */
      if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
          AfterTriggerBeginQuery();

      MemoryContextSwitchTo(oldcontext);
  }

--- 252,268 ----
      estate->es_instrument = queryDesc->instrument_options;

      /*
       * Set up an AFTER-trigger statement context, unless told not to, or
       * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
       */
      if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
          AfterTriggerBeginQuery();

+     /*
+      * Initialize the plan state tree
+      */
+     InitPlan(queryDesc, eflags);
+
      MemoryContextSwitchTo(oldcontext);
  }

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 49586a3..845c409 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ExecInsert(ModifyTableState *mtstate,
*** 343,348 ****
--- 343,351 ----
                  mtstate->mt_transition_capture->tcs_map = NULL;
              }
          }
+         if (mtstate->mt_oc_transition_capture != NULL)
+             mtstate->mt_oc_transition_capture->tcs_map =
+                 mtstate->mt_transition_tupconv_maps[leaf_part_index];

          /*
           * We might need to convert from the parent rowtype to the partition
*************** lreplace:;
*** 1158,1163 ****
--- 1161,1168 ----
      /* AFTER ROW UPDATE Triggers */
      ExecARUpdateTriggers(estate, resultRelInfo, tupleid, oldtuple, tuple,
                           recheckIndexes,
+                          mtstate->operation == CMD_INSERT ?
+                          mtstate->mt_oc_transition_capture :
                           mtstate->mt_transition_capture);

      list_free(recheckIndexes);
*************** fireASTriggers(ModifyTableState *node)
*** 1444,1450 ****
              if (node->mt_onconflict == ONCONFLICT_UPDATE)
                  ExecASUpdateTriggers(node->ps.state,
                                       resultRelInfo,
!                                      node->mt_transition_capture);
              ExecASInsertTriggers(node->ps.state, resultRelInfo,
                                   node->mt_transition_capture);
              break;
--- 1449,1455 ----
              if (node->mt_onconflict == ONCONFLICT_UPDATE)
                  ExecASUpdateTriggers(node->ps.state,
                                       resultRelInfo,
!                                      node->mt_oc_transition_capture);
              ExecASInsertTriggers(node->ps.state, resultRelInfo,
                                   node->mt_transition_capture);
              break;
*************** ExecSetupTransitionCaptureState(ModifyTa
*** 1474,1487 ****

      /* Check for transition tables on the directly targeted relation. */
      mtstate->mt_transition_capture =
!         MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);

      /*
       * If we found that we need to collect transition tuples then we may also
       * need tuple conversion maps for any children that have TupleDescs that
!      * aren't compatible with the tuplestores.
       */
!     if (mtstate->mt_transition_capture != NULL)
      {
          ResultRelInfo *resultRelInfos;
          int            numResultRelInfos;
--- 1479,1502 ----

      /* Check for transition tables on the directly targeted relation. */
      mtstate->mt_transition_capture =
!         MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc,
!                                    RelationGetRelid(targetRelInfo->ri_RelationDesc),
!                                    mtstate->operation);
!     if (mtstate->operation == CMD_INSERT &&
!         mtstate->mt_onconflict == ONCONFLICT_UPDATE)
!         mtstate->mt_oc_transition_capture =
!             MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc,
!                                        RelationGetRelid(targetRelInfo->ri_RelationDesc),
!                                        CMD_UPDATE);

      /*
       * If we found that we need to collect transition tuples then we may also
       * need tuple conversion maps for any children that have TupleDescs that
!      * aren't compatible with the tuplestores.  (We can share these maps
!      * between the regular and ON CONFLICT cases.)
       */
!     if (mtstate->mt_transition_capture != NULL ||
!         mtstate->mt_oc_transition_capture != NULL)
      {
          ResultRelInfo *resultRelInfos;
          int            numResultRelInfos;
*************** ExecSetupTransitionCaptureState(ModifyTa
*** 1522,1531 ****
          /*
           * Install the conversion map for the first plan for UPDATE and DELETE
           * operations.  It will be advanced each time we switch to the next
!          * plan.  (INSERT operations set it every time.)
           */
!         mtstate->mt_transition_capture->tcs_map =
!             mtstate->mt_transition_tupconv_maps[0];
      }
  }

--- 1537,1548 ----
          /*
           * Install the conversion map for the first plan for UPDATE and DELETE
           * operations.  It will be advanced each time we switch to the next
!          * plan.  (INSERT operations set it every time, so we need not update
!          * mtstate->mt_oc_transition_capture here.)
           */
!         if (mtstate->mt_transition_capture)
!             mtstate->mt_transition_capture->tcs_map =
!                 mtstate->mt_transition_tupconv_maps[0];
      }
  }

*************** ExecModifyTable(PlanState *pstate)
*** 1629,1641 ****
                  estate->es_result_relation_info = resultRelInfo;
                  EvalPlanQualSetPlan(&node->mt_epqstate, subplanstate->plan,
                                      node->mt_arowmarks[node->mt_whichplan]);
                  if (node->mt_transition_capture != NULL)
                  {
-                     /* Prepare to convert transition tuples from this child. */
                      Assert(node->mt_transition_tupconv_maps != NULL);
                      node->mt_transition_capture->tcs_map =
                          node->mt_transition_tupconv_maps[node->mt_whichplan];
                  }
                  continue;
              }
              else
--- 1646,1664 ----
                  estate->es_result_relation_info = resultRelInfo;
                  EvalPlanQualSetPlan(&node->mt_epqstate, subplanstate->plan,
                                      node->mt_arowmarks[node->mt_whichplan]);
+                 /* Prepare to convert transition tuples from this child. */
                  if (node->mt_transition_capture != NULL)
                  {
                      Assert(node->mt_transition_tupconv_maps != NULL);
                      node->mt_transition_capture->tcs_map =
                          node->mt_transition_tupconv_maps[node->mt_whichplan];
                  }
+                 if (node->mt_oc_transition_capture != NULL)
+                 {
+                     Assert(node->mt_transition_tupconv_maps != NULL);
+                     node->mt_oc_transition_capture->tcs_map =
+                         node->mt_transition_tupconv_maps[node->mt_whichplan];
+                 }
                  continue;
              }
              else
*************** ExecInitModifyTable(ModifyTable *node, E
*** 1934,1941 ****
          mtstate->mt_partition_tuple_slot = partition_tuple_slot;
      }

!     /* Build state for collecting transition tuples */
!     ExecSetupTransitionCaptureState(mtstate, estate);

      /*
       * Initialize any WITH CHECK OPTION constraints if needed.
--- 1957,1968 ----
          mtstate->mt_partition_tuple_slot = partition_tuple_slot;
      }

!     /*
!      * Build state for collecting transition tuples.  This requires having a
!      * valid trigger query context, so skip it in explain-only mode.
!      */
!     if (!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
!         ExecSetupTransitionCaptureState(mtstate, estate);

      /*
       * Initialize any WITH CHECK OPTION constraints if needed.
*************** ExecEndModifyTable(ModifyTableState *nod
*** 2319,2334 ****
      int            i;

      /*
-      * Free transition tables, unless this query is being run in
-      * EXEC_FLAG_SKIP_TRIGGERS mode, which means that it may have queued AFTER
-      * triggers that won't be run till later.  In that case we'll just leak
-      * the transition tables till end of (sub)transaction.
-      */
-     if (node->mt_transition_capture != NULL &&
-         !(node->ps.state->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
-         DestroyTransitionCaptureState(node->mt_transition_capture);
-
-     /*
       * Allow any FDWs to shut down
       */
      for (i = 0; i < node->mt_nplans; i++)
--- 2346,2351 ----
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index aeb363f..c6a5684 100644
*** a/src/include/commands/trigger.h
--- b/src/include/commands/trigger.h
*************** typedef struct TriggerData
*** 43,49 ****

  /*
   * The state for capturing old and new tuples into transition tables for a
!  * single ModifyTable node.
   */
  typedef struct TransitionCaptureState
  {
--- 43,54 ----

  /*
   * The state for capturing old and new tuples into transition tables for a
!  * single ModifyTable node (or other operation source, e.g. copy.c).
!  *
!  * This is per-caller to avoid conflicts in setting tcs_map or
!  * tcs_original_insert_tuple.  Note, however, that the pointed-to
!  * tuplestores may be shared across multiple callers; they are managed
!  * by trigger.c.
   */
  typedef struct TransitionCaptureState
  {
*************** typedef struct TransitionCaptureState
*** 60,66 ****
       * For UPDATE and DELETE, AfterTriggerSaveEvent may need to convert the
       * new and old tuples from a child table's format to the format of the
       * relation named in a query so that it is compatible with the transition
!      * tuplestores.
       */
      TupleConversionMap *tcs_map;

--- 65,71 ----
       * For UPDATE and DELETE, AfterTriggerSaveEvent may need to convert the
       * new and old tuples from a child table's format to the format of the
       * relation named in a query so that it is compatible with the transition
!      * tuplestores.  The caller must store the conversion map here if so.
       */
      TupleConversionMap *tcs_map;

*************** typedef struct TransitionCaptureState
*** 74,90 ****
      HeapTuple    tcs_original_insert_tuple;

      /*
!      * The tuplestores backing the transition tables.  We use separate
!      * tuplestores for INSERT and UPDATE, because INSERT ... ON CONFLICT ...
!      * DO UPDATE causes INSERT and UPDATE triggers to fire and needs a way to
!      * keep track of the new tuple images resulting from the two cases
!      * separately.  We only need a single old image tuplestore, because there
!      * is no statement that can both update and delete at the same time.
       */
!     Tuplestorestate *tcs_old_tuplestore;    /* for DELETE and UPDATE old
!                                              * images */
!     Tuplestorestate *tcs_insert_tuplestore; /* for INSERT new images */
!     Tuplestorestate *tcs_update_tuplestore; /* for UPDATE new images */
  } TransitionCaptureState;

  /*
--- 79,91 ----
      HeapTuple    tcs_original_insert_tuple;

      /*
!      * The tuplestore(s) into which to insert tuples.  Either may be NULL if
!      * not needed for the operation type.
       */
!     Tuplestorestate *tcs_old_tuplestore;    /* for DELETE and UPDATE-old
!                                              * tuples */
!     Tuplestorestate *tcs_new_tuplestore;    /* for INSERT and UPDATE-new
!                                              * tuples */
  } TransitionCaptureState;

  /*
*************** extern void RelationBuildTriggers(Relati
*** 174,181 ****
  extern TriggerDesc *CopyTriggerDesc(TriggerDesc *trigdesc);

  extern const char *FindTriggerIncompatibleWithInheritance(TriggerDesc *trigdesc);
! extern TransitionCaptureState *MakeTransitionCaptureState(TriggerDesc *trigdesc);
! extern void DestroyTransitionCaptureState(TransitionCaptureState *tcs);

  extern void FreeTriggerDesc(TriggerDesc *trigdesc);

--- 175,183 ----
  extern TriggerDesc *CopyTriggerDesc(TriggerDesc *trigdesc);

  extern const char *FindTriggerIncompatibleWithInheritance(TriggerDesc *trigdesc);
!
! extern TransitionCaptureState *MakeTransitionCaptureState(TriggerDesc *trigdesc,
!                            Oid relid, CmdType cmdType);

  extern void FreeTriggerDesc(TriggerDesc *trigdesc);

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..c6d3021 100644
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct ModifyTableState
*** 983,989 ****
      /* Per partition tuple conversion map */
      TupleTableSlot *mt_partition_tuple_slot;
      struct TransitionCaptureState *mt_transition_capture;
!     /* controls transition table population */
      TupleConversionMap **mt_transition_tupconv_maps;
      /* Per plan/partition tuple conversion */
  } ModifyTableState;
--- 983,991 ----
      /* Per partition tuple conversion map */
      TupleTableSlot *mt_partition_tuple_slot;
      struct TransitionCaptureState *mt_transition_capture;
!     /* controls transition table population for specified operation */
!     struct TransitionCaptureState *mt_oc_transition_capture;
!     /* controls transition table population for INSERT...ON CONFLICT UPDATE */
      TupleConversionMap **mt_transition_tupconv_maps;
      /* Per plan/partition tuple conversion */
  } ModifyTableState;
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 620fac1..0987be5 100644
*** a/src/test/regress/expected/triggers.out
--- b/src/test/regress/expected/triggers.out
*************** with wcte as (insert into table1 values
*** 2217,2222 ****
--- 2217,2239 ----
    insert into table2 values ('hello world');
  NOTICE:  trigger = table2_trig, new table = ("hello world")
  NOTICE:  trigger = table1_trig, new table = (42)
+ with wcte as (insert into table1 values (43))
+   insert into table1 values (44);
+ NOTICE:  trigger = table1_trig, new table = (43), (44)
+ select * from table1;
+  a
+ ----
+  42
+  44
+  43
+ (3 rows)
+
+ select * from table2;
+       a
+ -------------
+  hello world
+ (1 row)
+
  drop table table1;
  drop table table2;
  --
*************** create trigger my_table_multievent_trig
*** 2256,2261 ****
--- 2273,2286 ----
    after insert or update on my_table referencing new table as new_table
    for each statement execute procedure dump_insert();
  ERROR:  transition tables cannot be specified for triggers with more than one event
+ --
+ -- Verify that you can't create a trigger with transition tables with
+ -- a column list.
+ --
+ create trigger my_table_col_update_trig
+   after update of b on my_table referencing new table as new_table
+   for each statement execute procedure dump_insert();
+ ERROR:  transition tables cannot be specified for triggers with column lists
  drop table my_table;
  --
  -- Test firing of triggers with transition tables by foreign key cascades
*************** select * from trig_table;
*** 2299,2306 ****
  (6 rows)

  delete from refd_table where length(b) = 3;
! NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b")
! NOTICE:  trigger = trig_table_delete_trig, old table = (11,"one a"), (11,"one b")
  select * from trig_table;
   a |    b
  ---+---------
--- 2324,2330 ----
  (6 rows)

  delete from refd_table where length(b) = 3;
! NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b"), (11,"one a"), (11,"one b")
  select * from trig_table;
   a |    b
  ---+---------
diff --git a/src/test/regress/sql/triggers.sql b/src/test/regress/sql/triggers.sql
index c6deb56..10eee76 100644
*** a/src/test/regress/sql/triggers.sql
--- b/src/test/regress/sql/triggers.sql
*************** create trigger table2_trig
*** 1729,1734 ****
--- 1729,1740 ----
  with wcte as (insert into table1 values (42))
    insert into table2 values ('hello world');

+ with wcte as (insert into table1 values (43))
+   insert into table1 values (44);
+
+ select * from table1;
+ select * from table2;
+
  drop table table1;
  drop table table2;

*************** create trigger my_table_multievent_trig
*** 1769,1774 ****
--- 1775,1789 ----
    after insert or update on my_table referencing new table as new_table
    for each statement execute procedure dump_insert();

+ --
+ -- Verify that you can't create a trigger with transition tables with
+ -- a column list.
+ --
+
+ create trigger my_table_col_update_trig
+   after update of b on my_table referencing new table as new_table
+   for each statement execute procedure dump_insert();
+
  drop table my_table;

  --

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

15 September 2017, 03:12:35

On Fri, Sep 15, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Attached is a draft patch for this.

Some initial feedback:

Compiles cleanly and make check-world passes here.

with wcte as (insert into table1 values (42))  insert into table2 values ('hello world');NOTICE:  trigger =
table2_trig,new table = ("hello world")NOTICE:  trigger = table1_trig, new table = (42)
 
+with wcte as (insert into table1 values (43))
+  insert into table1 values (44);
+NOTICE:  trigger = table1_trig, new table = (43), (44)

The effects of multiple ModifyTable nodes on the same table are
merged.  Good.  I doubt anyone will notice but it might warrant a note
somewhere that there is a user-visible change here: previously if you
did this (without TTs) your trigger would have fired twice.

+create trigger my_table_col_update_trig
+  after update of b on my_table referencing new table as new_table
+  for each statement execute procedure dump_insert();
+ERROR:  transition tables cannot be specified for triggers with column lists

Potential SQL standard non-compliance avoided.  Good.
delete from refd_table where length(b) = 3;
-NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b")
-NOTICE:  trigger = trig_table_delete_trig, old table = (11,"one a"),
(11,"one b")
+NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"),
(2,"two b"), (11,"one a"), (11,"one b")

The effects of fk cascade machinery are merged as discussed.  Good, I
think, but I'm a bit confused about how this works when the cascading
operation also fires triggers.

All the other tests show no change in behaviour.  Good.

What is going on here?

=== setup ===
create or replace function dump_delete() returns trigger language plpgsql as
$$ begin   raise notice 'trigger = %, old table = %, depth = %',                TG_NAME,                (select
string_agg(old_table::text,', ' order by a)
 
from old_table),                pg_trigger_depth();   return null; end;
$$;
create table foo (a int primary key, b int references foo(a) on delete cascade);
create trigger foo_s_trig after delete on foo referencing old table as old_table for each statement execute procedure
dump_delete();
create trigger foo_r_trig after delete on foo referencing old table as old_table for each row execute procedure
dump_delete();
insert into foo values (1, null), (2, 1);
===8<===

postgres=# delete from foo where a = 1;
NOTICE:  trigger = foo_r_trig, old table = (1,), depth = 1
NOTICE:  trigger = foo_s_trig, old table = (1,), depth = 1
NOTICE:  trigger = foo_r_trig, old table = (2,1), depth = 1
NOTICE:  trigger = foo_s_trig, old table = (2,1), depth = 1
NOTICE:  trigger = foo_s_trig, old table = <NULL>, depth = 1
DELETE 1

> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> 1.  Merging the transition tables when there are multiple wCTEs
>> referencing the same table.  Here's one idea:  Rename
>> MakeTransitionCaptureState() to GetTransitionCaptureState() and use a
>> hash table keyed by table OID in
>> afterTriggers.transition_capture_states[afterTriggers.query_depth] to
>> find the TCS for the given TriggerDesc or create it if not found, so
>> that all wCTEs find the same TransitionCaptureState object.  The all
>> existing callers continue to do what they're doing now, but they'll be
>> sharing TCSs appropriately with other things in the plan.  Note that
>> TransitionCaptureState already holds tuplestores for each operation
>> (INSERT, UPDATE, DELETE) so the OID of the table alone is a suitable
>> key for the hash table (assuming we are ignoring the column-list part
>> of the spec as you suggested).
>
> It seems unsafe to merge the TCS objects themselves, because the callers
> assume that they can munge the tcs_map and tcs_original_insert_tuple
> fields freely without regard for any other callers.  So as I have it,
> we still have a TCS for each caller, but the TCSes point at tuplestores
> that can be shared across multiple callers for the same event type.
> The tuplestores themselves are managed by the AfterTrigger data
> structures.  Also, because the TCS structs are just throwaway per-caller
> data, it's uncool to reference them in the trigger event lists.
> So I replaced ats_transition_capture with two pointers to the actual
> tuplestores.

Ok, that works and yeah it may be conceptually better.  Also, maybe
the tcs_original_insert_tuple as member of TCS is a bit clunky and
could be reconsidered later: I was trying to avoid widening a bunch of
function calls, but that might have been optimising for the wrong
thing...

> That bloats AfterTriggerSharedData a bit but I think it's
> okay; we don't expect a lot of those structs in a normal query.

Yeah, it was bloat avoidance that led me to find a way to use a single
pointer.  But at least it's in the shared part, so I don't think it
matters too much.

> I chose to make the persistent state (AfterTriggersTableData) independent
> for each operation type.  We could have done that differently perhaps, but
> it seemed more complicated and less likely to match the spec's semantics.

OK.  Your GetAfterTriggersTableData(Oid relid, CmdType cmdType) is
mirroring the spec's way of describing what happens (without the
column list).

+       foreach(lc, qs->tables)
+       {
+               table = (AfterTriggersTableData *) lfirst(lc);
+               if (table->relid == relid && table->cmdType == cmdType &&
+                       !table->closed)
+                       return table;
+       }

Yeah, my suggestion of a hash table was overkill.  (Maybe in future if
we change our rules around inheritance this could finish up being
searched for a lot of child tables; we can cross that bridge when we
come to it.)

I'm a little confused about the "closed" flag.  This seems to imply
that it's possible for there to be more than one separate tuplestore
for a given (table, operation) in a given trigger execution context.
Why is that OK?

> The INSERT ON CONFLICT UPDATE mess is handled by creating two separate
> TCSes with two different underlying AfterTriggersTableData structs.
> The insertion tuplestore sees only the inserted tuples, the update
> tuplestores see only the updated-pre-existing tuples.  That adds a little
> code to nodeModifyTable but it seems conceptually much cleaner.

OK.  The resulting behaviour is unchanged.

>> 3.  Merging the invocation after statement firing so that if you
>> updated the same table directly and also via a wCTE and also
>> indirectly via fk ON DELETE/UPDATE trigger then you still only get one
>> invocation of the after statement trigger.  Not sure exactly how...
>
> What I did here was to use the AfterTriggersTableData structs to hold
> a flag saying we'd already queued statement triggers for this rel and
> cmdType.  There's probably more than one way to do that, but this seemed
> convenient.

Seems good.  That implements the following (from whatever random draft
spec I have): "A statement-level trigger that is considered as
executed for a state change SC (in a given trigger execution context)
is not subsequently executed for SC."

> One thing I don't like too much about that is that it means there are
> cases where the single statement trigger firing would occur before some
> AFTER ROW trigger firings.  Not sure if we promise anything about the
> ordering in the docs.  It looks quite expensive/complicated to try to
> make it always happen afterwards, though, and it might well be totally
> impossible if triggers cause more table updates to occur.

I suppose that it might be possible for AfterTriggersTableData to
record the location of a previously queued event so that you can later
disable it and queue a replacement, with the effect of suppressing
earlier firings rather than later ones.  That might make sense if you
think that after statement triggers should fire after all row
triggers.  I can't figure out from the spec whether that's expected
and I'm not sure if it's useful.

> Because MakeTransitionCaptureState now depends on the trigger query
> level being active, I had to relocate the AfterTriggerBeginQuery calls
> to occur before it.

Right.

> In passing, I refactored the AfterTriggers data structures a bit so
> that we don't need to do as many palloc calls to manage them.  Instead
> of several independent arrays there's now one array of structs.

Good improvement.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

15 September 2017, 04:10:17

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, Sep 15, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Attached is a draft patch for this.

> Some initial feedback:

> The effects of multiple ModifyTable nodes on the same table are
> merged.  Good.  I doubt anyone will notice but it might warrant a note
> somewhere that there is a user-visible change here: previously if you
> did this (without TTs) your trigger would have fired twice.

Check.  I've not yet looked at the documentation angle here.

> What is going on here?

Hmm ... that looks odd, but I'm too tired to debug it right now.

> Ok, that works and yeah it may be conceptually better.  Also, maybe
> the tcs_original_insert_tuple as member of TCS is a bit clunky and
> could be reconsidered later: I was trying to avoid widening a bunch of
> function calls, but that might have been optimising for the wrong
> thing...

Yeah, both tcs_map and tcs_original_insert_tuple feel like they ought
to be function parameters not persistent state.  But we can leave that
sort of refactoring for later.

> Yeah, my suggestion of a hash table was overkill.  (Maybe in future if
> we change our rules around inheritance this could finish up being
> searched for a lot of child tables; we can cross that bridge when we
> come to it.)

Exactly; I doubt it matters now, and if it ever does we can improve
the lookup infrastructure later.

> I'm a little confused about the "closed" flag.  This seems to imply
> that it's possible for there to be more than one separate tuplestore
> for a given (table, operation) in a given trigger execution context.
> Why is that OK?

The thought I had was that once we've fired any triggers that look at
a transition table, we don't want that table to change anymore; it'd
be bad if different triggers on the same event saw different tables.
So the "closed" flag is intended to shut off additional tuple additions.
If any happen (which I'm not sure if it's possible without a rogue
C-coded trigger), we'll establish a new set of transition tables for
them, and if relevant, fire AFTER STATEMENT triggers again to process
those tables.  So this sort of breaks the "one transition table and
one AFTER STATEMENT trigger firing per statement" concept, but the
alternative seems worse.  I hope that the case isn't reachable without
weird trigger behavior, though.

> I suppose that it might be possible for AfterTriggersTableData to
> record the location of a previously queued event so that you can later
> disable it and queue a replacement, with the effect of suppressing
> earlier firings rather than later ones.  That might make sense if you
> think that after statement triggers should fire after all row
> triggers.  I can't figure out from the spec whether that's expected
> and I'm not sure if it's useful.

Hm.  There could be more than one such event, but maybe we can make
something of that idea.  We could expect that the events in question
are consecutive in the event list, I think ...
        regards, tom lane

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

15 September 2017, 18:28:34

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> What is going on here?
> ...
> insert into foo values (1, null), (2, 1);
>
> postgres=# delete from foo where a = 1;
> NOTICE:  trigger = foo_r_trig, old table = (1,), depth = 1
> NOTICE:  trigger = foo_s_trig, old table = (1,), depth = 1
> NOTICE:  trigger = foo_r_trig, old table = (2,1), depth = 1
> NOTICE:  trigger = foo_s_trig, old table = (2,1), depth = 1
> NOTICE:  trigger = foo_s_trig, old table = <NULL>, depth = 1
> DELETE 1

OK, now that I'm a little more awake, I looked at this, and I think it's
actually an instance of the "closed transition table" behavior you were
asking about.  Initially, we perform the delete of the (1,null) row,
and that queues AFTER ROW triggers -- both the RI enforcement one and
the explicit foo_r_trig one -- and puts the row into the transition table.
When the executor finishes that scan it queues the foo_s_trig AFTER
STATEMENT trigger.  Then we start to execute the triggers, and as I have
the patch now, it first marks all the transition tables closed.  I think
that the RI enforcement trigger fires before foo_r_trig on the basis of
name order, but it doesn't actually matter if it fires first or second.
Either way, it causes a cascaded delete of the (2,1) row, and again
we queue AFTER ROW triggers and put the row into the transition table.
But now, since the first transition table was already marked closed,
we create a new transition table and put (2,1) into it.  And then we
queue foo_s_trig, and since we're looking at a new (not closed)
AfterTriggersTableData struct, that's allowed to happen.  Then we
fire foo_r_trig and foo_s_trig referencing the original transition
table, which produce the first two NOTICE lines.  Then we repeat
this entire process with the newly queued triggers.  The second
invocation of the RI enforcement trigger doesn't find any rows to
delete, but it nonetheless causes queuing of a third AFTER STATEMENT
trigger, which eventually gets to run with an empty transition table.

So this is a bit annoying because we're marking the transition table
closed before the RI triggers can have the desired effects.  I wonder
if we can rejigger things so that the tables are only closed when
actually used.
        regards, tom lane

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

15 September 2017, 22:26:35

I wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> What is going on here?
>> ...
>> insert into foo values (1, null), (2, 1);
>> 
>> postgres=# delete from foo where a = 1;
>> NOTICE:  trigger = foo_r_trig, old table = (1,), depth = 1
>> NOTICE:  trigger = foo_s_trig, old table = (1,), depth = 1
>> NOTICE:  trigger = foo_r_trig, old table = (2,1), depth = 1
>> NOTICE:  trigger = foo_s_trig, old table = (2,1), depth = 1
>> NOTICE:  trigger = foo_s_trig, old table = <NULL>, depth = 1
>> DELETE 1

> OK, now that I'm a little more awake, I looked at this, and I think it's
> actually an instance of the "closed transition table" behavior you were
> asking about.

Attached is an updated patch that incorporates the ideas you suggested.
I added an extended version of this example, which has an additional
level of FK cascade happening:

insert into self_ref values (1, null), (2, 1), (3, 2);
delete from self_ref where a = 1;
NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
NOTICE:  trigger = self_ref_s_trig, old table = (1,), (2,1)
NOTICE:  trigger = self_ref_r_trig, old table = (3,2)
NOTICE:  trigger = self_ref_s_trig, old table = (3,2)

What happens here is that the outer delete queues AR triggers for
RI enforcement and self_ref_r_trig, plus an AS trigger for
self_ref_s_trig.  Then the RI enforcement trigger deletes (2,1)
and queues AR+AS triggers for that.  At this point the initial
transition table is still open, so (2,1) goes into that table,
and we look back and cancel the previous queuing of self_ref_s_trig.
Now it's time for the first firing of self_ref_r_trig, and so now
we mark the transition table closed.  Then we skip the cancelled
self_ref_s_trig call, and then it's time for the second RI enforcement
trigger to fire, which deletes (3,2) but has to put it into a new
transition table.  Again we queue AR+AS triggers, but this time we
can't cancel the preceding AS call.  Then we fire self_ref_r_trig
again (for the (2,1) row), and then fire self_ref_s_trig; both of
them see the same transition table the first self_ref_r_trig call
did.  Now it's time for the third RI enforcement trigger; it finds
nothing to delete, so it adds nothing to the second transition table,
but it does queue an AS trigger call (canceling the one added by the
second RI trigger).  Finally we have the AR call queued by the second
RI trigger, and then the AS call queued by the third RI trigger,
both looking at the second transition table.

This is pretty messy but I think it's the best we can do as long as
RI actions are intermixed with other AFTER ROW triggers.  Maybe with
Kevin's ideas about converting RI actions to be statement-level,
we could arrange for all three deletions to show up in one transition
table ... but I don't know how we cause that to happen before the
user's AFTER ROW triggers run.  In any case, nothing will look odd
unless you have AR triggers using transition tables, which seems like
a niche usage case in the first place.

I also realized that we could undo the bloat I added to
AfterTriggerSharedData by storing a pointer to the AfterTriggersTableData
rather than the tuplestores themselves.

I feel that this is probably committable, except that I still need
to look for documentation changes.

            regards, tom lane

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..c6fa445 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** CopyFrom(CopyState cstate)
*** 2432,2443 ****
      /* Triggers might need a slot as well */
      estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);

      /*
       * If there are any triggers with transition tables on the named relation,
       * we need to be prepared to capture transition tuples.
       */
      cstate->transition_capture =
!         MakeTransitionCaptureState(cstate->rel->trigdesc);

      /*
       * If the named relation is a partitioned table, initialize state for
--- 2432,2448 ----
      /* Triggers might need a slot as well */
      estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);

+     /* Prepare to catch AFTER triggers. */
+     AfterTriggerBeginQuery();
+
      /*
       * If there are any triggers with transition tables on the named relation,
       * we need to be prepared to capture transition tuples.
       */
      cstate->transition_capture =
!         MakeTransitionCaptureState(cstate->rel->trigdesc,
!                                    RelationGetRelid(cstate->rel),
!                                    CMD_INSERT);

      /*
       * If the named relation is a partitioned table, initialize state for
*************** CopyFrom(CopyState cstate)
*** 2513,2521 ****
          bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
      }

-     /* Prepare to catch AFTER triggers. */
-     AfterTriggerBeginQuery();
-
      /*
       * Check BEFORE STATEMENT insertion triggers. It's debatable whether we
       * should do this for COPY, since it's not really an "INSERT" statement as
--- 2518,2523 ----
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 269c9e1..dfe1348 100644
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
*************** CreateTrigger(CreateTrigStmt *stmt, cons
*** 234,239 ****
--- 234,244 ----
                              RelationGetRelationName(rel)),
                       errdetail("Foreign tables cannot have TRUNCATE triggers.")));

+         /*
+          * We disallow constraint triggers to protect the assumption that
+          * triggers on FKs can't be deferred.  See notes with AfterTriggers
+          * data structures, below.
+          */
          if (stmt->isconstraint)
              ereport(ERROR,
                      (errcode(ERRCODE_WRONG_OBJECT_TYPE),
*************** CreateTrigger(CreateTrigStmt *stmt, cons
*** 418,423 ****
--- 423,448 ----
                          (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                           errmsg("transition tables cannot be specified for triggers with more than one event")));

+             /*
+              * We currently don't allow column-specific triggers with
+              * transition tables.  Per spec, that seems to require
+              * accumulating separate transition tables for each combination of
+              * columns, which is a lot of work for a rather marginal feature.
+              */
+             if (stmt->columns != NIL)
+                 ereport(ERROR,
+                         (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                          errmsg("transition tables cannot be specified for triggers with column lists")));
+
+             /*
+              * We disallow constraint triggers with transition tables, to
+              * protect the assumption that such triggers can't be deferred.
+              * See notes with AfterTriggers data structures, below.
+              *
+              * Currently this is enforced by the grammar, so just Assert here.
+              */
+             Assert(!stmt->isconstraint);
+
              if (tt->isNew)
              {
                  if (!(TRIGGER_FOR_INSERT(tgtype) ||
*************** FindTriggerIncompatibleWithInheritance(T
*** 2086,2181 ****
  }

  /*
-  * Make a TransitionCaptureState object from a given TriggerDesc.  The
-  * resulting object holds the flags which control whether transition tuples
-  * are collected when tables are modified, and the tuplestores themselves.
-  * Note that we copy the flags from a parent table into this struct (rather
-  * than using each relation's TriggerDesc directly) so that we can use it to
-  * control the collection of transition tuples from child tables.
-  *
-  * If there are no triggers with transition tables configured for 'trigdesc',
-  * then return NULL.
-  *
-  * The resulting object can be passed to the ExecAR* functions.  The caller
-  * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
-  * with child tables.
-  */
- TransitionCaptureState *
- MakeTransitionCaptureState(TriggerDesc *trigdesc)
- {
-     TransitionCaptureState *state = NULL;
-
-     if (trigdesc != NULL &&
-         (trigdesc->trig_delete_old_table || trigdesc->trig_update_old_table ||
-          trigdesc->trig_update_new_table || trigdesc->trig_insert_new_table))
-     {
-         MemoryContext oldcxt;
-         ResourceOwner saveResourceOwner;
-
-         /*
-          * Normally DestroyTransitionCaptureState should be called after
-          * executing all AFTER triggers for the current statement.
-          *
-          * To handle error cleanup, TransitionCaptureState and the tuplestores
-          * it contains will live in the current [sub]transaction's memory
-          * context.  Likewise for the current resource owner, because we also
-          * want to clean up temporary files spilled to disk by the tuplestore
-          * in that scenario.  This scope is sufficient, because AFTER triggers
-          * with transition tables cannot be deferred (only constraint triggers
-          * can be deferred, and constraint triggers cannot have transition
-          * tables).  The AFTER trigger queue may contain pointers to this
-          * TransitionCaptureState, but any such entries will be processed or
-          * discarded before the end of the current [sub]transaction.
-          *
-          * If a future release allows deferred triggers with transition
-          * tables, we'll need to reconsider the scope of the
-          * TransitionCaptureState object.
-          */
-         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
-         saveResourceOwner = CurrentResourceOwner;
-
-         state = (TransitionCaptureState *)
-             palloc0(sizeof(TransitionCaptureState));
-         state->tcs_delete_old_table = trigdesc->trig_delete_old_table;
-         state->tcs_update_old_table = trigdesc->trig_update_old_table;
-         state->tcs_update_new_table = trigdesc->trig_update_new_table;
-         state->tcs_insert_new_table = trigdesc->trig_insert_new_table;
-         PG_TRY();
-         {
-             CurrentResourceOwner = CurTransactionResourceOwner;
-             if (trigdesc->trig_delete_old_table || trigdesc->trig_update_old_table)
-                 state->tcs_old_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-             if (trigdesc->trig_insert_new_table)
-                 state->tcs_insert_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-             if (trigdesc->trig_update_new_table)
-                 state->tcs_update_tuplestore = tuplestore_begin_heap(false, false, work_mem);
-         }
-         PG_CATCH();
-         {
-             CurrentResourceOwner = saveResourceOwner;
-             PG_RE_THROW();
-         }
-         PG_END_TRY();
-         CurrentResourceOwner = saveResourceOwner;
-         MemoryContextSwitchTo(oldcxt);
-     }
-
-     return state;
- }
-
- void
- DestroyTransitionCaptureState(TransitionCaptureState *tcs)
- {
-     if (tcs->tcs_insert_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_insert_tuplestore);
-     if (tcs->tcs_update_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_update_tuplestore);
-     if (tcs->tcs_old_tuplestore != NULL)
-         tuplestore_end(tcs->tcs_old_tuplestore);
-     pfree(tcs);
- }
-
- /*
   * Call a trigger function.
   *
   *        trigdata: trigger descriptor.
--- 2111,2116 ----
*************** TriggerEnabled(EState *estate, ResultRel
*** 3338,3346 ****
   * during the current transaction tree.  (BEFORE triggers are fired
   * immediately so we don't need any persistent state about them.)  The struct
   * and most of its subsidiary data are kept in TopTransactionContext; however
!  * the individual event records are kept in a separate sub-context.  This is
!  * done mainly so that it's easy to tell from a memory context dump how much
!  * space is being eaten by trigger events.
   *
   * Because the list of pending events can grow large, we go to some
   * considerable effort to minimize per-event memory consumption.  The event
--- 3273,3283 ----
   * during the current transaction tree.  (BEFORE triggers are fired
   * immediately so we don't need any persistent state about them.)  The struct
   * and most of its subsidiary data are kept in TopTransactionContext; however
!  * some data that can be discarded sooner appears in the CurTransactionContext
!  * of the relevant subtransaction.  Also, the individual event records are
!  * kept in a separate sub-context of TopTransactionContext.  This is done
!  * mainly so that it's easy to tell from a memory context dump how much space
!  * is being eaten by trigger events.
   *
   * Because the list of pending events can grow large, we go to some
   * considerable effort to minimize per-event memory consumption.  The event
*************** typedef SetConstraintStateData *SetConst
*** 3400,3405 ****
--- 3337,3349 ----
   * tuple(s).  This permits storing tuples once regardless of the number of
   * row-level triggers on a foreign table.
   *
+  * Note that we need triggers on foreign tables to be fired in exactly the
+  * order they were queued, so that the tuples come out of the tuplestore in
+  * the right order.  To ensure that, we forbid deferrable (constraint)
+  * triggers on foreign tables.  This also ensures that such triggers do not
+  * get deferred into outer trigger query levels, meaning that it's okay to
+  * destroy the tuplestore at the end of the query level.
+  *
   * Statement-level triggers always bear AFTER_TRIGGER_1CTID, though they
   * require no ctid field.  We lack the flag bit space to neatly represent that
   * distinct case, and it seems unlikely to be worth much trouble.
*************** typedef struct AfterTriggerSharedData
*** 3433,3439 ****
      Oid            ats_tgoid;        /* the trigger's ID */
      Oid            ats_relid;        /* the relation it's on */
      CommandId    ats_firing_id;    /* ID for firing cycle */
!     TransitionCaptureState *ats_transition_capture;
  } AfterTriggerSharedData;

  typedef struct AfterTriggerEventData *AfterTriggerEvent;
--- 3377,3383 ----
      Oid            ats_tgoid;        /* the trigger's ID */
      Oid            ats_relid;        /* the relation it's on */
      CommandId    ats_firing_id;    /* ID for firing cycle */
!     struct AfterTriggersTableData *ats_table;    /* transition table access */
  } AfterTriggerSharedData;

  typedef struct AfterTriggerEventData *AfterTriggerEvent;
*************** typedef struct AfterTriggerEventList
*** 3505,3510 ****
--- 3449,3462 ----
  #define for_each_event_chunk(eptr, cptr, evtlist) \
      for_each_chunk(cptr, evtlist) for_each_event(eptr, cptr)

+ /* Macros for iterating from a start point that might not be list start */
+ #define for_each_chunk_from(cptr) \
+     for (; cptr != NULL; cptr = cptr->next)
+ #define for_each_event_from(eptr, cptr) \
+     for (; \
+          (char *) eptr < (cptr)->freeptr; \
+          eptr = (AfterTriggerEvent) (((char *) eptr) + SizeofTriggerEvent(eptr)))
+

  /*
   * All per-transaction data for the AFTER TRIGGERS module.
*************** typedef struct AfterTriggerEventList
*** 3529,3588 ****
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
   *
!  * query_stack[query_depth] is a list of AFTER trigger events queued by the
!  * current query (and the query_stack entries below it are lists of trigger
!  * events queued by calling queries).  None of these are valid until the
!  * matching AfterTriggerEndQuery call occurs.  At that point we fire
!  * immediate-mode triggers, and append any deferred events to the main events
!  * list.
   *
!  * fdw_tuplestores[query_depth] is a tuplestore containing the foreign tuples
!  * needed for the current query.
   *
!  * maxquerydepth is just the allocated length of query_stack and the
!  * tuplestores.
   *
!  * state_stack is a stack of pointers to saved copies of the SET CONSTRAINTS
!  * state data; each subtransaction level that modifies that state first
   * saves a copy, which we use to restore the state if we abort.
   *
!  * events_stack is a stack of copies of the events head/tail pointers,
   * which we use to restore those values during subtransaction abort.
   *
!  * depth_stack is a stack of copies of subtransaction-start-time query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * firing_stack is a stack of copies of subtransaction-start-time
!  * firing_counter.  We use this to recognize which deferred triggers were
!  * fired (or marked for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
!  * index in these stacks.  maxtransdepth is the number of allocated entries in
!  * each stack.  (By not keeping our own stack pointer, we can avoid trouble
   * in cases where errors during subxact abort cause multiple invocations
   * of AfterTriggerEndSubXact() at the same nesting depth.)
   */
  typedef struct AfterTriggersData
  {
      CommandId    firing_counter; /* next firing ID to assign */
      SetConstraintState state;    /* the active S C state */
      AfterTriggerEventList events;    /* deferred-event list */
-     int            query_depth;    /* current query list index */
-     AfterTriggerEventList *query_stack; /* events pending from each query */
-     Tuplestorestate **fdw_tuplestores;    /* foreign tuples for one row from
-                                          * each query */
-     int            maxquerydepth;    /* allocated len of above array */
      MemoryContext event_cxt;    /* memory context for events, if any */

!     /* these fields are just for resetting at subtrans abort: */

!     SetConstraintState *state_stack;    /* stacked S C states */
!     AfterTriggerEventList *events_stack;    /* stacked list pointers */
!     int           *depth_stack;    /* stacked query_depths */
!     CommandId  *firing_stack;    /* stacked firing_counters */
!     int            maxtransdepth;    /* allocated len of above arrays */
  } AfterTriggersData;

  static AfterTriggersData afterTriggers;

  static void AfterTriggerExecute(AfterTriggerEvent event,
--- 3481,3587 ----
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
   *
!  * query_stack[query_depth] is the per-query-level data, including these fields:
   *
!  * events is a list of AFTER trigger events queued by the current query.
!  * None of these are valid until the matching AfterTriggerEndQuery call
!  * occurs.  At that point we fire immediate-mode triggers, and append any
!  * deferred events to the main events list.
   *
!  * fdw_tuplestore is a tuplestore containing the foreign-table tuples
!  * needed by events queued by the current query.  (Note: we use just one
!  * tuplestore even though more than one foreign table might be involved.
!  * This is okay because tuplestores don't really care what's in the tuples
!  * they store; but it's possible that someday it'd break.)
   *
!  * tables is a List of AfterTriggersTableData structs for target tables
!  * of the current query (see below).
!  *
!  * maxquerydepth is just the allocated length of query_stack.
!  *
!  * trans_stack holds per-subtransaction data, including these fields:
!  *
!  * state is NULL or a pointer to a saved copy of the SET CONSTRAINTS
!  * state data.  Each subtransaction level that modifies that state first
   * saves a copy, which we use to restore the state if we abort.
   *
!  * events is a copy of the events head/tail pointers,
   * which we use to restore those values during subtransaction abort.
   *
!  * query_depth is the subtransaction-start-time value of query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * firing_counter is the subtransaction-start-time value of firing_counter.
!  * We use this to recognize which deferred triggers were fired (or marked
!  * for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
!  * index in trans_stack.  maxtransdepth is the number of allocated entries in
!  * trans_stack.  (By not keeping our own stack pointer, we can avoid trouble
   * in cases where errors during subxact abort cause multiple invocations
   * of AfterTriggerEndSubXact() at the same nesting depth.)
+  *
+  * We create an AfterTriggersTableData struct for each target table of the
+  * current query, and each operation mode (INSERT/UPDATE/DELETE), that has
+  * either transition tables or AFTER STATEMENT triggers.  This is used to
+  * hold the relevant transition tables, as well as info tracking whether
+  * we already queued the AFTER STATEMENT triggers.  (We use that info to
+  * prevent, as much as possible, firing the same AFTER STATEMENT trigger
+  * more than once per statement.)  These structs, along with the transition
+  * table tuplestores, live in the (sub)transaction's CurTransactionContext.
+  * That's sufficient lifespan because we don't allow transition tables to be
+  * used by deferrable triggers, so they only need to survive until
+  * AfterTriggerEndQuery.
   */
+ typedef struct AfterTriggersQueryData AfterTriggersQueryData;
+ typedef struct AfterTriggersTransData AfterTriggersTransData;
+ typedef struct AfterTriggersTableData AfterTriggersTableData;
+
  typedef struct AfterTriggersData
  {
      CommandId    firing_counter; /* next firing ID to assign */
      SetConstraintState state;    /* the active S C state */
      AfterTriggerEventList events;    /* deferred-event list */
      MemoryContext event_cxt;    /* memory context for events, if any */

!     /* per-query-level data: */
!     AfterTriggersQueryData *query_stack;    /* array of structs shown above */
!     int            query_depth;    /* current index in above array */
!     int            maxquerydepth;    /* allocated len of above array */

!     /* per-subtransaction-level data: */
!     AfterTriggersTransData *trans_stack;    /* array of structs shown above */
!     int            maxtransdepth;    /* allocated len of above array */
  } AfterTriggersData;

+ struct AfterTriggersQueryData
+ {
+     AfterTriggerEventList events;    /* events pending from this query */
+     Tuplestorestate *fdw_tuplestore;    /* foreign tuples for said events */
+     List       *tables;            /* list of AfterTriggersTableData */
+ };
+
+ struct AfterTriggersTransData
+ {
+     /* these fields are just for resetting at subtrans abort: */
+     SetConstraintState state;    /* saved S C state, or NULL if not yet saved */
+     AfterTriggerEventList events;    /* saved list pointer */
+     int            query_depth;    /* saved query_depth */
+     CommandId    firing_counter; /* saved firing_counter */
+ };
+
+ struct AfterTriggersTableData
+ {
+     /* relid + cmdType form the lookup key for these structs: */
+     Oid            relid;            /* target table's OID */
+     CmdType        cmdType;        /* event type, CMD_INSERT/UPDATE/DELETE */
+     bool        closed;            /* true when no longer OK to add tuples */
+     bool        stmt_trig_done; /* did we already queue stmt-level triggers? */
+     AfterTriggerEventList stmt_trig_events; /* if so, saved list pointer */
+     Tuplestorestate *old_tuplestore;    /* "old" transition table, if any */
+     Tuplestorestate *new_tuplestore;    /* "new" transition table, if any */
+ };
+
  static AfterTriggersData afterTriggers;

  static void AfterTriggerExecute(AfterTriggerEvent event,
*************** static void AfterTriggerExecute(AfterTri
*** 3591,3628 ****
                      Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2,
!                     TransitionCaptureState *transition_capture);
  static SetConstraintState SetConstraintStateCreate(int numalloc);
  static SetConstraintState SetConstraintStateCopy(SetConstraintState state);
  static SetConstraintState SetConstraintStateAddItem(SetConstraintState state,
                            Oid tgoid, bool tgisdeferred);


  /*
!  * Gets a current query transition tuplestore and initializes it if necessary.
   */
  static Tuplestorestate *
! GetTriggerTransitionTuplestore(Tuplestorestate **tss)
  {
      Tuplestorestate *ret;

!     ret = tss[afterTriggers.query_depth];
      if (ret == NULL)
      {
          MemoryContext oldcxt;
          ResourceOwner saveResourceOwner;

          /*
!          * Make the tuplestore valid until end of transaction.  This is the
!          * allocation lifespan of the associated events list, but we really
           * only need it until AfterTriggerEndQuery().
           */
!         oldcxt = MemoryContextSwitchTo(TopTransactionContext);
          saveResourceOwner = CurrentResourceOwner;
          PG_TRY();
          {
!             CurrentResourceOwner = TopTransactionResourceOwner;
              ret = tuplestore_begin_heap(false, false, work_mem);
          }
          PG_CATCH();
--- 3590,3630 ----
                      Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2);
! static AfterTriggersTableData *GetAfterTriggersTableData(Oid relid,
!                           CmdType cmdType);
! static void AfterTriggerFreeQuery(AfterTriggersQueryData *qs);
  static SetConstraintState SetConstraintStateCreate(int numalloc);
  static SetConstraintState SetConstraintStateCopy(SetConstraintState state);
  static SetConstraintState SetConstraintStateAddItem(SetConstraintState state,
                            Oid tgoid, bool tgisdeferred);
+ static void cancel_prior_stmt_triggers(Oid relid, CmdType cmdType, int tgevent);


  /*
!  * Get the FDW tuplestore for the current trigger query level, creating it
!  * if necessary.
   */
  static Tuplestorestate *
! GetCurrentFDWTuplestore(void)
  {
      Tuplestorestate *ret;

!     ret = afterTriggers.query_stack[afterTriggers.query_depth].fdw_tuplestore;
      if (ret == NULL)
      {
          MemoryContext oldcxt;
          ResourceOwner saveResourceOwner;

          /*
!          * Make the tuplestore valid until end of subtransaction.  We really
           * only need it until AfterTriggerEndQuery().
           */
!         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
          saveResourceOwner = CurrentResourceOwner;
          PG_TRY();
          {
!             CurrentResourceOwner = CurTransactionResourceOwner;
              ret = tuplestore_begin_heap(false, false, work_mem);
          }
          PG_CATCH();
*************** GetTriggerTransitionTuplestore(Tuplestor
*** 3634,3640 ****
          CurrentResourceOwner = saveResourceOwner;
          MemoryContextSwitchTo(oldcxt);

!         tss[afterTriggers.query_depth] = ret;
      }

      return ret;
--- 3636,3642 ----
          CurrentResourceOwner = saveResourceOwner;
          MemoryContextSwitchTo(oldcxt);

!         afterTriggers.query_stack[afterTriggers.query_depth].fdw_tuplestore = ret;
      }

      return ret;
*************** afterTriggerAddEvent(AfterTriggerEventLi
*** 3780,3786 ****
          if (newshared->ats_tgoid == evtshared->ats_tgoid &&
              newshared->ats_relid == evtshared->ats_relid &&
              newshared->ats_event == evtshared->ats_event &&
!             newshared->ats_transition_capture == evtshared->ats_transition_capture &&
              newshared->ats_firing_id == 0)
              break;
      }
--- 3782,3788 ----
          if (newshared->ats_tgoid == evtshared->ats_tgoid &&
              newshared->ats_relid == evtshared->ats_relid &&
              newshared->ats_event == evtshared->ats_event &&
!             newshared->ats_table == evtshared->ats_table &&
              newshared->ats_firing_id == 0)
              break;
      }
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 3892,3899 ****
                      FmgrInfo *finfo, Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2,
!                     TransitionCaptureState *transition_capture)
  {
      AfterTriggerShared evtshared = GetTriggerSharedData(event);
      Oid            tgoid = evtshared->ats_tgoid;
--- 3894,3900 ----
                      FmgrInfo *finfo, Instrumentation *instr,
                      MemoryContext per_tuple_context,
                      TupleTableSlot *trig_tuple_slot1,
!                     TupleTableSlot *trig_tuple_slot2)
  {
      AfterTriggerShared evtshared = GetTriggerSharedData(event);
      Oid            tgoid = evtshared->ats_tgoid;
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 3934,3942 ****
      {
          case AFTER_TRIGGER_FDW_FETCH:
              {
!                 Tuplestorestate *fdw_tuplestore =
!                 GetTriggerTransitionTuplestore
!                 (afterTriggers.fdw_tuplestores);

                  if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
                                               trig_tuple_slot1))
--- 3935,3941 ----
      {
          case AFTER_TRIGGER_FDW_FETCH:
              {
!                 Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();

                  if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
                                               trig_tuple_slot1))
*************** AfterTriggerExecute(AfterTriggerEvent ev
*** 4006,4041 ****
      }

      /*
!      * Set up the tuplestore information.
       */
      LocTriggerData.tg_oldtable = LocTriggerData.tg_newtable = NULL;
!     if (transition_capture != NULL)
      {
          if (LocTriggerData.tg_trigger->tgoldtable)
-             LocTriggerData.tg_oldtable = transition_capture->tcs_old_tuplestore;
-         if (LocTriggerData.tg_trigger->tgnewtable)
          {
!             /*
!              * Currently a trigger with transition tables may only be defined
!              * for a single event type (here AFTER INSERT or AFTER UPDATE, but
!              * not AFTER INSERT OR ...).
!              */
!             Assert((TRIGGER_FOR_INSERT(LocTriggerData.tg_trigger->tgtype) != 0) ^
!                    (TRIGGER_FOR_UPDATE(LocTriggerData.tg_trigger->tgtype) != 0));

!             /*
!              * Show either the insert or update new tuple images, depending on
!              * which event type the trigger was registered for.  A single
!              * statement may have produced both in the case of INSERT ... ON
!              * CONFLICT ... DO UPDATE, and in that case the event determines
!              * which tuplestore the trigger sees as the NEW TABLE.
!              */
!             if (TRIGGER_FOR_INSERT(LocTriggerData.tg_trigger->tgtype))
!                 LocTriggerData.tg_newtable =
!                     transition_capture->tcs_insert_tuplestore;
!             else
!                 LocTriggerData.tg_newtable =
!                     transition_capture->tcs_update_tuplestore;
          }
      }

--- 4005,4029 ----
      }

      /*
!      * Set up the tuplestore information to let the trigger have access to
!      * transition tables.  When we first make a transition table available to
!      * a trigger, mark it "closed" so that it cannot change anymore.  If any
!      * additional events of the same type get queued in the current trigger
!      * query level, they'll go into new transition tables.
       */
      LocTriggerData.tg_oldtable = LocTriggerData.tg_newtable = NULL;
!     if (evtshared->ats_table)
      {
          if (LocTriggerData.tg_trigger->tgoldtable)
          {
!             LocTriggerData.tg_oldtable = evtshared->ats_table->old_tuplestore;
!             evtshared->ats_table->closed = true;
!         }

!         if (LocTriggerData.tg_trigger->tgnewtable)
!         {
!             LocTriggerData.tg_newtable = evtshared->ats_table->new_tuplestore;
!             evtshared->ats_table->closed = true;
          }
      }

*************** afterTriggerInvokeEvents(AfterTriggerEve
*** 4245,4252 ****
                   * won't try to re-fire it.
                   */
                  AfterTriggerExecute(event, rel, trigdesc, finfo, instr,
!                                     per_tuple_context, slot1, slot2,
!                                     evtshared->ats_transition_capture);

                  /*
                   * Mark the event as done.
--- 4233,4239 ----
                   * won't try to re-fire it.
                   */
                  AfterTriggerExecute(event, rel, trigdesc, finfo, instr,
!                                     per_tuple_context, slot1, slot2);

                  /*
                   * Mark the event as done.
*************** afterTriggerInvokeEvents(AfterTriggerEve
*** 4296,4301 ****
--- 4283,4448 ----
  }


+ /*
+  * GetAfterTriggersTableData
+  *
+  * Find or create an AfterTriggersTableData struct for the specified
+  * trigger event (relation + operation type).  Ignore existing structs
+  * marked "closed"; we don't want to put any additional tuples into them,
+  * nor change their stmt-triggers-fired state.
+  *
+  * Note: the AfterTriggersTableData list is allocated in the current
+  * (sub)transaction's CurTransactionContext.  This is OK because
+  * we don't need it to live past AfterTriggerEndQuery.
+  */
+ static AfterTriggersTableData *
+ GetAfterTriggersTableData(Oid relid, CmdType cmdType)
+ {
+     AfterTriggersTableData *table;
+     AfterTriggersQueryData *qs;
+     MemoryContext oldcxt;
+     ListCell   *lc;
+
+     /* Caller should have ensured query_depth is OK. */
+     Assert(afterTriggers.query_depth >= 0 &&
+            afterTriggers.query_depth < afterTriggers.maxquerydepth);
+     qs = &afterTriggers.query_stack[afterTriggers.query_depth];
+
+     foreach(lc, qs->tables)
+     {
+         table = (AfterTriggersTableData *) lfirst(lc);
+         if (table->relid == relid && table->cmdType == cmdType &&
+             !table->closed)
+             return table;
+     }
+
+     oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+
+     table = (AfterTriggersTableData *) palloc0(sizeof(AfterTriggersTableData));
+     table->relid = relid;
+     table->cmdType = cmdType;
+     qs->tables = lappend(qs->tables, table);
+
+     MemoryContextSwitchTo(oldcxt);
+
+     return table;
+ }
+
+
+ /*
+  * MakeTransitionCaptureState
+  *
+  * Make a TransitionCaptureState object for the given TriggerDesc, target
+  * relation, and operation type.  The TCS object holds all the state needed
+  * to decide whether to capture tuples in transition tables.
+  *
+  * If there are no triggers in 'trigdesc' that request relevant transition
+  * tables, then return NULL.
+  *
+  * The resulting object can be passed to the ExecAR* functions.  The caller
+  * should set tcs_map or tcs_original_insert_tuple as appropriate when dealing
+  * with child tables.
+  *
+  * Note that we copy the flags from a parent table into this struct (rather
+  * than subsequently using the relation's TriggerDesc directly) so that we can
+  * use it to control collection of transition tuples from child tables.
+  *
+  * Per SQL spec, all operations of the same kind (INSERT/UPDATE/DELETE)
+  * on the same table during one query should share one transition table.
+  * Therefore, the Tuplestores are owned by an AfterTriggersTableData struct
+  * looked up using the table OID + CmdType, and are merely referenced by
+  * the TransitionCaptureState objects we hand out to callers.
+  */
+ TransitionCaptureState *
+ MakeTransitionCaptureState(TriggerDesc *trigdesc, Oid relid, CmdType cmdType)
+ {
+     TransitionCaptureState *state;
+     bool        need_old,
+                 need_new;
+     AfterTriggersTableData *table;
+     MemoryContext oldcxt;
+     ResourceOwner saveResourceOwner;
+
+     if (trigdesc == NULL)
+         return NULL;
+
+     /* Detect which table(s) we need. */
+     switch (cmdType)
+     {
+         case CMD_INSERT:
+             need_old = false;
+             need_new = trigdesc->trig_insert_new_table;
+             break;
+         case CMD_UPDATE:
+             need_old = trigdesc->trig_update_old_table;
+             need_new = trigdesc->trig_update_new_table;
+             break;
+         case CMD_DELETE:
+             need_old = trigdesc->trig_delete_old_table;
+             need_new = false;
+             break;
+         default:
+             elog(ERROR, "unexpected CmdType: %d", (int) cmdType);
+             need_old = need_new = false;    /* keep compiler quiet */
+             break;
+     }
+     if (!need_old && !need_new)
+         return NULL;
+
+     /* Check state, like AfterTriggerSaveEvent. */
+     if (afterTriggers.query_depth < 0)
+         elog(ERROR, "MakeTransitionCaptureState() called outside of query");
+
+     /* Be sure we have enough space to record events at this query depth. */
+     if (afterTriggers.query_depth >= afterTriggers.maxquerydepth)
+         AfterTriggerEnlargeQueryState();
+
+     /*
+      * Find or create an AfterTriggersTableData struct to hold the
+      * tuplestore(s).  If there's a matching struct but it's marked closed,
+      * ignore it; we need a newer one.
+      *
+      * Note: the AfterTriggersTableData list, as well as the tuplestores, are
+      * allocated in the current (sub)transaction's CurTransactionContext, and
+      * the tuplestores are managed by the (sub)transaction's resource owner.
+      * This is sufficient lifespan because we do not allow triggers using
+      * transition tables to be deferrable; they will be fired during
+      * AfterTriggerEndQuery, after which it's okay to delete the data.
+      */
+     table = GetAfterTriggersTableData(relid, cmdType);
+
+     /* Now create required tuplestore(s), if we don't have them already. */
+     oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+     saveResourceOwner = CurrentResourceOwner;
+     PG_TRY();
+     {
+         CurrentResourceOwner = CurTransactionResourceOwner;
+         if (need_old && table->old_tuplestore == NULL)
+             table->old_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+         if (need_new && table->new_tuplestore == NULL)
+             table->new_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+     }
+     PG_CATCH();
+     {
+         CurrentResourceOwner = saveResourceOwner;
+         PG_RE_THROW();
+     }
+     PG_END_TRY();
+     CurrentResourceOwner = saveResourceOwner;
+     MemoryContextSwitchTo(oldcxt);
+
+     /* Now build the TransitionCaptureState struct, in caller's context */
+     state = (TransitionCaptureState *) palloc0(sizeof(TransitionCaptureState));
+     state->tcs_delete_old_table = trigdesc->trig_delete_old_table;
+     state->tcs_update_old_table = trigdesc->trig_update_old_table;
+     state->tcs_update_new_table = trigdesc->trig_update_new_table;
+     state->tcs_insert_new_table = trigdesc->trig_insert_new_table;
+     state->tcs_private = table;
+
+     return state;
+ }
+
+
  /* ----------
   * AfterTriggerBeginXact()
   *
*************** AfterTriggerBeginXact(void)
*** 4319,4332 ****
       */
      Assert(afterTriggers.state == NULL);
      Assert(afterTriggers.query_stack == NULL);
-     Assert(afterTriggers.fdw_tuplestores == NULL);
      Assert(afterTriggers.maxquerydepth == 0);
      Assert(afterTriggers.event_cxt == NULL);
      Assert(afterTriggers.events.head == NULL);
!     Assert(afterTriggers.state_stack == NULL);
!     Assert(afterTriggers.events_stack == NULL);
!     Assert(afterTriggers.depth_stack == NULL);
!     Assert(afterTriggers.firing_stack == NULL);
      Assert(afterTriggers.maxtransdepth == 0);
  }

--- 4466,4475 ----
       */
      Assert(afterTriggers.state == NULL);
      Assert(afterTriggers.query_stack == NULL);
      Assert(afterTriggers.maxquerydepth == 0);
      Assert(afterTriggers.event_cxt == NULL);
      Assert(afterTriggers.events.head == NULL);
!     Assert(afterTriggers.trans_stack == NULL);
      Assert(afterTriggers.maxtransdepth == 0);
  }

*************** AfterTriggerBeginQuery(void)
*** 4362,4370 ****
  void
  AfterTriggerEndQuery(EState *estate)
  {
-     AfterTriggerEventList *events;
-     Tuplestorestate *fdw_tuplestore;
-
      /* Must be inside a query, too */
      Assert(afterTriggers.query_depth >= 0);

--- 4505,4510 ----
*************** AfterTriggerEndQuery(EState *estate)
*** 4393,4430 ****
       * will instead fire any triggers in a dedicated query level.  Foreign key
       * enforcement triggers do add to the current query level, thanks to their
       * passing fire_triggers = false to SPI_execute_snapshot().  Other
!      * C-language triggers might do likewise.  Be careful here: firing a
!      * trigger could result in query_stack being repalloc'd, so we can't save
!      * its address across afterTriggerInvokeEvents calls.
       *
       * If we find no firable events, we don't have to increment
       * firing_counter.
       */
      for (;;)
      {
!         events = &afterTriggers.query_stack[afterTriggers.query_depth];
!         if (afterTriggerMarkEvents(events, &afterTriggers.events, true))
          {
              CommandId    firing_id = afterTriggers.firing_counter++;

              /* OK to delete the immediate events after processing them */
!             if (afterTriggerInvokeEvents(events, firing_id, estate, true))
                  break;            /* all fired */
          }
          else
              break;
      }

!     /* Release query-local storage for events, including tuplestore if any */
!     fdw_tuplestore = afterTriggers.fdw_tuplestores[afterTriggers.query_depth];
!     if (fdw_tuplestore)
      {
!         tuplestore_end(fdw_tuplestore);
!         afterTriggers.fdw_tuplestores[afterTriggers.query_depth] = NULL;
      }
-     afterTriggerFreeEventList(&afterTriggers.query_stack[afterTriggers.query_depth]);

!     afterTriggers.query_depth--;
  }


--- 4533,4618 ----
       * will instead fire any triggers in a dedicated query level.  Foreign key
       * enforcement triggers do add to the current query level, thanks to their
       * passing fire_triggers = false to SPI_execute_snapshot().  Other
!      * C-language triggers might do likewise.
       *
       * If we find no firable events, we don't have to increment
       * firing_counter.
       */
      for (;;)
      {
!         AfterTriggersQueryData *qs;
!
!         /*
!          * Firing a trigger could result in query_stack being repalloc'd, so
!          * we must recalculate qs after each afterTriggerInvokeEvents call.
!          */
!         qs = &afterTriggers.query_stack[afterTriggers.query_depth];
!
!         if (afterTriggerMarkEvents(&qs->events, &afterTriggers.events, true))
          {
              CommandId    firing_id = afterTriggers.firing_counter++;

              /* OK to delete the immediate events after processing them */
!             if (afterTriggerInvokeEvents(&qs->events, firing_id, estate, true))
                  break;            /* all fired */
          }
          else
              break;
      }

!     /* Release query-level-local storage, including tuplestores if any */
!     AfterTriggerFreeQuery(&afterTriggers.query_stack[afterTriggers.query_depth]);
!
!     afterTriggers.query_depth--;
! }
!
!
! /*
!  * AfterTriggerFreeQuery
!  *    Release subsidiary storage for a trigger query level.
!  *    This includes closing down tuplestores.
!  *    Note: it's important for this to be safe if interrupted by an error
!  *    and then called again for the same query level.
!  */
! static void
! AfterTriggerFreeQuery(AfterTriggersQueryData *qs)
! {
!     Tuplestorestate *ts;
!     List       *tables;
!     ListCell   *lc;
!
!     /* Drop the trigger events */
!     afterTriggerFreeEventList(&qs->events);
!
!     /* Drop FDW tuplestore if any */
!     ts = qs->fdw_tuplestore;
!     qs->fdw_tuplestore = NULL;
!     if (ts)
!         tuplestore_end(ts);
!
!     /* Release per-table subsidiary storage */
!     tables = qs->tables;
!     foreach(lc, tables)
      {
!         AfterTriggersTableData *table = (AfterTriggersTableData *) lfirst(lc);
!
!         ts = table->old_tuplestore;
!         table->old_tuplestore = NULL;
!         if (ts)
!             tuplestore_end(ts);
!         ts = table->new_tuplestore;
!         table->new_tuplestore = NULL;
!         if (ts)
!             tuplestore_end(ts);
      }

!     /*
!      * Now free the AfterTriggersTableData structs and list cells.  Reset list
!      * pointer first; if list_free_deep somehow gets an error, better to leak
!      * that storage than have an infinite loop.
!      */
!     qs->tables = NIL;
!     list_free_deep(tables);
  }


*************** AfterTriggerEndXact(bool isCommit)
*** 4521,4530 ****
       * large, we let the eventual reset of TopTransactionContext free the
       * memory instead of doing it here.
       */
!     afterTriggers.state_stack = NULL;
!     afterTriggers.events_stack = NULL;
!     afterTriggers.depth_stack = NULL;
!     afterTriggers.firing_stack = NULL;
      afterTriggers.maxtransdepth = 0;


--- 4709,4715 ----
       * large, we let the eventual reset of TopTransactionContext free the
       * memory instead of doing it here.
       */
!     afterTriggers.trans_stack = NULL;
      afterTriggers.maxtransdepth = 0;


*************** AfterTriggerEndXact(bool isCommit)
*** 4534,4540 ****
       * memory here.
       */
      afterTriggers.query_stack = NULL;
-     afterTriggers.fdw_tuplestores = NULL;
      afterTriggers.maxquerydepth = 0;
      afterTriggers.state = NULL;

--- 4719,4724 ----
*************** AfterTriggerBeginSubXact(void)
*** 4553,4600 ****
      int            my_level = GetCurrentTransactionNestLevel();

      /*
!      * Allocate more space in the stacks if needed.  (Note: because the
       * minimum nest level of a subtransaction is 2, we waste the first couple
!      * entries of each array; not worth the notational effort to avoid it.)
       */
      while (my_level >= afterTriggers.maxtransdepth)
      {
          if (afterTriggers.maxtransdepth == 0)
          {
!             MemoryContext old_cxt;
!
!             old_cxt = MemoryContextSwitchTo(TopTransactionContext);
!
! #define DEFTRIG_INITALLOC 8
!             afterTriggers.state_stack = (SetConstraintState *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(SetConstraintState));
!             afterTriggers.events_stack = (AfterTriggerEventList *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(AfterTriggerEventList));
!             afterTriggers.depth_stack = (int *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(int));
!             afterTriggers.firing_stack = (CommandId *)
!                 palloc(DEFTRIG_INITALLOC * sizeof(CommandId));
!             afterTriggers.maxtransdepth = DEFTRIG_INITALLOC;
!
!             MemoryContextSwitchTo(old_cxt);
          }
          else
          {
!             /* repalloc will keep the stacks in the same context */
              int            new_alloc = afterTriggers.maxtransdepth * 2;

!             afterTriggers.state_stack = (SetConstraintState *)
!                 repalloc(afterTriggers.state_stack,
!                          new_alloc * sizeof(SetConstraintState));
!             afterTriggers.events_stack = (AfterTriggerEventList *)
!                 repalloc(afterTriggers.events_stack,
!                          new_alloc * sizeof(AfterTriggerEventList));
!             afterTriggers.depth_stack = (int *)
!                 repalloc(afterTriggers.depth_stack,
!                          new_alloc * sizeof(int));
!             afterTriggers.firing_stack = (CommandId *)
!                 repalloc(afterTriggers.firing_stack,
!                          new_alloc * sizeof(CommandId));
              afterTriggers.maxtransdepth = new_alloc;
          }
      }
--- 4737,4764 ----
      int            my_level = GetCurrentTransactionNestLevel();

      /*
!      * Allocate more space in the trans_stack if needed.  (Note: because the
       * minimum nest level of a subtransaction is 2, we waste the first couple
!      * entries of the array; not worth the notational effort to avoid it.)
       */
      while (my_level >= afterTriggers.maxtransdepth)
      {
          if (afterTriggers.maxtransdepth == 0)
          {
!             /* Arbitrarily initialize for max of 8 subtransaction levels */
!             afterTriggers.trans_stack = (AfterTriggersTransData *)
!                 MemoryContextAlloc(TopTransactionContext,
!                                    8 * sizeof(AfterTriggersTransData));
!             afterTriggers.maxtransdepth = 8;
          }
          else
          {
!             /* repalloc will keep the stack in the same context */
              int            new_alloc = afterTriggers.maxtransdepth * 2;

!             afterTriggers.trans_stack = (AfterTriggersTransData *)
!                 repalloc(afterTriggers.trans_stack,
!                          new_alloc * sizeof(AfterTriggersTransData));
              afterTriggers.maxtransdepth = new_alloc;
          }
      }
*************** AfterTriggerBeginSubXact(void)
*** 4604,4613 ****
       * is not saved until/unless changed.  Likewise, we don't make a
       * per-subtransaction event context until needed.
       */
!     afterTriggers.state_stack[my_level] = NULL;
!     afterTriggers.events_stack[my_level] = afterTriggers.events;
!     afterTriggers.depth_stack[my_level] = afterTriggers.query_depth;
!     afterTriggers.firing_stack[my_level] = afterTriggers.firing_counter;
  }

  /*
--- 4768,4777 ----
       * is not saved until/unless changed.  Likewise, we don't make a
       * per-subtransaction event context until needed.
       */
!     afterTriggers.trans_stack[my_level].state = NULL;
!     afterTriggers.trans_stack[my_level].events = afterTriggers.events;
!     afterTriggers.trans_stack[my_level].query_depth = afterTriggers.query_depth;
!     afterTriggers.trans_stack[my_level].firing_counter = afterTriggers.firing_counter;
  }

  /*
*************** AfterTriggerEndSubXact(bool isCommit)
*** 4631,4700 ****
      {
          Assert(my_level < afterTriggers.maxtransdepth);
          /* If we saved a prior state, we don't need it anymore */
!         state = afterTriggers.state_stack[my_level];
          if (state != NULL)
              pfree(state);
          /* this avoids double pfree if error later: */
!         afterTriggers.state_stack[my_level] = NULL;
          Assert(afterTriggers.query_depth ==
!                afterTriggers.depth_stack[my_level]);
      }
      else
      {
          /*
           * Aborting.  It is possible subxact start failed before calling
           * AfterTriggerBeginSubXact, in which case we mustn't risk touching
!          * stack levels that aren't there.
           */
          if (my_level >= afterTriggers.maxtransdepth)
              return;

          /*
!          * Release any event lists from queries being aborted, and restore
           * query_depth to its pre-subxact value.  This assumes that a
           * subtransaction will not add events to query levels started in a
           * earlier transaction state.
           */
!         while (afterTriggers.query_depth > afterTriggers.depth_stack[my_level])
          {
              if (afterTriggers.query_depth < afterTriggers.maxquerydepth)
!             {
!                 Tuplestorestate *ts;
!
!                 ts = afterTriggers.fdw_tuplestores[afterTriggers.query_depth];
!                 if (ts)
!                 {
!                     tuplestore_end(ts);
!                     afterTriggers.fdw_tuplestores[afterTriggers.query_depth] = NULL;
!                 }
!
!                 afterTriggerFreeEventList(&afterTriggers.query_stack[afterTriggers.query_depth]);
!             }
!
              afterTriggers.query_depth--;
          }
          Assert(afterTriggers.query_depth ==
!                afterTriggers.depth_stack[my_level]);

          /*
           * Restore the global deferred-event list to its former length,
           * discarding any events queued by the subxact.
           */
          afterTriggerRestoreEventList(&afterTriggers.events,
!                                      &afterTriggers.events_stack[my_level]);

          /*
           * Restore the trigger state.  If the saved state is NULL, then this
           * subxact didn't save it, so it doesn't need restoring.
           */
!         state = afterTriggers.state_stack[my_level];
          if (state != NULL)
          {
              pfree(afterTriggers.state);
              afterTriggers.state = state;
          }
          /* this avoids double pfree if error later: */
!         afterTriggers.state_stack[my_level] = NULL;

          /*
           * Scan for any remaining deferred events that were marked DONE or IN
--- 4795,4852 ----
      {
          Assert(my_level < afterTriggers.maxtransdepth);
          /* If we saved a prior state, we don't need it anymore */
!         state = afterTriggers.trans_stack[my_level].state;
          if (state != NULL)
              pfree(state);
          /* this avoids double pfree if error later: */
!         afterTriggers.trans_stack[my_level].state = NULL;
          Assert(afterTriggers.query_depth ==
!                afterTriggers.trans_stack[my_level].query_depth);
      }
      else
      {
          /*
           * Aborting.  It is possible subxact start failed before calling
           * AfterTriggerBeginSubXact, in which case we mustn't risk touching
!          * trans_stack levels that aren't there.
           */
          if (my_level >= afterTriggers.maxtransdepth)
              return;

          /*
!          * Release query-level storage for queries being aborted, and restore
           * query_depth to its pre-subxact value.  This assumes that a
           * subtransaction will not add events to query levels started in a
           * earlier transaction state.
           */
!         while (afterTriggers.query_depth > afterTriggers.trans_stack[my_level].query_depth)
          {
              if (afterTriggers.query_depth < afterTriggers.maxquerydepth)
!                 AfterTriggerFreeQuery(&afterTriggers.query_stack[afterTriggers.query_depth]);
              afterTriggers.query_depth--;
          }
          Assert(afterTriggers.query_depth ==
!                afterTriggers.trans_stack[my_level].query_depth);

          /*
           * Restore the global deferred-event list to its former length,
           * discarding any events queued by the subxact.
           */
          afterTriggerRestoreEventList(&afterTriggers.events,
!                                      &afterTriggers.trans_stack[my_level].events);

          /*
           * Restore the trigger state.  If the saved state is NULL, then this
           * subxact didn't save it, so it doesn't need restoring.
           */
!         state = afterTriggers.trans_stack[my_level].state;
          if (state != NULL)
          {
              pfree(afterTriggers.state);
              afterTriggers.state = state;
          }
          /* this avoids double pfree if error later: */
!         afterTriggers.trans_stack[my_level].state = NULL;

          /*
           * Scan for any remaining deferred events that were marked DONE or IN
*************** AfterTriggerEndSubXact(bool isCommit)
*** 4704,4710 ****
           * (This essentially assumes that the current subxact includes all
           * subxacts started after it.)
           */
!         subxact_firing_id = afterTriggers.firing_stack[my_level];
          for_each_event_chunk(event, chunk, afterTriggers.events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);
--- 4856,4862 ----
           * (This essentially assumes that the current subxact includes all
           * subxacts started after it.)
           */
!         subxact_firing_id = afterTriggers.trans_stack[my_level].firing_counter;
          for_each_event_chunk(event, chunk, afterTriggers.events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);
*************** AfterTriggerEnlargeQueryState(void)
*** 4740,4751 ****
      {
          int            new_alloc = Max(afterTriggers.query_depth + 1, 8);

!         afterTriggers.query_stack = (AfterTriggerEventList *)
              MemoryContextAlloc(TopTransactionContext,
!                                new_alloc * sizeof(AfterTriggerEventList));
!         afterTriggers.fdw_tuplestores = (Tuplestorestate **)
!             MemoryContextAllocZero(TopTransactionContext,
!                                    new_alloc * sizeof(Tuplestorestate *));
          afterTriggers.maxquerydepth = new_alloc;
      }
      else
--- 4892,4900 ----
      {
          int            new_alloc = Max(afterTriggers.query_depth + 1, 8);

!         afterTriggers.query_stack = (AfterTriggersQueryData *)
              MemoryContextAlloc(TopTransactionContext,
!                                new_alloc * sizeof(AfterTriggersQueryData));
          afterTriggers.maxquerydepth = new_alloc;
      }
      else
*************** AfterTriggerEnlargeQueryState(void)
*** 4755,4781 ****
          int            new_alloc = Max(afterTriggers.query_depth + 1,
                                      old_alloc * 2);

!         afterTriggers.query_stack = (AfterTriggerEventList *)
              repalloc(afterTriggers.query_stack,
!                      new_alloc * sizeof(AfterTriggerEventList));
!         afterTriggers.fdw_tuplestores = (Tuplestorestate **)
!             repalloc(afterTriggers.fdw_tuplestores,
!                      new_alloc * sizeof(Tuplestorestate *));
!         /* Clear newly-allocated slots for subsequent lazy initialization. */
!         memset(afterTriggers.fdw_tuplestores + old_alloc,
!                0, (new_alloc - old_alloc) * sizeof(Tuplestorestate *));
          afterTriggers.maxquerydepth = new_alloc;
      }

!     /* Initialize new query lists to empty */
      while (init_depth < afterTriggers.maxquerydepth)
      {
!         AfterTriggerEventList *events;

!         events = &afterTriggers.query_stack[init_depth];
!         events->head = NULL;
!         events->tail = NULL;
!         events->tailfree = NULL;

          ++init_depth;
      }
--- 4904,4925 ----
          int            new_alloc = Max(afterTriggers.query_depth + 1,
                                      old_alloc * 2);

!         afterTriggers.query_stack = (AfterTriggersQueryData *)
              repalloc(afterTriggers.query_stack,
!                      new_alloc * sizeof(AfterTriggersQueryData));
          afterTriggers.maxquerydepth = new_alloc;
      }

!     /* Initialize new array entries to empty */
      while (init_depth < afterTriggers.maxquerydepth)
      {
!         AfterTriggersQueryData *qs = &afterTriggers.query_stack[init_depth];

!         qs->events.head = NULL;
!         qs->events.tail = NULL;
!         qs->events.tailfree = NULL;
!         qs->fdw_tuplestore = NULL;
!         qs->tables = NIL;

          ++init_depth;
      }
*************** AfterTriggerSetState(ConstraintsSetStmt
*** 4873,4881 ****
       * save it so it can be restored if the subtransaction aborts.
       */
      if (my_level > 1 &&
!         afterTriggers.state_stack[my_level] == NULL)
      {
!         afterTriggers.state_stack[my_level] =
              SetConstraintStateCopy(afterTriggers.state);
      }

--- 5017,5025 ----
       * save it so it can be restored if the subtransaction aborts.
       */
      if (my_level > 1 &&
!         afterTriggers.trans_stack[my_level].state == NULL)
      {
!         afterTriggers.trans_stack[my_level].state =
              SetConstraintStateCopy(afterTriggers.state);
      }

*************** AfterTriggerPendingOnRel(Oid relid)
*** 5184,5190 ****
       */
      for (depth = 0; depth <= afterTriggers.query_depth && depth < afterTriggers.maxquerydepth; depth++)
      {
!         for_each_event_chunk(event, chunk, afterTriggers.query_stack[depth])
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);

--- 5328,5334 ----
       */
      for (depth = 0; depth <= afterTriggers.query_depth && depth < afterTriggers.maxquerydepth; depth++)
      {
!         for_each_event_chunk(event, chunk, afterTriggers.query_stack[depth].events)
          {
              AfterTriggerShared evtshared = GetTriggerSharedData(event);

*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5229,5235 ****
      TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
      AfterTriggerEventData new_event;
      AfterTriggerSharedData new_shared;
!     char        relkind = relinfo->ri_RelationDesc->rd_rel->relkind;
      int            tgtype_event;
      int            tgtype_level;
      int            i;
--- 5373,5379 ----
      TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
      AfterTriggerEventData new_event;
      AfterTriggerSharedData new_shared;
!     char        relkind = rel->rd_rel->relkind;
      int            tgtype_event;
      int            tgtype_level;
      int            i;
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5266,5272 ****
              Tuplestorestate *old_tuplestore;

              Assert(oldtup != NULL);
!             old_tuplestore = transition_capture->tcs_old_tuplestore;

              if (map != NULL)
              {
--- 5410,5416 ----
              Tuplestorestate *old_tuplestore;

              Assert(oldtup != NULL);
!             old_tuplestore = transition_capture->tcs_private->old_tuplestore;

              if (map != NULL)
              {
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5284,5293 ****
              Tuplestorestate *new_tuplestore;

              Assert(newtup != NULL);
!             if (event == TRIGGER_EVENT_INSERT)
!                 new_tuplestore = transition_capture->tcs_insert_tuplestore;
!             else
!                 new_tuplestore = transition_capture->tcs_update_tuplestore;

              if (original_insert_tuple != NULL)
                  tuplestore_puttuple(new_tuplestore, original_insert_tuple);
--- 5428,5434 ----
              Tuplestorestate *new_tuplestore;

              Assert(newtup != NULL);
!             new_tuplestore = transition_capture->tcs_private->new_tuplestore;

              if (original_insert_tuple != NULL)
                  tuplestore_puttuple(new_tuplestore, original_insert_tuple);
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5316,5321 ****
--- 5457,5467 ----
       * The event code will be used both as a bitmask and an array offset, so
       * validation is important to make sure we don't walk off the edge of our
       * arrays.
+      *
+      * Also, if we're considering statement-level triggers, check whether we
+      * already queued a set of them for this event, and cancel the prior set
+      * if so.  This preserves the behavior that statement-level triggers fire
+      * just once per statement and fire after row-level triggers.
       */
      switch (event)
      {
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5334,5339 ****
--- 5480,5487 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 cancel_prior_stmt_triggers(RelationGetRelid(rel),
+                                            CMD_INSERT, event);
              }
              break;
          case TRIGGER_EVENT_DELETE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5351,5356 ****
--- 5499,5506 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 cancel_prior_stmt_triggers(RelationGetRelid(rel),
+                                            CMD_DELETE, event);
              }
              break;
          case TRIGGER_EVENT_UPDATE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5368,5373 ****
--- 5518,5525 ----
                  Assert(newtup == NULL);
                  ItemPointerSetInvalid(&(new_event.ate_ctid1));
                  ItemPointerSetInvalid(&(new_event.ate_ctid2));
+                 cancel_prior_stmt_triggers(RelationGetRelid(rel),
+                                            CMD_UPDATE, event);
              }
              break;
          case TRIGGER_EVENT_TRUNCATE:
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5407,5415 ****
          {
              if (fdw_tuplestore == NULL)
              {
!                 fdw_tuplestore =
!                     GetTriggerTransitionTuplestore
!                     (afterTriggers.fdw_tuplestores);
                  new_event.ate_flags = AFTER_TRIGGER_FDW_FETCH;
              }
              else
--- 5559,5565 ----
          {
              if (fdw_tuplestore == NULL)
              {
!                 fdw_tuplestore = GetCurrentFDWTuplestore();
                  new_event.ate_flags = AFTER_TRIGGER_FDW_FETCH;
              }
              else
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5465,5470 ****
--- 5615,5622 ----

          /*
           * Fill in event structure and add it to the current query's queue.
+          * Note we set ats_table to NULL whenever this trigger doesn't use
+          * transition tables, to improve sharability of the shared event data.
           */
          new_shared.ats_event =
              (event & TRIGGER_EVENT_OPMASK) |
*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5474,5484 ****
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         /* deferrable triggers cannot access transition data */
!         new_shared.ats_transition_capture =
!             trigger->tgdeferrable ? NULL : transition_capture;

!         afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth],
                               &new_event, &new_shared);
      }

--- 5626,5638 ----
          new_shared.ats_tgoid = trigger->tgoid;
          new_shared.ats_relid = RelationGetRelid(rel);
          new_shared.ats_firing_id = 0;
!         if ((trigger->tgoldtable || trigger->tgnewtable) &&
!             transition_capture != NULL)
!             new_shared.ats_table = transition_capture->tcs_private;
!         else
!             new_shared.ats_table = NULL;

!         afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
                               &new_event, &new_shared);
      }

*************** AfterTriggerSaveEvent(EState *estate, Re
*** 5496,5501 ****
--- 5650,5749 ----
      }
  }

+ /*
+  * If we previously queued a set of AFTER STATEMENT triggers for the given
+  * relation + operation, and they've not been fired yet, cancel them.  The
+  * caller will queue a fresh set that's after any row-level triggers that may
+  * have been queued by the current sub-statement, preserving (as much as
+  * possible) the property that AFTER ROW triggers fire before AFTER STATEMENT
+  * triggers, and that the latter only fire once.  This deals with the
+  * situation where several FK enforcement triggers sequentially queue triggers
+  * for the same table into the same trigger query level.  We can't fully
+  * prevent odd behavior though: if there are AFTER ROW triggers taking
+  * transition tables, we don't want to change the transition tables once the
+  * first such trigger has seen them.  In such a case, any additional events
+  * will result in creating new transition tables and allowing new firings of
+  * statement triggers.
+  *
+  * This also saves the current event list location so that a later invocation
+  * of this function can cheaply find the triggers we're about to queue and
+  * cancel them.
+  */
+ static void
+ cancel_prior_stmt_triggers(Oid relid, CmdType cmdType, int tgevent)
+ {
+     AfterTriggersTableData *table;
+     AfterTriggersQueryData *qs = &afterTriggers.query_stack[afterTriggers.query_depth];
+
+     /*
+      * We keep this state in the AfterTriggersTableData that also holds
+      * transition tables for the relation + operation.  In this way, if we are
+      * forced to make a new set of transition tables because more tuples get
+      * entered after we've already fired triggers, we will allow a new set of
+      * statement triggers to get queued without canceling the old ones.
+      */
+     table = GetAfterTriggersTableData(relid, cmdType);
+
+     if (table->stmt_trig_done)
+     {
+         /*
+          * We want to start scanning from the tail location that existed just
+          * before we inserted any statement triggers.  But the events list
+          * might've been entirely empty then, in which case scan from the
+          * current head.
+          */
+         AfterTriggerEvent event;
+         AfterTriggerEventChunk *chunk;
+
+         if (table->stmt_trig_events.tail)
+         {
+             chunk = table->stmt_trig_events.tail;
+             event = (AfterTriggerEvent) table->stmt_trig_events.tailfree;
+         }
+         else
+         {
+             chunk = qs->events.head;
+             event = NULL;
+         }
+
+         for_each_chunk_from(chunk)
+         {
+             if (event == NULL)
+                 event = (AfterTriggerEvent) CHUNK_DATA_START(chunk);
+             for_each_event_from(event, chunk)
+             {
+                 AfterTriggerShared evtshared = GetTriggerSharedData(event);
+
+                 /*
+                  * Exit loop when we reach events that aren't AS triggers for
+                  * the target relation.
+                  */
+                 if (evtshared->ats_relid != relid)
+                     goto done;
+                 if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) != tgevent)
+                     goto done;
+                 if (!TRIGGER_FIRED_FOR_STATEMENT(evtshared->ats_event))
+                     goto done;
+                 if (!TRIGGER_FIRED_AFTER(evtshared->ats_event))
+                     goto done;
+                 /* OK, mark it DONE */
+                 event->ate_flags &= ~AFTER_TRIGGER_IN_PROGRESS;
+                 event->ate_flags |= AFTER_TRIGGER_DONE;
+             }
+             /* signal we must reinitialize event ptr for next chunk */
+             event = NULL;
+         }
+     }
+ done:
+
+     /* In any case, save current insertion point for next time */
+     table->stmt_trig_done = true;
+     table->stmt_trig_events = qs->events;
+ }
+
+ /*
+  * SQL function pg_trigger_depth()
+  */
  Datum
  pg_trigger_depth(PG_FUNCTION_ARGS)
  {
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..b3e74aa 100644
*** a/src/backend/executor/README
--- b/src/backend/executor/README
*************** This is a sketch of control flow for ful
*** 241,251 ****
          CreateExecutorState
              creates per-query context
          switch to per-query context to run ExecInitNode
          ExecInitNode --- recursively scans plan tree
              CreateExprContext
                  creates per-tuple context
              ExecInitExpr
-         AfterTriggerBeginQuery

      ExecutorRun
          ExecProcNode --- recursively called in per-query context
--- 241,251 ----
          CreateExecutorState
              creates per-query context
          switch to per-query context to run ExecInitNode
+         AfterTriggerBeginQuery
          ExecInitNode --- recursively scans plan tree
              CreateExprContext
                  creates per-tuple context
              ExecInitExpr

      ExecutorRun
          ExecProcNode --- recursively called in per-query context
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d4..6255cc6 100644
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
*************** standard_ExecutorStart(QueryDesc *queryD
*** 252,268 ****
      estate->es_instrument = queryDesc->instrument_options;

      /*
-      * Initialize the plan state tree
-      */
-     InitPlan(queryDesc, eflags);
-
-     /*
       * Set up an AFTER-trigger statement context, unless told not to, or
       * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
       */
      if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
          AfterTriggerBeginQuery();

      MemoryContextSwitchTo(oldcontext);
  }

--- 252,268 ----
      estate->es_instrument = queryDesc->instrument_options;

      /*
       * Set up an AFTER-trigger statement context, unless told not to, or
       * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
       */
      if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
          AfterTriggerBeginQuery();

+     /*
+      * Initialize the plan state tree
+      */
+     InitPlan(queryDesc, eflags);
+
      MemoryContextSwitchTo(oldcontext);
  }

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 49586a3..845c409 100644
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
*************** ExecInsert(ModifyTableState *mtstate,
*** 343,348 ****
--- 343,351 ----
                  mtstate->mt_transition_capture->tcs_map = NULL;
              }
          }
+         if (mtstate->mt_oc_transition_capture != NULL)
+             mtstate->mt_oc_transition_capture->tcs_map =
+                 mtstate->mt_transition_tupconv_maps[leaf_part_index];

          /*
           * We might need to convert from the parent rowtype to the partition
*************** lreplace:;
*** 1158,1163 ****
--- 1161,1168 ----
      /* AFTER ROW UPDATE Triggers */
      ExecARUpdateTriggers(estate, resultRelInfo, tupleid, oldtuple, tuple,
                           recheckIndexes,
+                          mtstate->operation == CMD_INSERT ?
+                          mtstate->mt_oc_transition_capture :
                           mtstate->mt_transition_capture);

      list_free(recheckIndexes);
*************** fireASTriggers(ModifyTableState *node)
*** 1444,1450 ****
              if (node->mt_onconflict == ONCONFLICT_UPDATE)
                  ExecASUpdateTriggers(node->ps.state,
                                       resultRelInfo,
!                                      node->mt_transition_capture);
              ExecASInsertTriggers(node->ps.state, resultRelInfo,
                                   node->mt_transition_capture);
              break;
--- 1449,1455 ----
              if (node->mt_onconflict == ONCONFLICT_UPDATE)
                  ExecASUpdateTriggers(node->ps.state,
                                       resultRelInfo,
!                                      node->mt_oc_transition_capture);
              ExecASInsertTriggers(node->ps.state, resultRelInfo,
                                   node->mt_transition_capture);
              break;
*************** ExecSetupTransitionCaptureState(ModifyTa
*** 1474,1487 ****

      /* Check for transition tables on the directly targeted relation. */
      mtstate->mt_transition_capture =
!         MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);

      /*
       * If we found that we need to collect transition tuples then we may also
       * need tuple conversion maps for any children that have TupleDescs that
!      * aren't compatible with the tuplestores.
       */
!     if (mtstate->mt_transition_capture != NULL)
      {
          ResultRelInfo *resultRelInfos;
          int            numResultRelInfos;
--- 1479,1502 ----

      /* Check for transition tables on the directly targeted relation. */
      mtstate->mt_transition_capture =
!         MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc,
!                                    RelationGetRelid(targetRelInfo->ri_RelationDesc),
!                                    mtstate->operation);
!     if (mtstate->operation == CMD_INSERT &&
!         mtstate->mt_onconflict == ONCONFLICT_UPDATE)
!         mtstate->mt_oc_transition_capture =
!             MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc,
!                                        RelationGetRelid(targetRelInfo->ri_RelationDesc),
!                                        CMD_UPDATE);

      /*
       * If we found that we need to collect transition tuples then we may also
       * need tuple conversion maps for any children that have TupleDescs that
!      * aren't compatible with the tuplestores.  (We can share these maps
!      * between the regular and ON CONFLICT cases.)
       */
!     if (mtstate->mt_transition_capture != NULL ||
!         mtstate->mt_oc_transition_capture != NULL)
      {
          ResultRelInfo *resultRelInfos;
          int            numResultRelInfos;
*************** ExecSetupTransitionCaptureState(ModifyTa
*** 1522,1531 ****
          /*
           * Install the conversion map for the first plan for UPDATE and DELETE
           * operations.  It will be advanced each time we switch to the next
!          * plan.  (INSERT operations set it every time.)
           */
!         mtstate->mt_transition_capture->tcs_map =
!             mtstate->mt_transition_tupconv_maps[0];
      }
  }

--- 1537,1548 ----
          /*
           * Install the conversion map for the first plan for UPDATE and DELETE
           * operations.  It will be advanced each time we switch to the next
!          * plan.  (INSERT operations set it every time, so we need not update
!          * mtstate->mt_oc_transition_capture here.)
           */
!         if (mtstate->mt_transition_capture)
!             mtstate->mt_transition_capture->tcs_map =
!                 mtstate->mt_transition_tupconv_maps[0];
      }
  }

*************** ExecModifyTable(PlanState *pstate)
*** 1629,1641 ****
                  estate->es_result_relation_info = resultRelInfo;
                  EvalPlanQualSetPlan(&node->mt_epqstate, subplanstate->plan,
                                      node->mt_arowmarks[node->mt_whichplan]);
                  if (node->mt_transition_capture != NULL)
                  {
-                     /* Prepare to convert transition tuples from this child. */
                      Assert(node->mt_transition_tupconv_maps != NULL);
                      node->mt_transition_capture->tcs_map =
                          node->mt_transition_tupconv_maps[node->mt_whichplan];
                  }
                  continue;
              }
              else
--- 1646,1664 ----
                  estate->es_result_relation_info = resultRelInfo;
                  EvalPlanQualSetPlan(&node->mt_epqstate, subplanstate->plan,
                                      node->mt_arowmarks[node->mt_whichplan]);
+                 /* Prepare to convert transition tuples from this child. */
                  if (node->mt_transition_capture != NULL)
                  {
                      Assert(node->mt_transition_tupconv_maps != NULL);
                      node->mt_transition_capture->tcs_map =
                          node->mt_transition_tupconv_maps[node->mt_whichplan];
                  }
+                 if (node->mt_oc_transition_capture != NULL)
+                 {
+                     Assert(node->mt_transition_tupconv_maps != NULL);
+                     node->mt_oc_transition_capture->tcs_map =
+                         node->mt_transition_tupconv_maps[node->mt_whichplan];
+                 }
                  continue;
              }
              else
*************** ExecInitModifyTable(ModifyTable *node, E
*** 1934,1941 ****
          mtstate->mt_partition_tuple_slot = partition_tuple_slot;
      }

!     /* Build state for collecting transition tuples */
!     ExecSetupTransitionCaptureState(mtstate, estate);

      /*
       * Initialize any WITH CHECK OPTION constraints if needed.
--- 1957,1968 ----
          mtstate->mt_partition_tuple_slot = partition_tuple_slot;
      }

!     /*
!      * Build state for collecting transition tuples.  This requires having a
!      * valid trigger query context, so skip it in explain-only mode.
!      */
!     if (!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
!         ExecSetupTransitionCaptureState(mtstate, estate);

      /*
       * Initialize any WITH CHECK OPTION constraints if needed.
*************** ExecEndModifyTable(ModifyTableState *nod
*** 2319,2334 ****
      int            i;

      /*
-      * Free transition tables, unless this query is being run in
-      * EXEC_FLAG_SKIP_TRIGGERS mode, which means that it may have queued AFTER
-      * triggers that won't be run till later.  In that case we'll just leak
-      * the transition tables till end of (sub)transaction.
-      */
-     if (node->mt_transition_capture != NULL &&
-         !(node->ps.state->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
-         DestroyTransitionCaptureState(node->mt_transition_capture);
-
-     /*
       * Allow any FDWs to shut down
       */
      for (i = 0; i < node->mt_nplans; i++)
--- 2346,2351 ----
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index aeb363f..adbcfa1 100644
*** a/src/include/commands/trigger.h
--- b/src/include/commands/trigger.h
*************** typedef struct TriggerData
*** 43,55 ****

  /*
   * The state for capturing old and new tuples into transition tables for a
!  * single ModifyTable node.
   */
  typedef struct TransitionCaptureState
  {
      /*
       * Is there at least one trigger specifying each transition relation on
       * the relation explicitly named in the DML statement or COPY command?
       */
      bool        tcs_delete_old_table;
      bool        tcs_update_old_table;
--- 43,63 ----

  /*
   * The state for capturing old and new tuples into transition tables for a
!  * single ModifyTable node (or other operation source, e.g. copy.c).
!  *
!  * This is per-caller to avoid conflicts in setting tcs_map or
!  * tcs_original_insert_tuple.  Note, however, that the pointed-to
!  * private data may be shared across multiple callers.
   */
+ struct AfterTriggersTableData;    /* private in trigger.c */
+
  typedef struct TransitionCaptureState
  {
      /*
       * Is there at least one trigger specifying each transition relation on
       * the relation explicitly named in the DML statement or COPY command?
+      * Note: in current usage, these flags could be part of the private state,
+      * but it seems possibly useful to let callers see them.
       */
      bool        tcs_delete_old_table;
      bool        tcs_update_old_table;
*************** typedef struct TransitionCaptureState
*** 60,66 ****
       * For UPDATE and DELETE, AfterTriggerSaveEvent may need to convert the
       * new and old tuples from a child table's format to the format of the
       * relation named in a query so that it is compatible with the transition
!      * tuplestores.
       */
      TupleConversionMap *tcs_map;

--- 68,74 ----
       * For UPDATE and DELETE, AfterTriggerSaveEvent may need to convert the
       * new and old tuples from a child table's format to the format of the
       * relation named in a query so that it is compatible with the transition
!      * tuplestores.  The caller must store the conversion map here if so.
       */
      TupleConversionMap *tcs_map;

*************** typedef struct TransitionCaptureState
*** 74,90 ****
      HeapTuple    tcs_original_insert_tuple;

      /*
!      * The tuplestores backing the transition tables.  We use separate
!      * tuplestores for INSERT and UPDATE, because INSERT ... ON CONFLICT ...
!      * DO UPDATE causes INSERT and UPDATE triggers to fire and needs a way to
!      * keep track of the new tuple images resulting from the two cases
!      * separately.  We only need a single old image tuplestore, because there
!      * is no statement that can both update and delete at the same time.
       */
!     Tuplestorestate *tcs_old_tuplestore;    /* for DELETE and UPDATE old
!                                              * images */
!     Tuplestorestate *tcs_insert_tuplestore; /* for INSERT new images */
!     Tuplestorestate *tcs_update_tuplestore; /* for UPDATE new images */
  } TransitionCaptureState;

  /*
--- 82,90 ----
      HeapTuple    tcs_original_insert_tuple;

      /*
!      * Private data including the tuplestore(s) into which to insert tuples.
       */
!     struct AfterTriggersTableData *tcs_private;
  } TransitionCaptureState;

  /*
*************** extern void RelationBuildTriggers(Relati
*** 174,181 ****
  extern TriggerDesc *CopyTriggerDesc(TriggerDesc *trigdesc);

  extern const char *FindTriggerIncompatibleWithInheritance(TriggerDesc *trigdesc);
! extern TransitionCaptureState *MakeTransitionCaptureState(TriggerDesc *trigdesc);
! extern void DestroyTransitionCaptureState(TransitionCaptureState *tcs);

  extern void FreeTriggerDesc(TriggerDesc *trigdesc);

--- 174,182 ----
  extern TriggerDesc *CopyTriggerDesc(TriggerDesc *trigdesc);

  extern const char *FindTriggerIncompatibleWithInheritance(TriggerDesc *trigdesc);
!
! extern TransitionCaptureState *MakeTransitionCaptureState(TriggerDesc *trigdesc,
!                            Oid relid, CmdType cmdType);

  extern void FreeTriggerDesc(TriggerDesc *trigdesc);

diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..c6d3021 100644
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct ModifyTableState
*** 983,989 ****
      /* Per partition tuple conversion map */
      TupleTableSlot *mt_partition_tuple_slot;
      struct TransitionCaptureState *mt_transition_capture;
!     /* controls transition table population */
      TupleConversionMap **mt_transition_tupconv_maps;
      /* Per plan/partition tuple conversion */
  } ModifyTableState;
--- 983,991 ----
      /* Per partition tuple conversion map */
      TupleTableSlot *mt_partition_tuple_slot;
      struct TransitionCaptureState *mt_transition_capture;
!     /* controls transition table population for specified operation */
!     struct TransitionCaptureState *mt_oc_transition_capture;
!     /* controls transition table population for INSERT...ON CONFLICT UPDATE */
      TupleConversionMap **mt_transition_tupconv_maps;
      /* Per plan/partition tuple conversion */
  } ModifyTableState;
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 620fac1..5b6007a 100644
*** a/src/test/regress/expected/triggers.out
--- b/src/test/regress/expected/triggers.out
*************** with wcte as (insert into table1 values
*** 2217,2222 ****
--- 2217,2239 ----
    insert into table2 values ('hello world');
  NOTICE:  trigger = table2_trig, new table = ("hello world")
  NOTICE:  trigger = table1_trig, new table = (42)
+ with wcte as (insert into table1 values (43))
+   insert into table1 values (44);
+ NOTICE:  trigger = table1_trig, new table = (43), (44)
+ select * from table1;
+  a
+ ----
+  42
+  44
+  43
+ (3 rows)
+
+ select * from table2;
+       a
+ -------------
+  hello world
+ (1 row)
+
  drop table table1;
  drop table table2;
  --
*************** create trigger my_table_multievent_trig
*** 2256,2261 ****
--- 2273,2286 ----
    after insert or update on my_table referencing new table as new_table
    for each statement execute procedure dump_insert();
  ERROR:  transition tables cannot be specified for triggers with more than one event
+ --
+ -- Verify that you can't create a trigger with transition tables with
+ -- a column list.
+ --
+ create trigger my_table_col_update_trig
+   after update of b on my_table referencing new table as new_table
+   for each statement execute procedure dump_insert();
+ ERROR:  transition tables cannot be specified for triggers with column lists
  drop table my_table;
  --
  -- Test firing of triggers with transition tables by foreign key cascades
*************** select * from trig_table;
*** 2299,2306 ****
  (6 rows)

  delete from refd_table where length(b) = 3;
! NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b")
! NOTICE:  trigger = trig_table_delete_trig, old table = (11,"one a"), (11,"one b")
  select * from trig_table;
   a |    b
  ---+---------
--- 2324,2330 ----
  (6 rows)

  delete from refd_table where length(b) = 3;
! NOTICE:  trigger = trig_table_delete_trig, old table = (2,"two a"), (2,"two b"), (11,"one a"), (11,"one b")
  select * from trig_table;
   a |    b
  ---+---------
*************** select * from trig_table;
*** 2309,2314 ****
--- 2333,2357 ----
  (2 rows)

  drop table refd_table, trig_table;
+ --
+ -- self-referential FKs are even more fun
+ --
+ create table self_ref (a int primary key,
+                        b int references self_ref(a) on delete cascade);
+ create trigger self_ref_r_trig
+   after delete on self_ref referencing old table as old_table
+   for each row execute procedure dump_delete();
+ create trigger self_ref_s_trig
+   after delete on self_ref referencing old table as old_table
+   for each statement execute procedure dump_delete();
+ insert into self_ref values (1, null), (2, 1), (3, 2);
+ delete from self_ref where a = 1;
+ NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
+ NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
+ NOTICE:  trigger = self_ref_s_trig, old table = (1,), (2,1)
+ NOTICE:  trigger = self_ref_r_trig, old table = (3,2)
+ NOTICE:  trigger = self_ref_s_trig, old table = (3,2)
+ drop table self_ref;
  -- cleanup
  drop function dump_insert();
  drop function dump_update();
diff --git a/src/test/regress/sql/triggers.sql b/src/test/regress/sql/triggers.sql
index c6deb56..6c7e1c0 100644
*** a/src/test/regress/sql/triggers.sql
--- b/src/test/regress/sql/triggers.sql
*************** create trigger table2_trig
*** 1729,1734 ****
--- 1729,1740 ----
  with wcte as (insert into table1 values (42))
    insert into table2 values ('hello world');

+ with wcte as (insert into table1 values (43))
+   insert into table1 values (44);
+
+ select * from table1;
+ select * from table2;
+
  drop table table1;
  drop table table2;

*************** create trigger my_table_multievent_trig
*** 1769,1774 ****
--- 1775,1789 ----
    after insert or update on my_table referencing new table as new_table
    for each statement execute procedure dump_insert();

+ --
+ -- Verify that you can't create a trigger with transition tables with
+ -- a column list.
+ --
+
+ create trigger my_table_col_update_trig
+   after update of b on my_table referencing new table as new_table
+   for each statement execute procedure dump_insert();
+
  drop table my_table;

  --
*************** select * from trig_table;
*** 1812,1817 ****
--- 1827,1852 ----

  drop table refd_table, trig_table;

+ --
+ -- self-referential FKs are even more fun
+ --
+
+ create table self_ref (a int primary key,
+                        b int references self_ref(a) on delete cascade);
+
+ create trigger self_ref_r_trig
+   after delete on self_ref referencing old table as old_table
+   for each row execute procedure dump_delete();
+ create trigger self_ref_s_trig
+   after delete on self_ref referencing old table as old_table
+   for each statement execute procedure dump_delete();
+
+ insert into self_ref values (1, null), (2, 1), (3, 2);
+
+ delete from self_ref where a = 1;
+
+ drop table self_ref;
+
  -- cleanup
  drop function dump_insert();
  drop function dump_update();

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Thomas Munro

Date:

16 September 2017, 04:45:27

On Sat, Sep 16, 2017 at 7:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Attached is an updated patch that incorporates the ideas you suggested.

I was imagining that you would just need to keep a back pointer to the
last queued event for the same (relation, command), since that's the
only one you'd ever need to consider cancelling, and then no scanning
would be needed.  I am probably missing something.

> I added an extended version of this example, which has an additional
> level of FK cascade happening:
>
> insert into self_ref values (1, null), (2, 1), (3, 2);
> delete from self_ref where a = 1;
> NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
> NOTICE:  trigger = self_ref_r_trig, old table = (1,), (2,1)
> NOTICE:  trigger = self_ref_s_trig, old table = (1,), (2,1)
> NOTICE:  trigger = self_ref_r_trig, old table = (3,2)
> NOTICE:  trigger = self_ref_s_trig, old table = (3,2)
>
> What happens here is that the outer delete queues AR triggers for
> RI enforcement and self_ref_r_trig, plus an AS trigger for
> self_ref_s_trig.  Then the RI enforcement trigger deletes (2,1)
> and queues AR+AS triggers for that.  At this point the initial
> transition table is still open, so (2,1) goes into that table,
> and we look back and cancel the previous queuing of self_ref_s_trig.
> Now it's time for the first firing of self_ref_r_trig, and so now
> we mark the transition table closed.  Then we skip the cancelled
> self_ref_s_trig call, and then it's time for the second RI enforcement
> trigger to fire, which deletes (3,2) but has to put it into a new
> transition table.  Again we queue AR+AS triggers, but this time we
> can't cancel the preceding AS call.  Then we fire self_ref_r_trig
> again (for the (2,1) row), and then fire self_ref_s_trig; both of
> them see the same transition table the first self_ref_r_trig call
> did.  Now it's time for the third RI enforcement trigger; it finds
> nothing to delete, so it adds nothing to the second transition table,
> but it does queue an AS trigger call (canceling the one added by the
> second RI trigger).  Finally we have the AR call queued by the second
> RI trigger, and then the AS call queued by the third RI trigger,
> both looking at the second transition table.
>
> This is pretty messy but I think it's the best we can do as long as
> RI actions are intermixed with other AFTER ROW triggers.  Maybe with
> Kevin's ideas about converting RI actions to be statement-level,
> we could arrange for all three deletions to show up in one transition
> table ... but I don't know how we cause that to happen before the
> user's AFTER ROW triggers run.  In any case, nothing will look odd
> unless you have AR triggers using transition tables, which seems like
> a niche usage case in the first place.

It does seem like an inconsistency that it would be good to fix, but I
don't immediately see how to make that happen with the current design.
It would be interesting to know what DB2 does here in terms of trigger
execution contexts and transition tables when you have a chain of 2, 3
and 4 foreign key referential actions.

Is it worth adding a test with an extra level of chaining in the self_ref case?

Is it worth adding tests for SET NULL and SET DEFAULT, to exercise the
complete set of referential actions?

> I also realized that we could undo the bloat I added to
> AfterTriggerSharedData by storing a pointer to the AfterTriggersTableData
> rather than the tuplestores themselves.

Makes sense.

> I feel that this is probably committable, except that I still need
> to look for documentation changes.

+1

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14808: V10-beta4, backend abort

From

Tom Lane

Date:

16 September 2017, 06:38:20

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sat, Sep 16, 2017 at 7:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Attached is an updated patch that incorporates the ideas you suggested.

> I was imagining that you would just need to keep a back pointer to the
> last queued event for the same (relation, command), since that's the
> only one you'd ever need to consider cancelling, and then no scanning
> would be needed.  I am probably missing something.

There could be more than one after-statement trigger, no?

>> This is pretty messy but I think it's the best we can do as long as
>> RI actions are intermixed with other AFTER ROW triggers.

> It does seem like an inconsistency that it would be good to fix, but I
> don't immediately see how to make that happen with the current design.
> It would be interesting to know what DB2 does here in terms of trigger
> execution contexts and transition tables when you have a chain of 2, 3
> and 4 foreign key referential actions.

> Is it worth adding a test with an extra level of chaining in the self_ref case?

Would it show anything not shown by the three-level case?

> Is it worth adding tests for SET NULL and SET DEFAULT, to exercise the
> complete set of referential actions?

I think they're all about the same as far as this is concerned.
        regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs