Thread: BUG #17607: Server process crashes when PLpgSQL function raises error in subtransaction

BUG #17607: Server process crashes when PLpgSQL function raises error in subtransaction

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17607
Logged by:          Thomas Mc Kay
Email address:      thomas.d.mckay@gmail.com
PostgreSQL version: 13.8
Operating system:   Debian in docker
Description:

Hello,

I've encountered a bug while doing some SQL migrations on a PostgreSQL 13.8
server. My initial use-case is a Django app running migrations, a test setup
and a test, but that code was more than 10K lines of DDL, so I've reduced
the code to a minimal reproducible example (as best I could).

I also tested the same script in 14.0 and 14.5 and in both, the bug is
fixed. I haven't seen anything related to it in the changelog for 14.0, so I
don't know whether the bug was fixed purposefully and forgotten in the
changelog or if it was fixed by accident, maybe as a corollary of another
bugfix/feature. I also don't know if this is something the devs would want
to fix in the 13.x line. Either way, I figured it might make a good
regression test case for 14.x.

Environment (Docker):
- PostgreSQL 13.8 (Debian 13.8-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit


Test code:

```
-- Migration 1
CREATE TABLE "tbl" ("id" serial NOT NULL PRIMARY KEY);

-- Migration 2
ALTER TABLE "tbl" ADD COLUMN "value" int DEFAULT 0 NOT NULL;

CREATE FUNCTION on_tbl_parent_id_change_fn() RETURNS TRIGGER AS
$on_tbl_parent_id_change$
    BEGIN
        RAISE EXCEPTION 'BOOM !';
    END
$on_tbl_parent_id_change$ LANGUAGE PLpgSQL VOLATILE;

CREATE TRIGGER on_tbl_parent_id_update
    AFTER UPDATE ON tbl
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE PROCEDURE on_tbl_parent_id_change_fn();

-- Test setup
INSERT INTO "tbl" ("value") VALUES (2);

-- Test
BEGIN;
SAVEPOINT svp;
UPDATE tbl SET value=1;  -- Fails here
ROLLBACK TO SAVEPOINT svp;
ROLLBACK;
```

Output (verbose):

```
-- CREATE TABLE
-- ALTER TABLE
-- CREATE FUNCTION
-- CREATE TRIGGER
-- INSERT 0 1
-- BEGIN
-- SAVEPOINT
-- WARNING:  01000: AbortSubTransaction while in ABORT state
-- LOCATION:  AbortSubTransaction, xact.c:4999
-- WARNING:  01000: AbortSubTransaction while in ABORT state
-- LOCATION:  AbortSubTransaction, xact.c:4999
-- WARNING:  01000: AbortSubTransaction while in ABORT state
-- LOCATION:  AbortSubTransaction, xact.c:4999
-- ERROR:  P0001: BOOM !
-- CONTEXT:  PL/pgSQL function on_tbl_parent_id_change_fn() line 3 at
RAISE
-- LOCATION:  exec_stmt_raise, pl_exec.c:3854
-- ERROR:  XX000: tupdesc reference 0x7fc02873fbc8 is not owned by resource
owner SubTransaction
-- LOCATION:  ResourceOwnerForgetTupleDesc, resowner.c:1188
-- ERROR:  XX000: tupdesc reference 0x7fc02873fbc8 is not owned by resource
owner SubTransaction
-- LOCATION:  ResourceOwnerForgetTupleDesc, resowner.c:1188
-- ERROR:  XX000: tupdesc reference 0x7fc02873fbc8 is not owned by resource
owner SubTransaction
-- LOCATION:  ResourceOwnerForgetTupleDesc, resowner.c:1188
-- ERROR:  XX000: tupdesc reference 0x7fc02873fbc8 is not owned by resource
owner SubTransaction
-- LOCATION:  ResourceOwnerForgetTupleDesc, resowner.c:1188
-- PANIC:  XX000: ERRORDATA_STACK_SIZE exceeded
-- LOCATION:  errstart, elog.c:359
-- server closed the connection unexpectedly
--         This probably means the server terminated abnormally
--         before or while processing the request.
-- The connection to the server was lost. Attempting reset: Failed.
-- You are currently not connected to a database.
-- You are currently not connected to a database.
```

Expected output:
```
CREATE TABLE
ALTER TABLE
CREATE FUNCTION
CREATE TRIGGER
INSERT 0 1
BEGIN
SAVEPOINT
ERROR:  BOOM !
CONTEXT:  PL/pgSQL function on_tbl_parent_id_change_fn() line 3 at RAISE
ROLLBACK
ROLLBACK
```

The bug disappears if any of the following is true:
- value is added to the table definition and not with ALTER TABLE.
- value is declared without a default.
- `REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows` is removed from
the trigger.
- a `VACUUM FULL tbl` is done between the trigger creation and the insert.

I don't know what to make of it, though.

Cheers,

Thomas


PG Bug reporting form <noreply@postgresql.org> writes:
> I've encountered a bug while doing some SQL migrations on a PostgreSQL 13.8
> server. My initial use-case is a Django app running migrations, a test setup
> and a test, but that code was more than 10K lines of DDL, so I've reduced
> the code to a minimal reproducible example (as best I could).

Thanks for the minimal reproducer!  I confirm that this problem is visible
in the v12 and v13 branches, but not before or after.  On the master
branch, it was introduced at

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL_14_BR [25936fd46] 2021-02-27 18:09:15 -0300
    Fix use-after-free bug with AfterTriggersTableData.storeslot

and repaired, seemingly accidentally, by

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [c5b7ba4e6] 2021-04-06 15:57:11 -0400
    Postpone some stuff out of ExecInitModifyTable.

25936fd46 was back-patched into v12 and v13, which is why they show
the bug.  c5b7ba4e6 was not back-patched, both because it'd be way too
invasive and because we weren't aware that it was fixing anything.
I'm not really sure that it *did* fix anything --- perhaps there are
variants of this example that still fail?

cc'ing Alvaro for comment.

            regards, tom lane



I wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> I've encountered a bug while doing some SQL migrations on a PostgreSQL 13.8
>> server. My initial use-case is a Django app running migrations, a test setup
>> and a test, but that code was more than 10K lines of DDL, so I've reduced
>> the code to a minimal reproducible example (as best I could).

> Thanks for the minimal reproducer!  I confirm that this problem is visible
> in the v12 and v13 branches, but not before or after.

I got a chance to look at this today, and what I find is that we're making
a tuple slot that will be freed at end of subtransaction, but it has a
refcount on a refcounted tuple descriptor that belongs to the Portal's
resowner.  While trying to free the slot, we get a complaint because the
subtransaction's resowner isn't holding the appropriate tupdesc reference.

It seems to me there are probably other hazards here, because the tupdesc
could possibly also go away before the slot does.  I think what we ought
to do is copy the tupdesc, so that we have a non-refcounted descriptor
that we know has exactly the right lifespan.  As attached.

Not sure about whether to bother with a test case.  The submitted
reproducer doesn't work in >= v14, and even in v12/v13 it seems awfully
corner-case-ish.  (I find for instance that you need the ALTER TABLE
step rather than just making the table in one step, and I'm a tad
baffled as to why.)  In any case, I've got zero faith that there are
not other ways to trigger the same problem, so I think we need to apply
this up to HEAD.

            regards, tom lane

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 59b38d00ed..b9a012512d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -4340,11 +4340,13 @@ GetAfterTriggersStoreSlot(AfterTriggersTableData *table,
         MemoryContext    oldcxt;

         /*
-         * We only need this slot only until AfterTriggerEndQuery, but making
-         * it last till end-of-subxact is good enough.  It'll be freed by
-         * AfterTriggerFreeQuery().
+         * We need this slot only until AfterTriggerEndQuery, but making it
+         * last till end-of-subxact is good enough.  It'll be freed by
+         * AfterTriggerFreeQuery().  However, the passed-in tupdesc might have
+         * a different lifespan, so we'd better make a copy of that.
          */
         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+        tupdesc = CreateTupleDescCopy(tupdesc);
         table->storeslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
         MemoryContextSwitchTo(oldcxt);
     }

I wrote:
> It seems to me there are probably other hazards here, because the tupdesc
> could possibly also go away before the slot does.  I think what we ought
> to do is copy the tupdesc, so that we have a non-refcounted descriptor
> that we know has exactly the right lifespan.  As attached.

Actually, after looking around some more, I realize that there is a
second mistake in 25936fd46: it ignored the comment on
AfterTriggerFreeQuery that

 *    Note: it's important for this to be safe if interrupted by an error
 *    and then called again for the same query level.

This is the reason why we end up in a recursive error and PANIC:
we keep trying to free the tupdesc again after the previous error.
If I fix that but omit the CreateTupleDescCopy step, then the
reproducer behaves much more sanely:

psql:bug17607.sql:29: WARNING:  AbortSubTransaction while in ABORT state
psql:bug17607.sql:29: ERROR:  BOOM !
CONTEXT:  PL/pgSQL function on_tbl_parent_id_change_fn() line 3 at RAISE
ERROR:  tupdesc reference 0x7fdef236a1b8 is not owned by resource owner SubTransaction
ROLLBACK
ROLLBACK

So what we actually need here is more like the attached.

            regards, tom lane

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 59b38d00ed..1b1193997b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -4340,11 +4340,13 @@ GetAfterTriggersStoreSlot(AfterTriggersTableData *table,
         MemoryContext    oldcxt;

         /*
-         * We only need this slot only until AfterTriggerEndQuery, but making
-         * it last till end-of-subxact is good enough.  It'll be freed by
-         * AfterTriggerFreeQuery().
+         * We need this slot only until AfterTriggerEndQuery, but making it
+         * last till end-of-subxact is good enough.  It'll be freed by
+         * AfterTriggerFreeQuery().  However, the passed-in tupdesc might have
+         * a different lifespan, so we'd better make a copy of that.
          */
         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+        tupdesc = CreateTupleDescCopy(tupdesc);
         table->storeslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
         MemoryContextSwitchTo(oldcxt);
     }
@@ -4640,7 +4642,12 @@ AfterTriggerFreeQuery(AfterTriggersQueryData *qs)
         if (ts)
             tuplestore_end(ts);
         if (table->storeslot)
-            ExecDropSingleTupleTableSlot(table->storeslot);
+        {
+            TupleTableSlot *slot = table->storeslot;
+
+            table->storeslot = NULL;
+            ExecDropSingleTupleTableSlot(slot);
+        }
     }

     /*

... and after yet more poking at it, I see why c5b7ba4e6 masked
the problem: it added an optimization such that we don't use the
storeslot at all unless a tuple mapping conversion is required.
That led me to the test case shown in the attached, which exhibits
this symptom in all branches since v12.

I had been planning to let this wait until after 15rc1 wrap,
but now that I have evidence that the bug is still live in v15,
I'm going to go ahead and push.  Both components of the fix
seem like extremely safe changes, even for the day before wrap.

            regards, tom lane

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 0ec8d84a1b..0fcf090f22 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -4775,11 +4775,13 @@ GetAfterTriggersStoreSlot(AfterTriggersTableData *table,
         MemoryContext oldcxt;

         /*
-         * We only need this slot only until AfterTriggerEndQuery, but making
-         * it last till end-of-subxact is good enough.  It'll be freed by
-         * AfterTriggerFreeQuery().
+         * We need this slot only until AfterTriggerEndQuery, but making it
+         * last till end-of-subxact is good enough.  It'll be freed by
+         * AfterTriggerFreeQuery().  However, the passed-in tupdesc might have
+         * a different lifespan, so we'd better make a copy of that.
          */
         oldcxt = MemoryContextSwitchTo(CurTransactionContext);
+        tupdesc = CreateTupleDescCopy(tupdesc);
         table->storeslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsVirtual);
         MemoryContextSwitchTo(oldcxt);
     }
@@ -5098,7 +5100,12 @@ AfterTriggerFreeQuery(AfterTriggersQueryData *qs)
         if (ts)
             tuplestore_end(ts);
         if (table->storeslot)
-            ExecDropSingleTupleTableSlot(table->storeslot);
+        {
+            TupleTableSlot *slot = table->storeslot;
+
+            table->storeslot = NULL;
+            ExecDropSingleTupleTableSlot(slot);
+        }
     }

     /*
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 641f5e67c4..8b8eadd181 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -3468,7 +3468,7 @@ insert into convslot_test_parent(col1) values ('1');
 insert into convslot_test_child(col1) values ('1');
 insert into convslot_test_parent(col1) values ('3');
 insert into convslot_test_child(col1) values ('3');
-create or replace function trigger_function1()
+create function convslot_trig1()
 returns trigger
 language plpgsql
 AS $$
@@ -3478,7 +3478,7 @@ raise notice 'trigger = %, old_table = %',
           (select string_agg(old_table::text, ', ' order by col1) from old_table);
 return null;
 end; $$;
-create or replace function trigger_function2()
+create function convslot_trig2()
 returns trigger
 language plpgsql
 AS $$
@@ -3490,10 +3490,10 @@ return null;
 end; $$;
 create trigger but_trigger after update on convslot_test_child
 referencing new table as new_table
-for each statement execute function trigger_function2();
+for each statement execute function convslot_trig2();
 update convslot_test_parent set col1 = col1 || '1';
 NOTICE:  trigger = but_trigger, new table = (11,tutu), (31,tutu)
-create or replace function trigger_function3()
+create function convslot_trig3()
 returns trigger
 language plpgsql
 AS $$
@@ -3506,16 +3506,42 @@ return null;
 end; $$;
 create trigger but_trigger2 after update on convslot_test_child
 referencing old table as old_table new table as new_table
-for each statement execute function trigger_function3();
+for each statement execute function convslot_trig3();
 update convslot_test_parent set col1 = col1 || '1';
 NOTICE:  trigger = but_trigger, new table = (111,tutu), (311,tutu)
 NOTICE:  trigger = but_trigger2, old_table = (11,tutu), (31,tutu), new table = (111,tutu), (311,tutu)
 create trigger bdt_trigger after delete on convslot_test_child
 referencing old table as old_table
-for each statement execute function trigger_function1();
+for each statement execute function convslot_trig1();
 delete from convslot_test_parent;
 NOTICE:  trigger = bdt_trigger, old_table = (111,tutu), (311,tutu)
 drop table convslot_test_child, convslot_test_parent;
+drop function convslot_trig1();
+drop function convslot_trig2();
+drop function convslot_trig3();
+-- Bug #17607: variant of above in which trigger function raises an error;
+-- we don't see any ill effects unless trigger tuple requires mapping
+create table convslot_test_parent (id int primary key, val int)
+partition by range (id);
+create table convslot_test_part (val int, id int not null);
+alter table convslot_test_parent
+  attach partition convslot_test_part for values from (1) to (1000);
+create function convslot_trig4() returns trigger as
+$$begin raise exception 'BOOM!'; end$$ language plpgsql;
+create trigger convslot_test_parent_update
+    after update on convslot_test_parent
+    referencing old table as old_rows new table as new_rows
+    for each statement execute procedure convslot_trig4();
+insert into convslot_test_parent (id, val) values (1, 2);
+begin;
+savepoint svp;
+update convslot_test_parent set val = 3;  -- error expected
+ERROR:  BOOM!
+CONTEXT:  PL/pgSQL function convslot_trig4() line 1 at RAISE
+rollback to savepoint svp;
+rollback;
+drop table convslot_test_parent;
+drop function convslot_trig4();
 -- Test trigger renaming on partitioned tables
 create table grandparent (id int, primary key (id)) partition by range (id);
 create table middle partition of grandparent for values from (1) to (10)
diff --git a/src/test/regress/sql/triggers.sql b/src/test/regress/sql/triggers.sql
index 9d21bbe2a9..4d8504fb24 100644
--- a/src/test/regress/sql/triggers.sql
+++ b/src/test/regress/sql/triggers.sql
@@ -2615,7 +2615,7 @@ insert into convslot_test_child(col1) values ('1');
 insert into convslot_test_parent(col1) values ('3');
 insert into convslot_test_child(col1) values ('3');

-create or replace function trigger_function1()
+create function convslot_trig1()
 returns trigger
 language plpgsql
 AS $$
@@ -2626,7 +2626,7 @@ raise notice 'trigger = %, old_table = %',
 return null;
 end; $$;

-create or replace function trigger_function2()
+create function convslot_trig2()
 returns trigger
 language plpgsql
 AS $$
@@ -2639,11 +2639,11 @@ end; $$;

 create trigger but_trigger after update on convslot_test_child
 referencing new table as new_table
-for each statement execute function trigger_function2();
+for each statement execute function convslot_trig2();

 update convslot_test_parent set col1 = col1 || '1';

-create or replace function trigger_function3()
+create function convslot_trig3()
 returns trigger
 language plpgsql
 AS $$
@@ -2657,15 +2657,48 @@ end; $$;

 create trigger but_trigger2 after update on convslot_test_child
 referencing old table as old_table new table as new_table
-for each statement execute function trigger_function3();
+for each statement execute function convslot_trig3();
 update convslot_test_parent set col1 = col1 || '1';

 create trigger bdt_trigger after delete on convslot_test_child
 referencing old table as old_table
-for each statement execute function trigger_function1();
+for each statement execute function convslot_trig1();
 delete from convslot_test_parent;

 drop table convslot_test_child, convslot_test_parent;
+drop function convslot_trig1();
+drop function convslot_trig2();
+drop function convslot_trig3();
+
+-- Bug #17607: variant of above in which trigger function raises an error;
+-- we don't see any ill effects unless trigger tuple requires mapping
+
+create table convslot_test_parent (id int primary key, val int)
+partition by range (id);
+
+create table convslot_test_part (val int, id int not null);
+
+alter table convslot_test_parent
+  attach partition convslot_test_part for values from (1) to (1000);
+
+create function convslot_trig4() returns trigger as
+$$begin raise exception 'BOOM!'; end$$ language plpgsql;
+
+create trigger convslot_test_parent_update
+    after update on convslot_test_parent
+    referencing old table as old_rows new table as new_rows
+    for each statement execute procedure convslot_trig4();
+
+insert into convslot_test_parent (id, val) values (1, 2);
+
+begin;
+savepoint svp;
+update convslot_test_parent set val = 3;  -- error expected
+rollback to savepoint svp;
+rollback;
+
+drop table convslot_test_parent;
+drop function convslot_trig4();

 -- Test trigger renaming on partitioned tables
 create table grandparent (id int, primary key (id)) partition by range (id);