Thread: tableam scan-API patch broke foreign key validation
It seems that the fire-the-triggers code path in
validateForeignKeyConstraint isn't being exercised; at least, that's
what coverage.postgresql.org says right now, and I'm afraid that may
have been true for quite some time.  The attached regression-test
addition causes it to be exercised, and guess what: it blows up real
good.

This is a slightly adapted version of the test Hadi proposed in
https://postgr.es/m/CAK=1=WonwcuN_0KiZwQO3SQxse41jZ5hOJRpFCvZ3qa8n9cssw@mail.gmail.com
Since he didn't mention anything about core dumps or assertion
failures, one assumes that it did work as of the version he was
testing against.

What it looks like to me is that because of this hunk in c2fe139c2:

@@ -8962,7 +8981,8 @@ validateForeignKeyConstraint(char *conname,
 	trigdata.type = T_TriggerData;
 	trigdata.tg_event = TRIGGER_EVENT_INSERT | TRIGGER_EVENT_ROW;
 	trigdata.tg_relation = rel;
-	trigdata.tg_trigtuple = tuple;
+	trigdata.tg_trigtuple = ExecFetchSlotHeapTuple(slot, true, NULL);
+	trigdata.tg_trigslot = slot;
 	trigdata.tg_newtuple = NULL;
 	trigdata.tg_trigger = &trig;

validateForeignKeyConstraint asks ExecFetchSlotHeapTuple to materialize
the tuple, which causes it to no longer be associated with a buffer,
which causes heapam_tuple_satisfies_snapshot to be very unhappy.

I can make the case not crash by s/true/false/ in the above call,
but I wonder whether that's an appropriate fix.  It seems rather
fragile that things work like this.

I plan to go ahead and commit Hadi's fix with that change included
(as below), but I wonder whether anything else needs to be revisited.
			regards, tom lane

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e842f91..31fe40a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4459,13 +4459,8 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 	{
 		AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 
-		/*
-		 * Foreign tables have no storage, nor do partitioned tables and
-		 * indexes.
-		 */
-		if (tab->relkind == RELKIND_FOREIGN_TABLE ||
-			tab->relkind == RELKIND_PARTITIONED_TABLE ||
-			tab->relkind == RELKIND_PARTITIONED_INDEX)
+		/* Relations without storage may be ignored here */
+		if (!RELKIND_HAS_STORAGE(tab->relkind))
 			continue;
 
 		/*
@@ -4645,6 +4640,10 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 			Relation	rel = NULL;
 			ListCell   *lcon;
 
+			/* Relations without storage may be ignored here too */
+			if (!RELKIND_HAS_STORAGE(tab->relkind))
+				continue;
+
 			foreach(lcon, tab->constraints)
 			{
 				NewConstraint *con = lfirst(lcon);
@@ -9647,7 +9646,7 @@ validateForeignKeyConstraint(char *conname,
 	trigdata.type = T_TriggerData;
 	trigdata.tg_event = TRIGGER_EVENT_INSERT | TRIGGER_EVENT_ROW;
 	trigdata.tg_relation = rel;
-	trigdata.tg_trigtuple = ExecFetchSlotHeapTuple(slot, true, NULL);
+	trigdata.tg_trigtuple = ExecFetchSlotHeapTuple(slot, false, NULL);
 	trigdata.tg_trigslot = slot;
 	trigdata.tg_newtuple = NULL;
 	trigdata.tg_trigger = &trig;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index 620eb43..e62b220 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -1895,6 +1895,30 @@ INSERT INTO fk_notpartitioned_pk VALUES (1600, 601), (1600, 1601);
 ALTER TABLE fk_partitioned_fk ATTACH PARTITION fk_partitioned_fk_2
   FOR VALUES IN (1600);
 -- leave these tables around intentionally
+-- test the case when the referenced table is owned by a different user
+create role regress_other_partitioned_fk_owner;
+grant references on fk_notpartitioned_pk to regress_other_partitioned_fk_owner;
+set role regress_other_partitioned_fk_owner;
+create table other_partitioned_fk(a int, b int) partition by list (a);
+create table other_partitioned_fk_1 partition of other_partitioned_fk
+  for values in (2048);
+insert into other_partitioned_fk values (2048, 4096);
+-- this should fail
+alter table other_partitioned_fk add foreign key (a, b)
+  references fk_notpartitioned_pk(a, b);
+ERROR:  insert or update on table "other_partitioned_fk_1" violates foreign key constraint "other_partitioned_fk_a_b_fkey"
+DETAIL:  Key (a, b)=(2048, 4096) is not present in table "fk_notpartitioned_pk".
+-- add the missing key and retry
+reset role;
+insert into fk_notpartitioned_pk (a, b) values (2048, 4096);
+set role regress_other_partitioned_fk_owner;
+alter table other_partitioned_fk add foreign key (a, b)
+  references fk_notpartitioned_pk(a, b);
+-- clean up
+drop table other_partitioned_fk;
+reset role;
+revoke all on fk_notpartitioned_pk from regress_other_partitioned_fk_owner;
+drop role regress_other_partitioned_fk_owner;
 -- Test creating a constraint at the parent that already exists in partitions.
 -- There should be no duplicated constraints, and attempts to drop the
 -- constraint in partitions should raise appropriate errors.
diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sql/foreign_key.sql
index 1a86850..090bd09 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -1365,6 +1365,29 @@ ALTER TABLE fk_partitioned_fk ATTACH PARTITION fk_partitioned_fk_2
 -- leave these tables around intentionally
 
+-- test the case when the referenced table is owned by a different user
+create role regress_other_partitioned_fk_owner;
+grant references on fk_notpartitioned_pk to regress_other_partitioned_fk_owner;
+set role regress_other_partitioned_fk_owner;
+create table other_partitioned_fk(a int, b int) partition by list (a);
+create table other_partitioned_fk_1 partition of other_partitioned_fk
+  for values in (2048);
+insert into other_partitioned_fk values (2048, 4096);
+-- this should fail
+alter table other_partitioned_fk add foreign key (a, b)
+  references fk_notpartitioned_pk(a, b);
+-- add the missing key and retry
+reset role;
+insert into fk_notpartitioned_pk (a, b) values (2048, 4096);
+set role regress_other_partitioned_fk_owner;
+alter table other_partitioned_fk add foreign key (a, b)
+  references fk_notpartitioned_pk(a, b);
+-- clean up
+drop table other_partitioned_fk;
+reset role;
+revoke all on fk_notpartitioned_pk from regress_other_partitioned_fk_owner;
+drop role regress_other_partitioned_fk_owner;
+
 -- Test creating a constraint at the parent that already exists in partitions.
 -- There should be no duplicated constraints, and attempts to drop the
 -- constraint in partitions should raise appropriate errors.
Hi,

On April 6, 2019 11:07:55 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>It seems that the fire-the-triggers code path in
>validateForeignKeyConstraint isn't being exercised; at least, that's
>what coverage.postgresql.org says right now, and I'm afraid that may
>have been true for quite some time.  The attached regression-test
>addition causes it to be exercised, and guess what: it blows up real
>good.
>
>This is a slightly adapted version of the test Hadi proposed in
>https://postgr.es/m/CAK=1=WonwcuN_0KiZwQO3SQxse41jZ5hOJRpFCvZ3qa8n9cssw@mail.gmail.com
>Since he didn't mention anything about core dumps or assertion
>failures, one assumes that it did work as of the version he was
>testing against.
>
>What it looks like to me is that because of this hunk in c2fe139c2:
>
>@@ -8962,7 +8981,8 @@ validateForeignKeyConstraint(char *conname,
> 	trigdata.type = T_TriggerData;
> 	trigdata.tg_event = TRIGGER_EVENT_INSERT | TRIGGER_EVENT_ROW;
> 	trigdata.tg_relation = rel;
>-	trigdata.tg_trigtuple = tuple;
>+	trigdata.tg_trigtuple = ExecFetchSlotHeapTuple(slot, true, NULL);
>+	trigdata.tg_trigslot = slot;
> 	trigdata.tg_newtuple = NULL;
> 	trigdata.tg_trigger = &trig;
>
>validateForeignKeyConstraint asks ExecFetchSlotHeapTuple to materialize
>the tuple, which causes it to no longer be associated with a buffer,
>which causes heapam_tuple_satisfies_snapshot to be very unhappy.
>
>I can make the case not crash by s/true/false/ in the above call,
>but I wonder whether that's an appropriate fix.  It seems rather
>fragile that things work like this.
>
>I plan to go ahead and commit Hadi's fix with that change included
>(as below), but I wonder whether anything else needs to be revisited.

I posted pretty much that patch nearby, with some other questions. Was waiting for David to respond.... Let me dig that out.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On April 6, 2019 11:07:55 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I plan to go ahead and commit Hadi's fix with that change included
>> (as below), but I wonder whether anything else needs to be revisited.

> I posted pretty much that patch nearby, with some other questions. Was
> waiting for David to respond.... Let me dig that out.

Ah.  Would you rather I wait till you push yours?

			regards, tom lane
Hi,

On 2019-04-06 14:13:29 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On April 6, 2019 11:07:55 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I plan to go ahead and commit Hadi's fix with that change included
> >> (as below), but I wonder whether anything else needs to be revisited.
>
> > I posted pretty much that patch nearby, with some other questions. Was
> > waiting for David to respond.... Let me dig that out.

The relevant thread is:
https://www.postgresql.org/message-id/20190325180405.jytoehuzkeozggxx%40alap3.anarazel.de

> Ah.  Would you rather I wait till you push yours?

Yours looks good to me, so go ahead.  I think we need a bit more than
that, but that can easily be committed separately.  Wonder if you have
an opinion on:

> I've also noticed that we should free the tuple - that doesn't matter
> for heap, but it sure does for other callers. But uh, is it actually ok
> to validate an entire table's worth of foreign keys without a memory
> context reset? I.e. shouldn't we have a memory context that we reset
> after each iteration?
>
> Also, why's there no CHECK_FOR_INTERRUPTS()? heap has some internally on
> a page level, but that doesn't seem all that granular?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> The relevant thread is:
> https://www.postgresql.org/message-id/20190325180405.jytoehuzkeozggxx%40alap3.anarazel.de

Yeah, I just found that --- would have seen it sooner if David had
not elected to make it a new thread.

> Wonder if you have an opinion on:
>> I've also noticed that we should free the tuple - that doesn't matter
>> for heap, but it sure does for other callers.

Why should this code need to free anything?  That'd be the responsibility
of the slot code, no?

>> But uh, is it actually ok
>> to validate an entire table's worth of foreign keys without a memory
>> context reset? I.e. shouldn't we have a memory context that we reset
>> after each iteration?
>> Also, why's there no CHECK_FOR_INTERRUPTS()? heap has some internally on
>> a page level, but that doesn't seem all that granular?

These are good questions.  Just eyeing RI_FKey_check(), I think
that it might not have any significant leaks because most of the work
is done in an SPI context, but obviously that's pretty fragile.

The memory-context stuff in your WIP patch seems wrong, btw; the second
or later iteration of the loop would trash oldcxt.

But clearly we need a test case here.  I'll adjust Hadi's example
so that there's more than one tuple to check, and push it.

			regards, tom lane
Hi,

On 2019-04-06 14:34:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The relevant thread is:
> > https://www.postgresql.org/message-id/20190325180405.jytoehuzkeozggxx%40alap3.anarazel.de
>
> Yeah, I just found that --- would have seen it sooner if David had
> not elected to make it a new thread.
>
> > Wonder if you have an opinion on:
>
> >> I've also noticed that we should free the tuple - that doesn't matter
> >> for heap, but it sure does for other callers.
>
> Why should this code need to free anything?  That'd be the responsibility
> of the slot code, no?

Well, not really. If a slot doesn't hold heap tuples internally,
ExecFetchSlotHeapTuple() will return a fresh heap tuple (but signal so
by setting *should_free = true if not NULL). That's why I was saying it
doesn't matter for heap (where the slot just holds a heap tuple
internally), but it does matter for other AMs.

> >> But uh, is it actually ok
> >> to validate an entire table's worth of foreign keys without a memory
> >> context reset? I.e. shouldn't we have a memory context that we reset
> >> after each iteration?
> >> Also, why's there no CHECK_FOR_INTERRUPTS()? heap has some internally on
> >> a page level, but that doesn't seem all that granular?
>
> These are good questions.  Just eyeing RI_FKey_check(), I think
> that it might not have any significant leaks because most of the work
> is done in an SPI context, but obviously that's pretty fragile.

Yea. And especially with potentially needing to free the tuple as above,
using an explicit context seems more robust to me.

> But clearly we need a test case here.  I'll adjust Hadi's example
> so that there's more than one tuple to check, and push it.

Cool.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-06 14:34:34 -0400, Tom Lane wrote:
>> Why should this code need to free anything?  That'd be the responsibility
>> of the slot code, no?

> Well, not really. If a slot doesn't hold heap tuples internally,
> ExecFetchSlotHeapTuple() will return a fresh heap tuple (but signal so
> by setting *should_free = true if not NULL).

Ah, got it: ignoring should_free is indeed a potential issue here.

>> But clearly we need a test case here.  I'll adjust Hadi's example
>> so that there's more than one tuple to check, and push it.

> Cool.

Sounds like a plan.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-06 14:34:34 -0400, Tom Lane wrote:
>> These are good questions.  Just eyeing RI_FKey_check(), I think
>> that it might not have any significant leaks because most of the work
>> is done in an SPI context, but obviously that's pretty fragile.

> Yea. And especially with potentially needing to free the tuple as above,
> using an explicit context seems more robust to me.

Adjusting the committed test case to process lots of tuples shows that
indeed there is no leak in HEAD.  But I concur that using a local
context here would be more future-proof.

BTW, I just stumbled across a different bug in v11 by trying to run
HEAD's test script on it ... not sure if that's a known problem or not:

(gdb) f 3
#3  0x000000000063949c in ExecSetupPartitionTupleRouting (
    mtstate=<value optimized out>, rel=0x7f343e4f4170) at execPartition.c:201
201             Assert(update_rri_index == num_update_rri);
(gdb) bt
#0  0x00000037b6232495 in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000037b6233c75 in abort () at abort.c:92
#2  0x00000000008a1e6d in ExceptionalCondition (
    conditionName=<value optimized out>, errorType=<value optimized out>,
    fileName=<value optimized out>, lineNumber=<value optimized out>)
    at assert.c:54
#3  0x000000000063949c in ExecSetupPartitionTupleRouting (
    mtstate=<value optimized out>, rel=0x7f343e4f4170) at execPartition.c:201
#4  0x00000000006595cb in ExecInitModifyTable (node=0x26a0680,
    estate=0x26a1fa8, eflags=16) at nodeModifyTable.c:2343
#5  0x000000000063b179 in ExecInitNode (node=0x26a0680, estate=0x26a1fa8,
    eflags=16) at execProcnode.c:174
#6  0x0000000000635824 in InitPlan (queryDesc=<value optimized out>,
    eflags=16) at execMain.c:1046
#7  standard_ExecutorStart (queryDesc=<value optimized out>, eflags=16)
    at execMain.c:265
#8  0x000000000066c332 in _SPI_pquery (plan=0x269fb38, paramLI=0x26b9048,
    snapshot=0x0, crosscheck_snapshot=0x0, read_only=false,
    fire_triggers=false, tcount=0) at spi.c:2482
#9  _SPI_execute_plan (plan=0x269fb38, paramLI=0x26b9048, snapshot=0x0,
    crosscheck_snapshot=0x0, read_only=false, fire_triggers=false, tcount=0)
    at spi.c:2246
#10 0x000000000066c7b6 in SPI_execute_snapshot (plan=0x269fb38,
    Values=<value optimized out>, Nulls=<value optimized out>, snapshot=0x0,
    crosscheck_snapshot=0x0, read_only=false, fire_triggers=false, tcount=0)
    at spi.c:562
#11 0x0000000000838842 in ri_PerformCheck (riinfo=0x268d9c0,
    qkey=0x7fff8996f700, qplan=0x269fb38, fk_rel=0x7f343e4f4170,
    pk_rel=0x7f343e4f49d0, old_tuple=0x7fff8996fd40, new_tuple=0x0,
    detectNewRows=true, expect_OK=9) at ri_triggers.c:2606
#12 0x0000000000839971 in ri_setnull (trigdata=<value optimized out>)
    at ri_triggers.c:1400
#13 0x000000000060b0a8 in ExecCallTriggerFunc (trigdata=0x7fff8996fce0,
    tgindx=0, finfo=0x26c55e0, instr=0x0, per_tuple_context=0x26aeef0)
    at trigger.c:2412
#14 0x000000000060b5e5 in AfterTriggerExecute (events=0x260b8d8, firing_id=1,
    estate=0x26c5098, delete_ok=false) at trigger.c:4359
#15 afterTriggerInvokeEvents (events=0x260b8d8, firing_id=1, estate=0x26c5098,
    delete_ok=false) at trigger.c:4550
#16 0x000000000060cb82 in AfterTriggerEndQuery (estate=0x26c5098)
    at trigger.c:4860
#17 0x0000000000636871 in standard_ExecutorFinish (queryDesc=0x2717b88)
    at execMain.c:439
#18 0x0000000000795bf8 in ProcessQuery (plan=<value optimized out>,
    sourceText=0x258ef98 "DELETE FROM fk_notpartitioned_pk;", params=0x0,
    queryEnv=0x0, dest=<value optimized out>,
    completionTag=0x7fff89970030 "DELETE 2") at pquery.c:205
#19 0x0000000000795e35 in PortalRunMulti (portal=0x25f4518, isTopLevel=true,
    setHoldSnapshot=false, dest=0x271bb90, altdest=0x271bb90,
    completionTag=0x7fff89970030 "DELETE 2") at pquery.c:1286
#20 0x0000000000796610 in PortalRun (portal=0x25f4518,
    count=9223372036854775807, isTopLevel=true, run_once=true,
    dest=0x271bb90, altdest=0x271bb90,
    completionTag=0x7fff89970030 "DELETE 2") at pquery.c:799
#21 0x00000000007929dd in exec_simple_query (
    query_string=0x258ef98 "DELETE FROM fk_notpartitioned_pk;")
    at postgres.c:1145
#22 0x0000000000793f34 in PostgresMain (argc=<value optimized out>,
    argv=<value optimized out>, dbname=0x25b8a18 "regression",
    username=<value optimized out>) at postgres.c:4182

			regards, tom lane
Hi,

On 2019-04-06 14:43:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-04-06 14:34:34 -0400, Tom Lane wrote:
> >> Why should this code need to free anything?  That'd be the responsibility
> >> of the slot code, no?
>
> > Well, not really. If a slot doesn't hold heap tuples internally,
> > ExecFetchSlotHeapTuple() will return a fresh heap tuple (but signal so
> > by setting *should_free = true if not NULL).
>
> Ah, got it: ignoring should_free is indeed a potential issue here.

I've pushed a revised version of my earlier patch adding a memory
context that's reset after each tuple.

Greetings,

Andres Freund
On 2019-Apr-06, Tom Lane wrote:

> BTW, I just stumbled across a different bug in v11 by trying to run
> HEAD's test script on it ... not sure if that's a known problem or not:
>
> (gdb) f 3
> #3  0x000000000063949c in ExecSetupPartitionTupleRouting (
>     mtstate=<value optimized out>, rel=0x7f343e4f4170) at execPartition.c:201
> 201             Assert(update_rri_index == num_update_rri);
> (gdb) bt
> #0  0x00000037b6232495 in raise (sig=6)
>     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64

For closure: this was re-reported as
https://www.postgresql.org/message-id/20710.1554582479@sss.pgh.pa.us
and the fix committed as 10e3991fad8a300ed268878ae30c96074628c1e1.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services