Thread: Repeated crashes in GENERATED ... AS IDENTITY tests

Repeated crashes in GENERATED ... AS IDENTITY tests

From
Tom Lane
Date:
woodlouse just showed a failure that looked a bit familiar:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=woodlouse&dt=2018-04-17%2016%3A42%3A43

I trawled the buildfarm logs for similar failures, and there are several:

  sysname  |      snapshot       | stage |
l                                                                              

-----------+---------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------
 whelk     | 2018-03-28 05:41:30 | Check | 2018-03-28 07:52:26.987 CEST [4928:6] DETAIL:  Failed process was running:
ALTERTABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;\r 
 conchuela | 2018-04-03 10:00:00 | Check | 2018-04-03 12:11:47.520 CEST [5ac35311.5248:5] DETAIL:  Failed process was
running:ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY; 
 gharial   | 2018-04-10 00:32:08 | Check | 2018-04-09 18:59:39.075 MDT [5acc0b05.32ea:5] DETAIL:  Failed process was
running:ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY; 
 whelk     | 2018-04-14 19:41:24 | Check | 2018-04-14 21:52:07.700 CEST [3760:6] DETAIL:  Failed process was running:
ALTERTABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;\r 
 woodlouse | 2018-04-17 16:42:43 | Check | 2018-04-17 18:54:04.311 CEST [864:6] DETAIL:  Failed process was running:
ALTERTABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY;\r 
(5 rows)

The test case that's failing, in identity.sql, has been there since
early February; the lack of any crashes till more recently suggests
that something committed in mid-to-late March broke it.

I have no idea what's going on there, but I think this is clearly
something we need to consider an open item for PG11.  I'll put it
on the list.

            regards, tom lane


Re: Repeated crashes in GENERATED ... AS IDENTITY tests

From
Alvaro Herrera
Date:
Tom Lane wrote:

> The test case that's failing, in identity.sql, has been there since
> early February; the lack of any crashes till more recently suggests
> that something committed in mid-to-late March broke it.
> 
> I have no idea what's going on there, but I think this is clearly
> something we need to consider an open item for PG11.  I'll put it
> on the list.

Grumble.

Of the machines you listed, conchuela has gdb installed, which gives us
a nice backtrace of the crash, pasted below, which seems to blame event
triggers.  Of the tests in the same parallel group as identity, the test
fast_default seems to be the only one with an event trigger and was
added on March 28th by 16828d5c0273.  That combination seems to explain
why the crashes appeared late ...

I can't look further into this now -- maybe next week if nobody has
beaten me into it.  My guess is that the identity stuff is not setting
state like event triggers expect.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2018-04-03%2010%3A00%3A00

================== stack trace: pgsql.build/src/test/regress/tmp_check/data/postgres.core ==================
[New <main task>]
Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
#0  EventTriggerTableRewrite (parsetree=parsetree@entry=0x80a852670, tableOid=49861, reason=2) at event_trigger.c:974
974        currentEventTriggerState->table_rewrite_oid = tableOid;
#0  EventTriggerTableRewrite (parsetree=parsetree@entry=0x80a852670, tableOid=49861, reason=2) at event_trigger.c:974
#1  0x0000000000601bfc in ATRewriteTables (lockmode=8, wqueue=0x7fffffffe3d8, parsetree=0x80a852670) at
tablecmds.c:4326
#2  ATController (parsetree=parsetree@entry=0x80a852670, rel=<optimized out>, cmds=<optimized out>, recurse=<optimized
out>,lockmode=lockmode@entry=8) at tablecmds.c:3536
 
#3  0x00000000006021af in AlterTable (relid=relid@entry=49861, lockmode=lockmode@entry=8, stmt=stmt@entry=0x80a852670)
attablecmds.c:3206
 
#4  0x000000000077df9f in ProcessUtilitySlow (pstate=pstate@entry=0x80a852118, pstmt=pstmt@entry=0x801647110,
queryString=queryString@entry=0x801646118"ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY;",
context=context@entry=PROCESS_UTILITY_TOPLEVEL,params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,
completionTag=0x7fffffffe830"", dest=0x801647400) at utility.c:1116
 
#5  0x000000000077c632 in standard_ProcessUtility (pstmt=0x801647110, queryString=0x801646118 "ALTER TABLE itest13 ADD
COLUMNb int GENERATED BY DEFAULT AS IDENTITY;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x801647400,completionTag=0x7fffffffe830 "") at utility.c:919
 
#6  0x0000000000779e05 in PortalRunUtility (portal=0x80a60c118, pstmt=0x801647110, isTopLevel=<optimized out>,
setHoldSnapshot=<optimizedout>, dest=0x801647400, completionTag=0x7fffffffe830 "") at pquery.c:1183
 
#7  0x000000000077a925 in PortalRunMulti (portal=portal@entry=0x80a60c118, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x801647400, altdest=altdest@entry=0x801647400,
completionTag=completionTag@entry=0x7fffffffe830"") at pquery.c:1336
 
#8  0x000000000077b5be in PortalRun (portal=portal@entry=0x80a60c118, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0x801647400,
altdest=altdest@entry=0x801647400,completionTag=0x7fffffffe830 "") at pquery.c:804
 
#9  0x00000000007771e5 in exec_simple_query (query_string=0x801646118 "ALTER TABLE itest13 ADD COLUMN b int GENERATED
BYDEFAULT AS IDENTITY;") at postgres.c:1121
 
#10 0x000000000077869b in PostgresMain (argc=<optimized out>, argv=argv@entry=0x8012f03e8, dbname=0x8012f0190
"regression",username=<optimized out>) at postgres.c:4149
 
#11 0x0000000000480398 in BackendRun (port=0x8016c0280) at postmaster.c:4409
#12 BackendStartup (port=0x8016c0280) at postmaster.c:4081
#13 ServerLoop () at postmaster.c:1754
#14 0x0000000000701009 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x7ffffffff5d0) at postmaster.c:1362
#15 0x0000000000481978 in main (argc=8, argv=0x7ffffffff5d0) at main.c:228

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Repeated crashes in GENERATED ... AS IDENTITY tests

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Of the machines you listed, conchuela has gdb installed, which gives us
> a nice backtrace of the crash, pasted below, which seems to blame event
> triggers.  Of the tests in the same parallel group as identity, the test
> fast_default seems to be the only one with an event trigger and was
> added on March 28th by 16828d5c0273.  That combination seems to explain
> why the crashes appeared late ...

Hm.  Our past policy has been that tests involving event triggers must not
run in parallel with any other test, precisely because the event triggers
interact with anything else happening at the same time, and will
inevitably create irreproducible test failures.

I'm inclined to say that whether or not there's a bug here (and there
well may be, it doesn't seem like a crash is a good thing), this is
bad test design and we need to change it.

            regards, tom lane


I wrote:
> I'm inclined to say that whether or not there's a bug here (and there
> well may be, it doesn't seem like a crash is a good thing), this is
> bad test design and we need to change it.

So my suspicion was aroused by the fact that, unlike almost every
other function in event_trigger.c, EventTriggerTableRewrite does
not bother to verify that currentEventTriggerState isn't null before
dereferencing it.  I soon found out how to reproduce the crash
observed in the buildfarm:

1. In session 1, set a breakpoint at ATController, and do

CREATE TABLE itest13 (a int);
ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY;

2. After the ALTER reaches the breakpoint, start a second session and
create an event trigger.  The one used by fast_default will do fine:

CREATE FUNCTION log_rewrite() RETURNS event_trigger
LANGUAGE plpgsql as
$func$

declare
   this_schema text;
begin
    select into this_schema relnamespace::regnamespace::text
    from pg_class
    where oid = pg_event_trigger_table_rewrite_oid();
    if this_schema = 'fast_default'
    then
        RAISE NOTICE 'rewriting table % for reason %',
          pg_event_trigger_table_rewrite_oid()::regclass,
          pg_event_trigger_table_rewrite_reason();
    end if;
end;
$func$;

CREATE EVENT TRIGGER has_volatile_rewrite
                  ON table_rewrite
   EXECUTE PROCEDURE log_rewrite();

3. Allow session 1 to continue from its breakpoint.  Kaboom!

The reason of course is that EventTriggerCommonSetup finds the
now-relevant event trigger and returns a nonempty list, but our
currently active command hasn't initialized any event trigger
support because there were no event triggers when it started.
So whoever thought they could omit the standard guard check
here was full of it.

Hence, two questions:

* Should EventTriggerTableRewrite do

    if (!currentEventTriggerState ||
        currentEventTriggerState->commandCollectionInhibited)
        return;

like most of the other functions, or should it just check for null
currentEventTriggerState?

* Of the three other callers of EventTriggerCommonSetup, only one
has such a guard now.  But aren't EventTriggerDDLCommandStart and
EventTriggerDDLCommandEnd equally vulnerable to this type of race
condition?  What should they check exactly?  Perhaps
EventTriggerCommonSetup itself should check this?

The point that running fast_default in parallel with a pile of other
regression tests is damfool test design still stands, but I have to
credit it with having exposed a bug.

            regards, tom lane


Re: Repeated crashes in GENERATED ... AS IDENTITY tests

From
Andrew Gierth
Date:
>>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

 Alvaro> I can't look further into this now -- maybe next week if nobody
 Alvaro> has beaten me into it. My guess is that the identity stuff is
 Alvaro> not setting state like event triggers expect.

I think this is unrelated to the identity stuff. I suspect a race
condition: the event trigger code is implicitly assuming that the event
trigger cache can't pick up new entries in between
EventTriggerStartCompleteQuery and actually firing a table rewrite
operation.

If initially there are no event triggers that require
currentEventTriggerState, then EventTriggerStartCompleteQuery won't
allocate one. But that's done before actually executing the command, so
if we accept invalidations on the cache afterwards, a newly added
trigger might show up, so by the time we reach EventTriggerTableRewrite
we think we have work to do, but currentEventTriggerState is still null.

I haven't been able to reproduce it yet, so this is conjecture, but I
think it's correct. It being a relatively narrow race explains the
relative rarity of the failure.

For the other event trigger types, SQL_DROP checks for
currentEventTriggerState's validity, so it'll simply fail to run the
trigger if one wasn't already present at command start; DDL_COMMAND_END
doesn't actually access currentEventTriggerState at all unless the
trigger function calls pg_event_trigger_ddl_commands, which will falsely
return an error in the race case but won't crash.

So the obvious simple fix would be to have EventTriggerTableRewrite
likewise do nothing if currentEventTriggerState is not set (and it would
be more consistent to do the same for command_end triggers too).

-- 
Andrew (irc:RhodiumToad)


Tom Lane wrote:

> Hence, two questions:
> 
> * Should EventTriggerTableRewrite do
> 
>     if (!currentEventTriggerState ||
>         currentEventTriggerState->commandCollectionInhibited)
>         return;
> 
> like most of the other functions, or should it just check for null
> currentEventTriggerState?

I vaguely recall that commandCollectionInhibited is to hide internal
queries launched via SPI by matview REFRESH CONCURRENTLY.  I'm not sure
if refresh involves a table rewrite that could trigger the bug here, but
I'd be consistent about it.

I don't know the answer to the second question.

> The point that running fast_default in parallel with a pile of other
> regression tests is damfool test design still stands, but I have to
> credit it with having exposed a bug.

Indubitably.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


I wrote:
> Hence, two questions:
> * Should EventTriggerTableRewrite do
>     if (!currentEventTriggerState ||
>         currentEventTriggerState->commandCollectionInhibited)
>         return;
> like most of the other functions, or should it just check for null
> currentEventTriggerState?

After closer inspection I've concluded that it should not look
at commandCollectionInhibited; that disables collection of some
data, but it's not related to whether we ought to fire triggers,
AFAICS.

> * Of the three other callers of EventTriggerCommonSetup, only one
> has such a guard now.  But aren't EventTriggerDDLCommandStart and
> EventTriggerDDLCommandEnd equally vulnerable to this type of race
> condition?  What should they check exactly?  Perhaps
> EventTriggerCommonSetup itself should check this?

And the answer here is that DDLCommandStart must fire regardless of the
existence of currentEventTriggerState, because we don't set that up when
there are only ddl_command_start triggers.  But it seems like
EventTriggerDDLCommandEnd should quit if it's not set up.  That function
itself wouldn't crash, but presumably the called triggers would call
pg_event_trigger_ddl_commands which would fail.  Better to pretend the
triggers aren't active yet.

(I see Andrew Gierth came to the same conclusion.)

            regards, tom lane