Thread: Repeated crashes in GENERATED ... AS IDENTITY tests
woodlouse just showed a failure that looked a bit familiar: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=woodlouse&dt=2018-04-17%2016%3A42%3A43 I trawled the buildfarm logs for similar failures, and there are several: sysname | snapshot | stage | l -----------+---------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------ whelk | 2018-03-28 05:41:30 | Check | 2018-03-28 07:52:26.987 CEST [4928:6] DETAIL: Failed process was running: ALTERTABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;\r conchuela | 2018-04-03 10:00:00 | Check | 2018-04-03 12:11:47.520 CEST [5ac35311.5248:5] DETAIL: Failed process was running:ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY; gharial | 2018-04-10 00:32:08 | Check | 2018-04-09 18:59:39.075 MDT [5acc0b05.32ea:5] DETAIL: Failed process was running:ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY; whelk | 2018-04-14 19:41:24 | Check | 2018-04-14 21:52:07.700 CEST [3760:6] DETAIL: Failed process was running: ALTERTABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;\r woodlouse | 2018-04-17 16:42:43 | Check | 2018-04-17 18:54:04.311 CEST [864:6] DETAIL: Failed process was running: ALTERTABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY;\r (5 rows) The test case that's failing, in identity.sql, has been there since early February; the lack of any crashes till more recently suggests that something committed in mid-to-late March broke it. I have no idea what's going on there, but I think this is clearly something we need to consider an open item for PG11. I'll put it on the list. regards, tom lane
Tom Lane wrote: > The test case that's failing, in identity.sql, has been there since > early February; the lack of any crashes till more recently suggests > that something committed in mid-to-late March broke it. > > I have no idea what's going on there, but I think this is clearly > something we need to consider an open item for PG11. I'll put it > on the list. Grumble. Of the machines you listed, conchuela has gdb installed, which gives us a nice backtrace of the crash, pasted below, which seems to blame event triggers. Of the tests in the same parallel group as identity, the test fast_default seems to be the only one with an event trigger and was added on March 28th by 16828d5c0273. That combination seems to explain why the crashes appeared late ... I can't look further into this now -- maybe next week if nobody has beaten me into it. My guess is that the identity stuff is not setting state like event triggers expect. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2018-04-03%2010%3A00%3A00 ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/postgres.core ================== [New <main task>] Core was generated by `postgres'. Program terminated with signal 11, Segmentation fault. #0 EventTriggerTableRewrite (parsetree=parsetree@entry=0x80a852670, tableOid=49861, reason=2) at event_trigger.c:974 974 currentEventTriggerState->table_rewrite_oid = tableOid; #0 EventTriggerTableRewrite (parsetree=parsetree@entry=0x80a852670, tableOid=49861, reason=2) at event_trigger.c:974 #1 0x0000000000601bfc in ATRewriteTables (lockmode=8, wqueue=0x7fffffffe3d8, parsetree=0x80a852670) at tablecmds.c:4326 #2 ATController (parsetree=parsetree@entry=0x80a852670, rel=<optimized out>, cmds=<optimized out>, recurse=<optimized out>,lockmode=lockmode@entry=8) at tablecmds.c:3536 #3 0x00000000006021af in AlterTable (relid=relid@entry=49861, lockmode=lockmode@entry=8, stmt=stmt@entry=0x80a852670) attablecmds.c:3206 #4 0x000000000077df9f in ProcessUtilitySlow (pstate=pstate@entry=0x80a852118, pstmt=pstmt@entry=0x801647110, queryString=queryString@entry=0x801646118"ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY;", context=context@entry=PROCESS_UTILITY_TOPLEVEL,params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, completionTag=0x7fffffffe830"", dest=0x801647400) at utility.c:1116 #5 0x000000000077c632 in standard_ProcessUtility (pstmt=0x801647110, queryString=0x801646118 "ALTER TABLE itest13 ADD COLUMNb int GENERATED BY DEFAULT AS IDENTITY;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x801647400,completionTag=0x7fffffffe830 "") at utility.c:919 #6 0x0000000000779e05 in PortalRunUtility (portal=0x80a60c118, pstmt=0x801647110, isTopLevel=<optimized out>, setHoldSnapshot=<optimizedout>, dest=0x801647400, completionTag=0x7fffffffe830 "") at pquery.c:1183 #7 0x000000000077a925 in PortalRunMulti (portal=portal@entry=0x80a60c118, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x801647400, altdest=altdest@entry=0x801647400, completionTag=completionTag@entry=0x7fffffffe830"") at pquery.c:1336 #8 0x000000000077b5be in PortalRun (portal=portal@entry=0x80a60c118, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0x801647400, altdest=altdest@entry=0x801647400,completionTag=0x7fffffffe830 "") at pquery.c:804 #9 0x00000000007771e5 in exec_simple_query (query_string=0x801646118 "ALTER TABLE itest13 ADD COLUMN b int GENERATED BYDEFAULT AS IDENTITY;") at postgres.c:1121 #10 0x000000000077869b in PostgresMain (argc=<optimized out>, argv=argv@entry=0x8012f03e8, dbname=0x8012f0190 "regression",username=<optimized out>) at postgres.c:4149 #11 0x0000000000480398 in BackendRun (port=0x8016c0280) at postmaster.c:4409 #12 BackendStartup (port=0x8016c0280) at postmaster.c:4081 #13 ServerLoop () at postmaster.c:1754 #14 0x0000000000701009 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x7ffffffff5d0) at postmaster.c:1362 #15 0x0000000000481978 in main (argc=8, argv=0x7ffffffff5d0) at main.c:228 -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Of the machines you listed, conchuela has gdb installed, which gives us > a nice backtrace of the crash, pasted below, which seems to blame event > triggers. Of the tests in the same parallel group as identity, the test > fast_default seems to be the only one with an event trigger and was > added on March 28th by 16828d5c0273. That combination seems to explain > why the crashes appeared late ... Hm. Our past policy has been that tests involving event triggers must not run in parallel with any other test, precisely because the event triggers interact with anything else happening at the same time, and will inevitably create irreproducible test failures. I'm inclined to say that whether or not there's a bug here (and there well may be, it doesn't seem like a crash is a good thing), this is bad test design and we need to change it. regards, tom lane
Event trigger bugs (was Re: Repeated crashes in GENERATED ... AS IDENTITY tests)
From
Tom Lane
Date:
I wrote: > I'm inclined to say that whether or not there's a bug here (and there > well may be, it doesn't seem like a crash is a good thing), this is > bad test design and we need to change it. So my suspicion was aroused by the fact that, unlike almost every other function in event_trigger.c, EventTriggerTableRewrite does not bother to verify that currentEventTriggerState isn't null before dereferencing it. I soon found out how to reproduce the crash observed in the buildfarm: 1. In session 1, set a breakpoint at ATController, and do CREATE TABLE itest13 (a int); ALTER TABLE itest13 ADD COLUMN b int GENERATED BY DEFAULT AS IDENTITY; 2. After the ALTER reaches the breakpoint, start a second session and create an event trigger. The one used by fast_default will do fine: CREATE FUNCTION log_rewrite() RETURNS event_trigger LANGUAGE plpgsql as $func$ declare this_schema text; begin select into this_schema relnamespace::regnamespace::text from pg_class where oid = pg_event_trigger_table_rewrite_oid(); if this_schema = 'fast_default' then RAISE NOTICE 'rewriting table % for reason %', pg_event_trigger_table_rewrite_oid()::regclass, pg_event_trigger_table_rewrite_reason(); end if; end; $func$; CREATE EVENT TRIGGER has_volatile_rewrite ON table_rewrite EXECUTE PROCEDURE log_rewrite(); 3. Allow session 1 to continue from its breakpoint. Kaboom! The reason of course is that EventTriggerCommonSetup finds the now-relevant event trigger and returns a nonempty list, but our currently active command hasn't initialized any event trigger support because there were no event triggers when it started. So whoever thought they could omit the standard guard check here was full of it. Hence, two questions: * Should EventTriggerTableRewrite do if (!currentEventTriggerState || currentEventTriggerState->commandCollectionInhibited) return; like most of the other functions, or should it just check for null currentEventTriggerState? * Of the three other callers of EventTriggerCommonSetup, only one has such a guard now. But aren't EventTriggerDDLCommandStart and EventTriggerDDLCommandEnd equally vulnerable to this type of race condition? What should they check exactly? Perhaps EventTriggerCommonSetup itself should check this? The point that running fast_default in parallel with a pile of other regression tests is damfool test design still stands, but I have to credit it with having exposed a bug. regards, tom lane
>>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes: Alvaro> I can't look further into this now -- maybe next week if nobody Alvaro> has beaten me into it. My guess is that the identity stuff is Alvaro> not setting state like event triggers expect. I think this is unrelated to the identity stuff. I suspect a race condition: the event trigger code is implicitly assuming that the event trigger cache can't pick up new entries in between EventTriggerStartCompleteQuery and actually firing a table rewrite operation. If initially there are no event triggers that require currentEventTriggerState, then EventTriggerStartCompleteQuery won't allocate one. But that's done before actually executing the command, so if we accept invalidations on the cache afterwards, a newly added trigger might show up, so by the time we reach EventTriggerTableRewrite we think we have work to do, but currentEventTriggerState is still null. I haven't been able to reproduce it yet, so this is conjecture, but I think it's correct. It being a relatively narrow race explains the relative rarity of the failure. For the other event trigger types, SQL_DROP checks for currentEventTriggerState's validity, so it'll simply fail to run the trigger if one wasn't already present at command start; DDL_COMMAND_END doesn't actually access currentEventTriggerState at all unless the trigger function calls pg_event_trigger_ddl_commands, which will falsely return an error in the race case but won't crash. So the obvious simple fix would be to have EventTriggerTableRewrite likewise do nothing if currentEventTriggerState is not set (and it would be more consistent to do the same for command_end triggers too). -- Andrew (irc:RhodiumToad)
Re: Event trigger bugs (was Re: Repeated crashes in GENERATED ... ASIDENTITY tests)
From
Alvaro Herrera
Date:
Tom Lane wrote: > Hence, two questions: > > * Should EventTriggerTableRewrite do > > if (!currentEventTriggerState || > currentEventTriggerState->commandCollectionInhibited) > return; > > like most of the other functions, or should it just check for null > currentEventTriggerState? I vaguely recall that commandCollectionInhibited is to hide internal queries launched via SPI by matview REFRESH CONCURRENTLY. I'm not sure if refresh involves a table rewrite that could trigger the bug here, but I'd be consistent about it. I don't know the answer to the second question. > The point that running fast_default in parallel with a pile of other > regression tests is damfool test design still stands, but I have to > credit it with having exposed a bug. Indubitably. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Event trigger bugs (was Re: Repeated crashes in GENERATED ... AS IDENTITY tests)
From
Tom Lane
Date:
I wrote: > Hence, two questions: > * Should EventTriggerTableRewrite do > if (!currentEventTriggerState || > currentEventTriggerState->commandCollectionInhibited) > return; > like most of the other functions, or should it just check for null > currentEventTriggerState? After closer inspection I've concluded that it should not look at commandCollectionInhibited; that disables collection of some data, but it's not related to whether we ought to fire triggers, AFAICS. > * Of the three other callers of EventTriggerCommonSetup, only one > has such a guard now. But aren't EventTriggerDDLCommandStart and > EventTriggerDDLCommandEnd equally vulnerable to this type of race > condition? What should they check exactly? Perhaps > EventTriggerCommonSetup itself should check this? And the answer here is that DDLCommandStart must fire regardless of the existence of currentEventTriggerState, because we don't set that up when there are only ddl_command_start triggers. But it seems like EventTriggerDDLCommandEnd should quit if it's not set up. That function itself wouldn't crash, but presumably the called triggers would call pg_event_trigger_ddl_commands which would fail. Better to pretend the triggers aren't active yet. (I see Andrew Gierth came to the same conclusion.) regards, tom lane