Failures during FATAL exit - Mailing list pgsql-hackers

From Tom Lane
Subject Failures during FATAL exit
Msg-id 3634445.1602711426@sss.pgh.pa.us
List pgsql-hackers
As already noted in another thread, buildfarm member chipmunk
failed today with an unexpected Assert [1].  I've now reproduced
this by manually killing the postmaster during the regression
tests.  The stack trace looks like this:

#0  0x0000ffff91507598 in raise () from /lib64/libc.so.6
#1  0x0000ffff914f3da0 in abort () from /lib64/libc.so.6
#2  0x0000000000904bd0 in ExceptionalCondition (conditionName=conditionName@entry=0xa5ba88 "entry->trans == NULL",
    errorType=errorType@entry=0x95da10 "FailedAssertion", fileName=fileName@entry=0xa5b1b8 "pgstat.c",
    lineNumber=lineNumber@entry=909) at assert.c:69
#3  0x0000000000749e64 in pgstat_report_stat (force=force@entry=true) at pgstat.c:909
#4  0x0000000000749ee8 in pgstat_beshutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:3248
#5  0x00000000007b5cd0 in shmem_exit (code=code@entry=1) at ipc.c:272
#6  0x00000000007b5dc4 in proc_exit_prepare (code=code@entry=1) at ipc.c:194
#7  0x00000000007b5e74 in proc_exit (code=code@entry=1) at ipc.c:107
#8  0x0000000000908c8c in errfinish (filename=<optimized out>, filename@entry=0x976260 "parallel.c", lineno=lineno@entry=885,
    funcname=funcname@entry=0x9765a8 <__func__.10> "WaitForParallelWorkersToExit") at elog.c:578
#9  0x0000000000521ad4 in WaitForParallelWorkersToExit (pcxt=pcxt@entry=0x16af54f0) at parallel.c:885
#10 0x0000000000522af8 in DestroyParallelContext (pcxt=0x16af54f0) at parallel.c:958
#11 0x00000000005230cc in AtEOXact_Parallel (isCommit=isCommit@entry=false) at parallel.c:1231
#12 0x0000000000530588 in AbortTransaction () at xact.c:2702
#13 0x0000000000531234 in AbortOutOfAnyTransaction () at xact.c:4623
#14 0x0000000000915cbc in ShutdownPostgres (code=<optimized out>, arg=<optimized out>) at postinit.c:1195
#15 0x00000000007b5c78 in shmem_exit (code=code@entry=1) at ipc.c:239
#16 0x00000000007b5dc4 in proc_exit_prepare (code=code@entry=1) at ipc.c:194
#17 0x00000000007b5e74 in proc_exit (code=code@entry=1) at ipc.c:107
#18 0x00000000007b7888 in WaitEventSetWaitBlock (nevents=1, occurred_events=0xfffff82c41b8, cur_timeout=-1, set=0x16a6b0d8)
    at latch.c:1429
#19 WaitEventSetWait (set=0x16a6b0d8, timeout=-1, timeout@entry=0, occurred_events=occurred_events@entry=0xfffff82c41b8,
    nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217734) at latch.c:1309
#20 0x00000000007b7994 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=0,
    wait_event_info=wait_event_info@entry=134217734) at latch.c:411
#21 0x0000000000671ccc in gather_readnext (gatherstate=<optimized out>) at nodeGather.c:386
#22 gather_getnext (gatherstate=0x16bc4c28) at nodeGather.c:277
#23 ExecGather (pstate=0x16bc4c28) at nodeGather.c:227
#24 0x0000000000668434 in ExecProcNode (node=0x16bc4c28) at ../../../src/include/executor/executor.h:244
#25 fetch_input_tuple (aggstate=aggstate@entry=0x16bc4628) at nodeAgg.c:589
#26 0x000000000066aee8 in agg_retrieve_direct (aggstate=0x16bc4628) at nodeAgg.c:2451
#27 ExecAgg (pstate=0x16bc4628) at nodeAgg.c:2171
#28 0x0000000000655a0c in ExecProcNode (node=0x16bc4628) at ../../../src/include/executor/executor.h:244
#29 ExecutePlan (execute_once=<optimized out>, dest=0x16bce798, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x16bc4628, estate=0x16b65eb0)
    at execMain.c:1539
#30 standard_ExecutorRun (queryDesc=0x16a90ca0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364

Fundamentally, pgstat_report_stat() is Assert'ing that it can never
be called within an active transaction (i.e., without AtEOXact_PgStat
having been called first).  That fails in this scenario because, while
we are trying to abort the active transaction, AtEOXact_Parallel
suffers a new FATAL error (raised in WaitForParallelWorkersToExit,
frames #8 and #9), so we abandon the attempt to run the
ShutdownPostgres on-exit hook and move on to the next one.
When we get to pgstat_beshutdown_hook, the assertion fails because
AtEOXact_PgStat was never run.
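
To make the sequence concrete, here is a toy model of it in plain C.
It is a sketch, not PostgreSQL code: every name (toy_proc_exit,
toy_shutdown_hook, toy_stats_hook, and so on) is invented for
illustration, the several callback lists are collapsed into one, and
longjmp stands in for the final exit().  It assumes the callback index
is decremented before each call, so a re-entrant exit resumes with the
next callback rather than retrying the one that failed.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

#define MAX_CALLBACKS 8

typedef void (*exit_callback) (int code);

static exit_callback callbacks[MAX_CALLBACKS];
static int  n_callbacks = 0;
static jmp_buf process_gone;            /* stands in for the final exit() */

/* Toy state standing in for the backend's transaction and stats state. */
static bool xact_stats_closed = false;  /* "AtEOXact_PgStat has run" */
static bool abort_completed = false;    /* "AbortTransaction finished" */
static bool stats_check_failed = false; /* "the Assert would have fired" */

static void
register_exit_callback(exit_callback fn)
{
    assert(n_callbacks < MAX_CALLBACKS);
    callbacks[n_callbacks++] = fn;
}

/* Run the remaining callbacks in LIFO order, then "exit".  Never returns. */
static void
toy_proc_exit(int code)
{
    /* Decrement first, so a nested call skips the callback that failed. */
    while (--n_callbacks >= 0)
        callbacks[n_callbacks](code);
    longjmp(process_gone, 1);
}

/* Models the stats shutdown hook: it expects transaction stats closed. */
static void
toy_stats_hook(int code)
{
    (void) code;
    if (!xact_stats_closed)
        stats_check_failed = true;
}

/* Models the abort-on-exit hook: a nested FATAL strikes partway through. */
static void
toy_shutdown_hook(int code)
{
    toy_proc_exit(code);        /* nested FATAL during parallel cleanup */
    /* Never reached: the rest of the abort is silently skipped. */
    xact_stats_closed = true;
    abort_completed = true;
}

/* Returns true if the stats hook saw a still-open transaction. */
bool
simulate_fatal_exit(void)
{
    n_callbacks = 0;
    xact_stats_closed = abort_completed = stats_check_failed = false;
    register_exit_callback(toy_stats_hook);     /* runs last */
    register_exit_callback(toy_shutdown_hook);  /* runs first */
    if (setjmp(process_gone) == 0)
        toy_proc_exit(1);
    return stats_check_failed;
}

/* Reports whether the modeled abort ran to completion. */
bool
abort_ran_to_completion(void)
{
    return abort_completed;
}
```

Calling simulate_fatal_exit() in this model reports that the stats
hook ran with the transaction still open, and
abort_ran_to_completion() reports that the abort never finished.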

We could decide that this is just an overly-optimistic assertion
and fix it locally in pgstat.c.  However, it seems to me that we
have bigger problems here.  Were it not for the assertion failure,
we'd (probably) eventually get through all the on_proc_exit callbacks
and do exit(1), which the postmaster would think is fine.  But in
fact, we've skipped most of AbortTransaction().
Is it really safe to allow the rest of the system to keep running
in that scenario?

(Yeah, I realize that with the postmaster gone, there's no "rest
of the system" to worry about.  But the same scenario could arise
from elog(FATAL) triggered by a less dire failure.)

So what I'm wondering, basically, is if an elog(ERROR) or elog(FATAL)
occurring after we've started to run proc_exit() should be promoted
to a PANIC.  If we don't do that, how can we convince ourselves that
the system is left in an acceptable state?
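
As a rough sketch of that promotion rule, and nothing more: the
function below is hypothetical, as are the enum, its values, and the
exit_in_progress parameter.  It only shows the shape of the check,
namely that anything at ERROR or above becomes PANIC once exit
processing has begun, while lower levels pass through unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative severity levels; only the ordering matters here, and
 * the values are not PostgreSQL's. */
enum toy_elevel
{
    TOY_WARNING = 1,
    TOY_ERROR,
    TOY_FATAL,
    TOY_PANIC
};

/*
 * Hypothetical rule: once exit processing has started, any ERROR or
 * FATAL is promoted to PANIC; lower levels are left alone.
 */
int
promote_during_exit(int elevel, bool exit_in_progress)
{
    assert(elevel >= TOY_WARNING && elevel <= TOY_PANIC);
    if (exit_in_progress && elevel >= TOY_ERROR)
        return TOY_PANIC;
    return elevel;
}
```

Under this rule a FATAL raised from inside an exit callback would be
reported as PANIC, whereas the same FATAL raised before exit
processing begins would stay FATAL.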

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2020-10-14%2000%3A04%3A08


