Failures during FATAL exit - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Failures during FATAL exit |
Date | |
Msg-id | 3634445.1602711426@sss.pgh.pa.us Whole thread Raw |
List | pgsql-hackers |
As already noted in another thread, buildfarm member chipmunk failed today with an unexpected Assert [1]. I've now reproduced this by manually killing the postmaster during the regression tests. The stack trace looks like #0 0x0000ffff91507598 in raise () from /lib64/libc.so.6 #1 0x0000ffff914f3da0 in abort () from /lib64/libc.so.6 #2 0x0000000000904bd0 in ExceptionalCondition (conditionName=conditionName@entry=0xa5ba88 "entry->trans == NULL", errorType=errorType@entry=0x95da10 "FailedAssertion", fileName=fileName@entry=0xa5b1b8 "pgstat.c", lineNumber=lineNumber@entry=909) at assert.c:69 #3 0x0000000000749e64 in pgstat_report_stat (force=force@entry=true) at pgstat.c:909 #4 0x0000000000749ee8 in pgstat_beshutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:3248 #5 0x00000000007b5cd0 in shmem_exit (code=code@entry=1) at ipc.c:272 #6 0x00000000007b5dc4 in proc_exit_prepare (code=code@entry=1) at ipc.c:194 #7 0x00000000007b5e74 in proc_exit (code=code@entry=1) at ipc.c:107 #8 0x0000000000908c8c in errfinish (filename=<optimized out>, filename@entry=0x976260 "parallel.c", lineno=lineno@entry=885, funcname=funcname@entry=0x9765a8 <__func__.10> "WaitForParallelWorkersToExit") at elog.c:578 #9 0x0000000000521ad4 in WaitForParallelWorkersToExit (pcxt=pcxt@entry=0x16af54f0) at parallel.c:885 #10 0x0000000000522af8 in DestroyParallelContext (pcxt=0x16af54f0) at parallel.c:958 #11 0x00000000005230cc in AtEOXact_Parallel (isCommit=isCommit@entry=false) at parallel.c:1231 #12 0x0000000000530588 in AbortTransaction () at xact.c:2702 #13 0x0000000000531234 in AbortOutOfAnyTransaction () at xact.c:4623 #14 0x0000000000915cbc in ShutdownPostgres (code=<optimized out>, arg=<optimized out>) at postinit.c:1195 #15 0x00000000007b5c78 in shmem_exit (code=code@entry=1) at ipc.c:239 #16 0x00000000007b5dc4 in proc_exit_prepare (code=code@entry=1) at ipc.c:194 #17 0x00000000007b5e74 in proc_exit (code=code@entry=1) at ipc.c:107 #18 0x00000000007b7888 in WaitEventSetWaitBlock (nevents=1, occurred_events=0xfffff82c41b8, cur_timeout=-1, set=0x16a6b0d8) at latch.c:1429 #19 WaitEventSetWait (set=0x16a6b0d8, timeout=-1, timeout@entry=0, occurred_events=occurred_events@entry=0xfffff82c41b8, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217734) at latch.c:1309 #20 0x00000000007b7994 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=134217734) at latch.c:411 #21 0x0000000000671ccc in gather_readnext (gatherstate=<optimized out>) at nodeGather.c:386 #22 gather_getnext (gatherstate=0x16bc4c28) at nodeGather.c:277 #23 ExecGather (pstate=0x16bc4c28) at nodeGather.c:227 #24 0x0000000000668434 in ExecProcNode (node=0x16bc4c28) at ../../../src/include/executor/executor.h:244 #25 fetch_input_tuple (aggstate=aggstate@entry=0x16bc4628) at nodeAgg.c:589 #26 0x000000000066aee8 in agg_retrieve_direct (aggstate=0x16bc4628) at nodeAgg.c:2451 #27 ExecAgg (pstate=0x16bc4628) at nodeAgg.c:2171 #28 0x0000000000655a0c in ExecProcNode (node=0x16bc4628) at ../../../src/include/executor/executor.h:244 #29 ExecutePlan (execute_once=<optimized out>, dest=0x16bce798, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x16bc4628, estate=0x16b65eb0) at execMain.c:1539 #30 standard_ExecutorRun (queryDesc=0x16a90ca0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364 Fundamentally, pgstat_report_stat() is Assert'ing that it can never be called within an active transaction (i.e., without AtEOXact_PgStat having been called first). That fails in this scenario because while we are trying to abort the active transaction, AtEOXact_Parallel suffers a new FATAL error, so we abandon the attempt to run the ShutdownPostgres on-exit hook and move on to the next one. When we get to pgstat_beshutdown_hook, that fails because AtEOXact_PgStat was never run. We could decide that this is just an overly-optimistic assertion and fix it locally in pgstat.c. However, it seems to me that we have bigger problems here. Were it not for the assertion failure, we'd (probably) eventually get through all the on_proc_exit callbacks and do exit(1), which the postmaster would think is fine. But in point of fact, we've missed out doing most of AbortTransaction(). Is it really safe to allow the rest of the system to keep running in that scenario? (Yeah, I realize that with the postmaster gone, there's no "rest of the system" to worry about. But the same scenario could arise from elog(FATAL) triggered by a less dire failure.) So what I'm wondering, basically, is if an elog(ERROR) or elog(FATAL) occurring after we've started to run proc_exit() should be promoted to a PANIC. If we don't do that, how can we convince ourselves that the system is left in an acceptable state? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2020-10-14%2000%3A04%3A08
pgsql-hackers by date: