From b6e7bff6a4f196303356dfb478604a58b077147c Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 5 Mar 2024 20:33:14 +1300
Subject: [PATCH] Fix rare recovery shutdown hang due to checkpointer.

Commit 7ff23c6d started running the checkpointer during crash recovery.
As discovered by Justin, in one rare case it could prevent shutdown from
succeeding during a narrow phase at the beginning of crash recovery
after a server crash.

When the server is automatically restarting but before
PMSIGNAL_RECOVERY_STARTED is received from the startup process,
FatalError is still true. If a shutdown request arrived in that narrow
window, the PostmasterStateMachine() logic behaved as if the
checkpointer was not running and didn't need to be told to shut down,
and yet waited forever for it to exit.

Now, we can only move from PM_WAIT_BACKENDS state directly to
PM_WAIT_DEADEND if the checkpointer isn't running. If it is, we now
distinguish between the smart and fast shutdown cases, where we need to
tell the checkpointer to shut down and move to PM_SHUTDOWN, and the
immediate shutdown or child crash case, where it should already have
been told to quit and we're still waiting for that to happen, so we
stay in PM_WAIT_BACKENDS.

Back-patch to 15.

XXX Experimental patch, not sure yet

Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/ZWlrdQarrZvLsgIk@pryzbyj2023
---
 src/backend/postmaster/postmaster.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index da0c627107e..62db752228a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3748,12 +3748,14 @@ PostmasterStateMachine(void)
 			WalSummarizerPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && Shutdown < ImmediateShutdown) ||
+			 (FatalError && CheckpointerPID != 0)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0 &&
 			SlotSyncWorkerPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (CheckpointerPID == 0 &&
+				(Shutdown >= ImmediateShutdown || FatalError))
 			{
 				/*
 				 * Start waiting for dead_end children to die. This state
@@ -3767,7 +3769,7 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
-			else
+			else if (Shutdown > NoShutdown && Shutdown < ImmediateShutdown)
 			{
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
@@ -3805,6 +3807,16 @@ PostmasterStateMachine(void)
 						signal_child(PgArchPID, SIGQUIT);
 				}
 			}
+			else
+			{
+				/*
+				 * Either it's an immediate shutdown or a child crashed, and
+				 * we're still waiting for all the children to quit. The
+				 * checkpointer was already told to quit.
+				 */
+				Assert(Shutdown == ImmediateShutdown ||
+					   (Shutdown == NoShutdown && FatalError));
+			}
 		}
 	}
 
-- 
2.43.0
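
Not part of the patch: a minimal standalone sketch of the decision the
patched PM_WAIT_BACKENDS branch makes once the other children have
exited. It assumes the checkpointer is modeled by a boolean rather than
CheckpointerPID. NextStep, its values and next_step() are hypothetical
names used only for illustration. The real code also starts a
checkpointer when none is running, signals children and updates pmState,
all of which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Shutdown levels, in the same order as the patch compares them. */
enum { NoShutdown, SmartShutdown, FastShutdown, ImmediateShutdown };

/* Hypothetical result type; these names do not exist in postmaster.c. */
typedef enum
{
	STAY_IN_WAIT_BACKENDS,		/* keep waiting for children to exit */
	GO_TO_WAIT_DEAD_END,		/* checkpointer gone, crash or immediate */
	GO_TO_SHUTDOWN				/* tell checkpointer to shut down */
} NextStep;

/*
 * Simplified model of the patched logic. Precondition (as in
 * PM_WAIT_BACKENDS): a shutdown was requested or fatal_error is set.
 */
NextStep
next_step(bool checkpointer_running, bool fatal_error, int shutdown)
{
	/* Mirrors the checkpointer clause of the outer condition. */
	bool		ready = !checkpointer_running ||
		(!fatal_error && shutdown < ImmediateShutdown) ||
		(fatal_error && checkpointer_running);

	if (!ready)
		return STAY_IN_WAIT_BACKENDS;

	/* Only skip straight ahead if the checkpointer has really exited. */
	if (!checkpointer_running &&
		(shutdown >= ImmediateShutdown || fatal_error))
		return GO_TO_WAIT_DEAD_END;

	/* Smart or fast shutdown: the checkpointer must be told to stop. */
	if (shutdown > NoShutdown && shutdown < ImmediateShutdown)
		return GO_TO_SHUTDOWN;

	/* Immediate shutdown or child crash: it was already told to quit. */
	assert(shutdown == ImmediateShutdown ||
		   (shutdown == NoShutdown && fatal_error));
	return STAY_IN_WAIT_BACKENDS;
}
```

Under these assumptions, next_step(true, true, FastShutdown) models the
reported case (crash restart in progress, checkpointer running, fast
shutdown requested) and yields GO_TO_SHUTDOWN rather than waiting.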