Hi,
While testing something I made the checkpointer process intentionally crash as
soon as it started up. The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:
2023-07-29 14:32:39.241 PDT [65031] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG: checkpointer process (PID 65032) was terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG: terminating any other active server processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG: sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG: sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG: process 65035 taking over ProcSignal slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG: shutting down because restart_after_crash is off
Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.
I could not initially reproduce this on Linux.
After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...
The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:
    /* We now transit into a state of waiting for children to die */
    if (pmState == PM_RECOVERY ||
        pmState == PM_HOT_STANDBY ||
        pmState == PM_RUN ||
        pmState == PM_STOP_BACKENDS ||
        pmState == PM_SHUTDOWN)
        pmState = PM_WAIT_BACKENDS;
Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on Linux as well.
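To make the effect of that condition easier to see, here is a minimal
standalone model of it. This is only a sketch and not postmaster code: the
PMStateModel enum is a cut-down stand-in for the real state list, and
may_launch_checkpointer() is my assumption about which states allow a
checkpointer to be (re)started. crash_transition_with_startup() is a
hypothetical variant for comparison, not a tested fix.

```c
#include <stdbool.h>

/* Cut-down model of the postmaster states mentioned in this mail. */
typedef enum
{
    PM_STARTUP,
    PM_RECOVERY,
    PM_HOT_STANDBY,
    PM_RUN,
    PM_STOP_BACKENDS,
    PM_WAIT_BACKENDS,
    PM_SHUTDOWN
} PMStateModel;

/* Mirrors the quoted condition: PM_STARTUP is not in the list. */
static PMStateModel
crash_transition_current(PMStateModel state)
{
    if (state == PM_RECOVERY ||
        state == PM_HOT_STANDBY ||
        state == PM_RUN ||
        state == PM_STOP_BACKENDS ||
        state == PM_SHUTDOWN)
        return PM_WAIT_BACKENDS;

    /* Any other state, including PM_STARTUP, is left unchanged. */
    return state;
}

/* Hypothetical variant that also leaves PM_STARTUP on a crash. */
static PMStateModel
crash_transition_with_startup(PMStateModel state)
{
    if (state == PM_STARTUP)
        return PM_WAIT_BACKENDS;
    return crash_transition_current(state);
}

/*
 * Assumption for this sketch: a missing checkpointer is started whenever
 * the state is one of these four.
 */
static bool
may_launch_checkpointer(PMStateModel state)
{
    return state == PM_STARTUP ||
           state == PM_RECOVERY ||
           state == PM_HOT_STANDBY ||
           state == PM_RUN;
}
```

In this model a crash seen in PM_STARTUP leaves the state at PM_STARTUP, so
may_launch_checkpointer() still returns true. That matches the new process
in the log above. A crash seen in PM_RUN moves to PM_WAIT_BACKENDS, where it
returns false.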
I haven't fully dug through the history, but this looks to be quite an old problem.
Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.
Greetings,
Andres Freund