Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Race conditions with checkpointer and shutdown |
Date | |
Msg-id | 29459.1555626161@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Race conditions with checkpointer and shutdown (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Race conditions with checkpointer and shutdown
|
List | pgsql-hackers |
Thomas Munro <thomas.munro@gmail.com> writes: > This is a curious thing from dragonet's log: > 2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request > 2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions > 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver > process due to administrator command > 2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time > LogCheckpointStart() is the thing that writes "starting: ...", and it > prefers to report "shutdown" over "time", but it didn't. Yeah, but since we don't see "shutting down", we know that the shutdown checkpoint hasn't begun. Here's another similar case: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54 The relevant fragment of the standby server's log is 2018-11-30 05:09:22.996 PST [4229] LOG: received fast shutdown request 2018-11-30 05:09:23.628 PST [4229] LOG: aborting any active transactions 2018-11-30 05:09:23.649 PST [4231] LOG: checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB,estimate=16563 kB 2018-11-30 05:09:23.679 PST [4229] LOG: background worker "logical replication launcher" (PID 4276) exited with exit code1 2018-11-30 05:11:23.757 PST [4288] master LOG: unexpected EOF on standby connection 2018-11-30 05:11:23.883 PST [4229] LOG: received immediate shutdown request 2018-11-30 05:11:23.907 PST [4229] LOG: database system is shut down To the extent that I've found logs in which the checkpointer prints anything at all during this interval, it seems to be just quietly plodding along with its usual business, without any hint that it's aware of the pending shutdown request. It'd be very easy to believe that the postmaster -> checkpointer SIGUSR2 is simply getting dropped, or never issued. Hmm ... actually, looking at the postmaster's logic, it won't issue SIGUSR2 to the checkpointer until the walreceiver (if any) is gone. And now that I think about it, several of these logs contain traces showing that the walreceiver is still live. Like the one quoted above: seems like the line from PID 4288 has to be from a walreceiver. Maybe what we should be looking for is "why doesn't the walreceiver shut down"? But the dragonet log you quote above shows the walreceiver exiting, or at least starting to exit. Tis a puzzlement. I'm a bit tempted to add temporary debug logging to the postmaster so that walreceiver start/stop is recorded at LOG level. We'd have to wait a few weeks to have any clear result from the buildfarm, but I'm not sure how we'll get any hard data without some such measures. regards, tom lane
pgsql-hackers by date: