Home > mailing lists

Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Race conditions with checkpointer and shutdown
Date	April 18, 2019 22:22:41
Msg-id	29459.1555626161@sss.pgh.pa.us Whole thread Raw
In response to	Re: Race conditions with checkpointer and shutdown (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: Race conditions with checkpointer and shutdown
List	pgsql-hackers

Tree view

Thomas Munro <thomas.munro@gmail.com> writes:
> This is a curious thing from dragonet's log:

> 2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> process due to administrator command
> 2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

> LogCheckpointStart() is the thing that writes "starting: ...", and it
> prefers to report "shutdown" over "time", but it didn't.

Yeah, but since we don't see "shutting down", we know that the shutdown
checkpoint hasn't begun.  Here's another similar case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54

The relevant fragment of the standby server's log is

2018-11-30 05:09:22.996 PST [4229] LOG:  received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG:  aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG:  checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed,
1recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563
kB,estimate=16563 kB 
2018-11-30 05:09:23.679 PST [4229] LOG:  background worker "logical replication launcher" (PID 4276) exited with exit
code1 
2018-11-30 05:11:23.757 PST [4288] master LOG:  unexpected EOF on standby connection
2018-11-30 05:11:23.883 PST [4229] LOG:  received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG:  database system is shut down

To the extent that I've found logs in which the checkpointer prints
anything at all during this interval, it seems to be just quietly
plodding along with its usual business, without any hint that it's
aware of the pending shutdown request.  It'd be very easy to believe
that the postmaster -> checkpointer SIGUSR2 is simply getting dropped,
or never issued.

Hmm ... actually, looking at the postmaster's logic, it won't issue
SIGUSR2 to the checkpointer until the walreceiver (if any) is gone.
And now that I think about it, several of these logs contain traces
showing that the walreceiver is still live.  Like the one quoted above:
seems like the line from PID 4288 has to be from a walreceiver.

Maybe what we should be looking for is "why doesn't the walreceiver
shut down"?  But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit.  Tis a puzzlement.

I'm a bit tempted to add temporary debug logging to the postmaster so
that walreceiver start/stop is recorded at LOG level.  We'd have to wait
a few weeks to have any clear result from the buildfarm, but I'm not sure
how we'll get any hard data without some such measures.

            regards, tom lane

pgsql-hackers by date:

From: Thomas Munro
Date: 18 April 2019, 21:58:40
Subject: Re: Race conditions with checkpointer and shutdown

From: Stephen Frost
Date: 18 April 2019, 22:39:46
Subject: Re: block-level incremental backup

Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers

Previous

Next