Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher
Msg-id CAB7nPqS8ndjKwgekZG8dgfffOVJNcLw6bviPZ3+ShexK5E=ukg@mail.gmail.com
In response to Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 4/20/17 07:52, Petr Jelinek wrote:
>> On 20/04/17 05:57, Michael Paquier wrote:
>>> 2nd thoughts here... Ah now I see your point. True that there is no
>>> way to ensure that an unwanted command is not running when SIGUSR2 is
>>> received as the shutdown checkpoint may have already begun. Here is an
>>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>>> the shutdown checkpoint does not run as long as all WAL senders still
>>> running do not reach such a state.
>>
>> +1 to this solution
>
> Michael, can you attempt to supply a patch?

Hmm. I have actually been looking at this solution and I have
doubts regarding its robustness. In short this would need to be
roughly a two-step process, with the WAL senders signaled twice:
- In PostmasterStateMachine(), SIGUSR2 is sent to the checkpointer to
make it call ShutdownXLOG(). Prior to doing that, a first signal should
be sent to all the WAL senders with
SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
used.
- At reception of this signal, all WAL senders switch to a stopping
state, refusing commands that can generate WAL.
- Checkpointer looks at the state of all WAL senders, looping with a
sleep call of a couple of ms, refusing to launch the shutdown
checkpoint until all WAL senders have switched to the
stopping state.
- In reaper(), once checkpointer is confirmed as stopped, signal again
the WAL senders, and tell them to perform the last loop.
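
The state handling in the steps above can be sketched as a small
standalone model. This is not PostgreSQL code: the Sim* names are
hypothetical stand-ins, the stopping state is only the one proposed in
this thread, and signal delivery, shared memory and the checkpointer's
sleep loop are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for WalSndState. */
typedef enum
{
    SIM_WALSND_INACTIVE,    /* slot not used by any WAL sender */
    SIM_WALSND_STREAMING,   /* normal operation */
    SIM_WALSND_STOPPING     /* proposed state: no more WAL-generating commands */
} SimWalSndState;

/* WAL sender side: reaction to the first signal sent by the postmaster. */
static void
sim_walsnd_handle_stop_signal(SimWalSndState *state)
{
    assert(state != NULL);
    if (*state != SIM_WALSND_INACTIVE)
        *state = SIM_WALSND_STOPPING;
}

/* WAL sender side: commands that can generate WAL are refused once stopping. */
static bool
sim_walsnd_may_generate_wal(SimWalSndState state)
{
    return state == SIM_WALSND_STREAMING;
}

/*
 * Checkpointer side: the condition it would poll before launching the
 * shutdown checkpoint. Unused slots are ignored.
 */
static bool
sim_all_walsenders_stopping(const SimWalSndState *states, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (states[i] != SIM_WALSND_INACTIVE &&
            states[i] != SIM_WALSND_STOPPING)
            return false;
    }
    return true;
}
```

In this model the checkpointer would call
sim_all_walsenders_stopping() in a loop, sleeping between attempts,
and only start the shutdown checkpoint once it returns true.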

After that, I got a second, simpler idea.
CheckpointerShmem->ckpt_flags holds the information about checkpoints
currently running, so we could have the WAL senders look at this data
and refuse any commands generating WAL. The checkpointer may have
already stopped by the time the WAL senders finish their loop, so we
also need to check whether the checkpointer is running on those code
paths. Such safeguards may actually be enough for the problem of this
thread. Thoughts?
-- 
Michael


