Thread: walsender bug: stuck during shutdown

walsender bug: stuck during shutdown

From
Alvaro Herrera
Date:
Hello

Chloe Dives reported that sometimes a walsender would become stuck
during shutdown and *not* shutdown, thus preventing postmaster from
completing the shutdown cycle.  This has been observed to cause the
servers to remain in such state for several hours.

After a lengthy investigation and thanks to a handy reproducer by Chris
Wilson, we found that the problem is that WalSndDone wants to avoid
shutting down until everything has been sent and acknowledged; but this
test is coded in a way that ignores the possibility that we have never
received anything from the other end.  In that case, both
MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition
in WalSndDone to terminate the loop is never fulfilled.  So the
walsender is looping forever and never terminates, blocking shutdown of
the whole instance.

The attached patch fixes the problem by testing for the problematic
condition.

Apparently this problem has existed forever.  Fujii-san almost patched
for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.

-- 
Álvaro Herrera

Attachment

Re: walsender bug: stuck during shutdown

From
Fujii Masao
Date:

On 2020/11/24 5:52, Alvaro Herrera wrote:
> Hello
> 
> Chloe Dives reported that sometimes a walsender would become stuck
> during shutdown and *not* shutdown, thus preventing postmaster from
> completing the shutdown cycle.  This has been observed to cause the
> servers to remain in such state for several hours.
> 
> After a lengthy investigation and thanks to a handy reproducer by Chris
> Wilson, we found that the problem is that WalSndDone wants to avoid
> shutting down until everything has been sent and acknowledged; but this
> test is coded in a way that ignores the possibility that we have never
> received anything from the other end.  In that case, both
> MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition
> in WalSndDone to terminate the loop is never fulfilled.  So the
> walsender is looping forever and never terminates, blocking shutdown of
> the whole instance.
> 
> The attached patch fixes the problem by testing for the problematic
> condition.
> 
> Apparently this problem has existed forever.  Fujii-san almost patched
> for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.

Thanks for working on this!
Could you tell me the discussion thread where Chloe Dives reported the issue to?
Sorry I could not find that..
I'd like to see the procedure to reproduce the issue.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: walsender bug: stuck during shutdown

From
Alvaro Herrera
Date:
Hello,

On 2020-Nov-24, Fujii Masao wrote:

> Thanks for working on this!
> Could you tell me the discussion thread where Chloe Dives reported the issue to?
> Sorry I could not find that..

It was not public -- sorry I didn't make that clear.

> I'd like to see the procedure to reproduce the issue.

Here's the script.


Thanks!

Attachment

Re: walsender bug: stuck during shutdown

From
Fujii Masao
Date:

On 2020/11/25 0:07, Alvaro Herrera wrote:
> Hello,
> 
> On 2020-Nov-24, Fujii Masao wrote:
> 
>> Thanks for working on this!
>> Could you tell me the discussion thread where Chloe Dives reported the issue to?
>> Sorry I could not find that..
> 
> It was not public -- sorry I didn't make that clear.
> 
>> I'd like to see the procedure to reproduce the issue.
> 
> Here's the script.

Thanks!

But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr,
walsender should keep waiting for the ack to all the sent message to be
replied, i.e., isn't this expected behavior of normal shutdown? That is,
if we want to shutdown walsender even when the client side doesn't
reply message, immediate shutdown should be used or the client side
should be terminated, instead?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: walsender bug: stuck during shutdown

From
Alvaro Herrera
Date:
On 2020-Nov-25, Fujii Masao wrote:

> But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr,
> walsender should keep waiting for the ack to all the sent message to be
> replied, i.e., isn't this expected behavior of normal shutdown?  That is,
> if we want to shutdown walsender even when the client side doesn't
> reply message, immediate shutdown should be used or the client side
> should be terminated, instead?

I don't think "waiting forever" can be considered the expected behavior;
this has caused what are nominally production outages several times
already, since we sent a shutdown signal to the server and it never
completed shutting down.

If you want to propose a better patch to address the issue, feel free,
but keeping things as they are seems a bad idea to me.

(Hmm, maybe another idea would be to have WalSndDone cause a keepalive
message to be sent to the client and complete shut down when we get a
reply to that.)



Re: walsender bug: stuck during shutdown

From
Fujii Masao
Date:

On 2020/11/26 3:45, Alvaro Herrera wrote:
> On 2020-Nov-25, Fujii Masao wrote:
> 
>> But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr,
>> walsender should keep waiting for the ack to all the sent message to be
>> replied, i.e., isn't this expected behavior of normal shutdown?  That is,
>> if we want to shutdown walsender even when the client side doesn't
>> reply message, immediate shutdown should be used or the client side
>> should be terminated, instead?
> 
> I don't think "waiting forever" can be considered the expected behavior;
> this has caused what are nominally production outages several times
> already, since we sent a shutdown signal to the server and it never
> completed shutting down.

On the second thought, walsender doesn't wait forever unless
wal_sender_timeout is disabled, even in the case in discussion?
Or if there is the case where wal_sender_timeout doesn't work expectedly,
we might need to fix that at first.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: walsender bug: stuck during shutdown

From
Alvaro Herrera
Date:
On 2020-Nov-26, Fujii Masao wrote:

> On the second thought, walsender doesn't wait forever unless
> wal_sender_timeout is disabled, even in the case in discussion?
> Or if there is the case where wal_sender_timeout doesn't work expectedly,
> we might need to fix that at first.

Hmm, no, it doesn't wait forever in that sense; tracing with the
debugger shows that the process is looping continuously.



Re: walsender bug: stuck during shutdown

From
Fujii Masao
Date:

On 2020/11/26 11:45, Alvaro Herrera wrote:
> On 2020-Nov-26, Fujii Masao wrote:
> 
>> On the second thought, walsender doesn't wait forever unless
>> wal_sender_timeout is disabled, even in the case in discussion?
>> Or if there is the case where wal_sender_timeout doesn't work expectedly,
>> we might need to fix that at first.
> 
> Hmm, no, it doesn't wait forever in that sense; tracing with the
> debugger shows that the process is looping continuously.

Yes, so the problem here is that walsender goes into the busy loop
in that case. Seems this happens only in logical replication walsender.
In physical replication walsender, WaitLatchOrSocket() in WalSndLoop()
seems to work as expected and prevent it from entering into busy loop
even in that case.

        /*
         * If postmaster asked us to stop, don't wait anymore.
         *
         * It's important to do this check after the recomputation of
         * RecentFlushPtr, so we can send all remaining data before shutting
         * down.
         */
        if (got_STOPPING)
            break;

The above code in WalSndWaitForWal() seems to cause this issue. But I've
not come up with idea about how to fix yet.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: walsender bug: stuck during shutdown

From
Alvaro Herrera
Date:
On 2020-Nov-26, Fujii Masao wrote:

> Yes, so the problem here is that walsender goes into the busy loop
> in that case. Seems this happens only in logical replication walsender.
> In physical replication walsender, WaitLatchOrSocket() in WalSndLoop()
> seems to work as expected and prevent it from entering into busy loop
> even in that case.
> 
>         /*
>          * If postmaster asked us to stop, don't wait anymore.
>          *
>          * It's important to do this check after the recomputation of
>          * RecentFlushPtr, so we can send all remaining data before shutting
>          * down.
>          */
>         if (got_STOPPING)
>             break;
> 
> The above code in WalSndWaitForWal() seems to cause this issue. But I've
> not come up with idea about how to fix yet.

With DEBUG1 I observe that walsender is getting a lot of 'r' messages
(standby reply) with all zeroes:

2020-12-01 21:01:24.100 -03 [15307] DEBUG:  write 0/0 flush 0/0 apply 0/0

However, while doing that I also observed that if I do send some
activity to the logical replication stream, with the provided program,
it will *still* have the 'write' pointer set to 0/0, and the 'flush'
pointer has moved forward to what was sent.  I'm not clear on what
causes the write pointer to move forward in logical replication.

Still, the previously proposed patch does resolve the problem in either
case.