Thread: walsender bug: stuck during shutdown
Hello Chloe Dives reported that sometimes a walsender would become stuck during shutdown and *not* shutdown, thus preventing postmaster from completing the shutdown cycle. This has been observed to cause the servers to remain in such state for several hours. After a lengthy investigation and thanks to a handy reproducer by Chris Wilson, we found that the problem is that WalSndDone wants to avoid shutting down until everything has been sent and acknowledged; but this test is coded in a way that ignores the possibility that we have never received anything from the other end. In that case, both MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition in WalSndDone to terminate the loop is never fulfilled. So the walsender is looping forever and never terminates, blocking shutdown of the whole instance. The attached patch fixes the problem by testing for the problematic condition. Apparently this problem has existed forever. Fujii-san almost patched for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch. -- Álvaro Herrera
Attachment
On 2020/11/24 5:52, Alvaro Herrera wrote: > Hello > > Chloe Dives reported that sometimes a walsender would become stuck > during shutdown and *not* shutdown, thus preventing postmaster from > completing the shutdown cycle. This has been observed to cause the > servers to remain in such state for several hours. > > After a lengthy investigation and thanks to a handy reproducer by Chris > Wilson, we found that the problem is that WalSndDone wants to avoid > shutting down until everything has been sent and acknowledged; but this > test is coded in a way that ignores the possibility that we have never > received anything from the other end. In that case, both > MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition > in WalSndDone to terminate the loop is never fulfilled. So the > walsender is looping forever and never terminates, blocking shutdown of > the whole instance. > > The attached patch fixes the problem by testing for the problematic > condition. > > Apparently this problem has existed forever. Fujii-san almost patched > for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch. Thanks for working on this! Could you tell me the discussion thread where Chloe Dives reported the issue to? Sorry I could not find that.. I'd like to see the procedure to reproduce the issue. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Hello, On 2020-Nov-24, Fujii Masao wrote: > Thanks for working on this! > Could you tell me the discussion thread where Chloe Dives reported the issue to? > Sorry I could not find that.. It was not public -- sorry I didn't make that clear. > I'd like to see the procedure to reproduce the issue. Here's the script. Thanks!
Attachment
On 2020/11/25 0:07, Alvaro Herrera wrote: > Hello, > > On 2020-Nov-24, Fujii Masao wrote: > >> Thanks for working on this! >> Could you tell me the discussion thread where Chloe Dives reported the issue to? >> Sorry I could not find that.. > > It was not public -- sorry I didn't make that clear. > >> I'd like to see the procedure to reproduce the issue. > > Here's the script. Thanks! But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr, walsender should keep waiting for the ack to all the sent message to be replied, i.e., isn't this expected behavior of normal shutdown? That is, if we want to shutdown walsender even when the client side doesn't reply message, immediate shutdown should be used or the client side should be terminated, instead? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-Nov-25, Fujii Masao wrote: > But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr, > walsender should keep waiting for the ack to all the sent message to be > replied, i.e., isn't this expected behavior of normal shutdown? That is, > if we want to shutdown walsender even when the client side doesn't > reply message, immediate shutdown should be used or the client side > should be terminated, instead? I don't think "waiting forever" can be considered the expected behavior; this has caused what are nominally production outages several times already, since we sent a shutdown signal to the server and it never completed shutting down. If you want to propose a better patch to address the issue, feel free, but keeping things as they are seems a bad idea to me. (Hmm, maybe another idea would be to have WalSndDone cause a keepalive message to be sent to the client and complete shut down when we get a reply to that.)
On 2020/11/26 3:45, Alvaro Herrera wrote: > On 2020-Nov-25, Fujii Masao wrote: > >> But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr, >> walsender should keep waiting for the ack to all the sent message to be >> replied, i.e., isn't this expected behavior of normal shutdown? That is, >> if we want to shutdown walsender even when the client side doesn't >> reply message, immediate shutdown should be used or the client side >> should be terminated, instead? > > I don't think "waiting forever" can be considered the expected behavior; > this has caused what are nominally production outages several times > already, since we sent a shutdown signal to the server and it never > completed shutting down. On the second thought, walsender doesn't wait forever unless wal_sender_timeout is disabled, even in the case in discussion? Or if there is the case where wal_sender_timeout doesn't work expectedly, we might need to fix that at first. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-Nov-26, Fujii Masao wrote: > On the second thought, walsender doesn't wait forever unless > wal_sender_timeout is disabled, even in the case in discussion? > Or if there is the case where wal_sender_timeout doesn't work expectedly, > we might need to fix that at first. Hmm, no, it doesn't wait forever in that sense; tracing with the debugger shows that the process is looping continuously.
On 2020/11/26 11:45, Alvaro Herrera wrote: > On 2020-Nov-26, Fujii Masao wrote: > >> On the second thought, walsender doesn't wait forever unless >> wal_sender_timeout is disabled, even in the case in discussion? >> Or if there is the case where wal_sender_timeout doesn't work expectedly, >> we might need to fix that at first. > > Hmm, no, it doesn't wait forever in that sense; tracing with the > debugger shows that the process is looping continuously. Yes, so the problem here is that walsender goes into the busy loop in that case. Seems this happens only in logical replication walsender. In physical replication walsender, WaitLatchOrSocket() in WalSndLoop() seems to work as expected and prevent it from entering into busy loop even in that case. /* * If postmaster asked us to stop, don't wait anymore. * * It's important to do this check after the recomputation of * RecentFlushPtr, so we can send all remaining data before shutting * down. */ if (got_STOPPING) break; The above code in WalSndWaitForWal() seems to cause this issue. But I've not come up with idea about how to fix yet. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-Nov-26, Fujii Masao wrote: > Yes, so the problem here is that walsender goes into the busy loop > in that case. Seems this happens only in logical replication walsender. > In physical replication walsender, WaitLatchOrSocket() in WalSndLoop() > seems to work as expected and prevent it from entering into busy loop > even in that case. > > /* > * If postmaster asked us to stop, don't wait anymore. > * > * It's important to do this check after the recomputation of > * RecentFlushPtr, so we can send all remaining data before shutting > * down. > */ > if (got_STOPPING) > break; > > The above code in WalSndWaitForWal() seems to cause this issue. But I've > not come up with idea about how to fix yet. With DEBUG1 I observe that walsender is getting a lot of 'r' messages (standby reply) with all zeroes: 2020-12-01 21:01:24.100 -03 [15307] DEBUG: write 0/0 flush 0/0 apply 0/0 However, while doing that I also observed that if I do send some activity to the logical replication stream, with the provided program, it will *still* have the 'write' pointer set to 0/0, and the 'flush' pointer has moved forward to what was sent. I'm not clear on what causes the write pointer to move forward in logical replication. Still, the previously proposed patch does resolve the problem in either case.