Re: 001_rep_changes.pl fails due to publisher stuck on shutdown - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Date
Msg-id ZmeZ6A7iSmytaTYq@paquier.xyz
Whole thread Raw
In response to Re: 001_rep_changes.pl fails due to publisher stuck on shutdown  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
List pgsql-hackers
On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote:
> During server shutdown, the latter half of the last continuation
> record may fail to be flushed. This is similar to what is described in
> the commit message of commit ff9f111bce. While shutting down,
> WalSndLoop() waits for XLogSendLogical() to consume WAL up to
> flushPtr, but in this case, the last record cannot complete without
> the continuation part starting from flushPtr, which is
> missing. However, in such cases, xlogreader.missingContrecPtr is set
> to the beginning of the missing part, but something similar to

-    /* If EndRecPtr is still past our flushPtr, it means we caught up. */
-    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+    /*
+     * If EndRecPtr is still past our flushPtr, it means we caught up.  When
+     * the server is shutting down, the latter part of a continuation record
+     * may be missing.  If got_STOPPING is true, assume we are caught up if the
+     * last record is missing its continuation part at flushPtr.
+     */
+    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+        (got_STOPPING &&
+         logical_decoding_ctx->reader->missingContrecPtr == flushPtr))

FWIW, I don't have a better idea than what you are proposing here.  We
just cannot receive more data past the page boundary in a shutdown
sequence in this context, so checking after the missingContrecPtr
is a good compromise to ensure that we don't remain stuck when
shutting down a logical WAL sender.  I'm surprised that we did not
hear about that more often on the lists, or perhaps we did but just
discarded it?

This is going to take some time to check across all the branches down
to v12 that this is stable, because this area of the code tends to
change slightly every year..  Well, that's my job.

> So, I believe the attached small patch fixes the behavior. I haven't
> come up with a good test script for this issue. Something like
> 026_overwrite_contrecord.pl might work, but this situation seems a bit
> more complex than what it handles.

Hmm.  Indeed you will not be able to reuse the same trick with the end
of a segment.  Still you should be able to get a rather stable test by
using the same tricks as 039_end_of_wal.pl to spawn a record across
multiple pages, no?
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Improve the granularity of PQsocketPoll's timeout parameter?
Next
From: Michael Paquier
Date:
Subject: Re: Format the code in xact_decode