Re: BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #17928: Standby fails to decode WAL on termination of primary
Msg-id ZNsXBFsFsKcCbP0q@paquier.xyz
In response to Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Michael Paquier <michael@paquier.xyz>)
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Tue, Aug 15, 2023 at 12:00:30PM +0900, Michael Paquier wrote:
> Not sure if that will help, but what I was playing with some stuff in
> the lines of:
> -- Store the length up to page boundary.
> select setting::int - ((pg_current_wal_insert_lsn() - '0/0') %
>   setting::int) as boundary from pg_settings where name = 'wal_block_size'
>   \gset
> -- Generate record up to boundary (56 bytes for base size of the record,
> -- stop at 12 bytes before the end of the page.
> select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));
>
> Then by injecting some FF's on the last page written and forcing
> replay I am able to force some of the error code paths, so I guess
> that's what you were basically doing?
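
For readers following along, the arithmetic in the quoted SQL can be sketched outside psql. This is only an illustration: the function names are hypothetical, the 8192-byte default for wal_block_size is an assumption, and the 56 and 12 byte figures are simply the ones from the quoted comment.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/14EA428' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def bytes_to_page_boundary(lsn: int, wal_block_size: int = 8192) -> int:
    """Bytes left on the current WAL page, as computed by the quoted query."""
    return wal_block_size - (lsn % wal_block_size)


def message_payload_length(lsn: int, wal_block_size: int = 8192,
                           record_overhead: int = 56, gap: int = 12) -> int:
    """Length to pass to repeat('a', ...) so that the record ends
    'gap' bytes before the end of the page.  record_overhead and gap
    are the values from the quoted comment, not independently derived."""
    return bytes_to_page_boundary(lsn, wal_block_size) - record_overhead - gap


if __name__ == "__main__":
    lsn = parse_lsn("0/14EA428")  # example position only
    print(bytes_to_page_boundary(lsn), message_payload_length(lsn))
```

If the insert position is already within 68 bytes of the page end, the computed length is negative, so a caller would want to emit a filler record first and recompute.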

I've been spending some extra time on this one and hacked a TAP test
that reliably reproduces the original issue, using a message similar
to what I mentioned in my previous messages.  I guess that we could
use something like that:
2023-08-15 15:07:03.790 JST [8729] LOG:  redo starts at 0/14EA428
2023-08-15 15:07:03.790 JST [8729] FATAL:  invalid memory alloc request size 4294969740
2023-08-15 15:07:03.791 JST [8726] LOG:  startup process (PID 8729) exited with exit code 1
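
As a rough sanity check on the FATAL line, the requested size can be compared with the allocator limit. The sketch below assumes the limit is MaxAllocSize = 0x3fffffff (1 GB - 1) as defined in memutils.h; the Python names are hypothetical, and why the size lands just above 2^32 is not something this check establishes.

```python
# Assumed to mirror PostgreSQL's MaxAllocSize (1 GB - 1).
MAX_ALLOC_SIZE = 0x3FFFFFFF


def alloc_size_is_valid(size: int) -> bool:
    """Hypothetical stand-in for the allocator's size check."""
    return 0 <= size <= MAX_ALLOC_SIZE


requested = 4294969740            # value from the FATAL line above
excess_over_4gb = requested - 2**32

if __name__ == "__main__":
    print(alloc_size_is_valid(requested), excess_over_4gb)
```

The request is a little more than 4 GB, so the allocation is rejected outright rather than attempted.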

The proposed patches pass the test, HEAD does not.  We may want to do
more with page boundaries, and more error patterns, but the idea looks
worth exploring more.  At least this can be used to validate patches.

I've noticed while hacking the test that we don't do an XLogFlush()
after inserting the message's record, so we may lose it on crash.
That makes the test unstable unless an extra record is added after
the logical messages.  The attached patch forces that for the sake of
the test, but I'm spawning a different thread as losing this data
looks like a bug to me.
--
Michael
