Re: BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From Kyotaro Horiguchi
Subject Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date
Msg-id 20230711.155414.939136339819647605.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary
List pgsql-bugs
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> > > 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1
> > 
> > Thanks Alexander.  Looking into this.  I think it is probably
> > something like: recycled standby pages are not zeroed (something we
> > already needed to do something about[1]), and when we read a recycled
> > garbage size (like your "xxxx") at the end of a page at an offset
> > where we don't have a full record header on one page, we skip the
> > ValidXLogRecordHeader() call (and always did), but the check in
> > allocate_recordbuf() which previously handled that "gracefully" (well,
> > it would try to allocate up to 1GB bogusly, but it wouldn't try to
> > allocate more than that and ereport) is a bit too late.  I probably
> > need to add an earlier not-too-big validation.  Thinking.
> 
> I agree about an earlier not-too-big validation.  Like the attached?  I
> haven't tested it with Alexander's recipe or pondered it thoroughly.

I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.

> > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> 
> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> reading the whole header before allocating xl_tot_len?  (Are there benefits
> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> that benefit worth the cycles?)

I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.

regards.


Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

pgsql-bugs by date:

Previous
From: Thomas Munro
Date:
Subject: Re: BUG #18009: Postgres Recovery not happening
Next
From: Andrew Dunstan
Date:
Subject: Re: BUG #17994: Invalidating relcache corrupts tupDesc inside ExecEvalFieldStoreDeForm()