Re: BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date
Msg-id ZNmMqEXJ6ECTQ0Ys@paquier.xyz
Whole thread Raw
In response to Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Sat, Aug 12, 2023 at 08:30:34PM -0700, Noah Misch wrote:
> On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
>> On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
>>> Any user could call pg_logical_emit_message() to silently terminate the WAL
>>> stream, which is far worse than the original bug.  So far, I'm seeing one way
>>> to clearly fix $SUBJECT without that harm.  When a record header spans a page
>>> boundary, read the next page to reassemble the header.  If
>>> !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
>>> Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
>>> invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
>>> record.  A side benefit would be avoiding useless large allocations (1GB
>>> backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
>>> xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
>>> allocations.  Thoughts?  For invalid-length records in v16+, since every such
>>> record is end-of-wal or corruption, those versions could skip the CRC.
>>
>> That sounds quite strong.  But... why couldn't the existing
>> xlp_rem_len cross-check protect us from this failure mode?  If we
>> could defer the allocation until after that check (and the usual
>> ValidXLogRecordHeader() check), I think we'd know that we're really
>> looking at a size that was actually written in both pages (which must
>> also have survived xlp_pageaddr check), no?

+1 for relying on xlp_rem_len for that.

> Hmm, I think that is right.  A coincidental match of the 32-bit CRC is more
> probable than having all those fields appear valid by coincidence, especially
> xlp_pageaddr.

Hmm.  [.. thinks ..]

It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well?  It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc().  That should remove the need of having
gotheader from v1.
--
Michael

Attachment

pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #17973: Reinit of pgstats entry for dropped DB can break autovacuum daemon
Next
From: Julien Rouhaud
Date:
Subject: Re: BUG #18054: Permission denied.