Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Noah Misch
Msg-id: 20230812211327.GB2326466@rfd.leadboat.com
In response to: Re: BUG #17928: Standby fails to decode WAL on termination of primary (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
On Fri, Aug 11, 2023 at 08:32:00PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
> >> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
> >>> I recall earlier messages theorizing that it was just harder to hit in v14, so
> >>> I'm disinclined to stop at v15.  I think the main choice is whether to stop at
> >>> v11 (normal choice) or v12 (worry about breaking the last v11 point release).
> >>> I don't have a strong opinion between those.
> 
> > Okay.  I wouldn't be inclined to patch v11 for that, FWIW, as this
> > code path is touched by recovery and more.  At least it does not seem 
> > worth taking any risk compared to the potential gain.
> 
> That was my gut reaction too -- risk/reward seems not great for the
> last v11 release.  (I've been burned a couple of times now by putting
> can't-fix-it-anymore bugs into the final release of a branch, so maybe
> I'm just being overly paranoid.  But it's something to worry about.)

Okay, v12+ would be fine with me.  Alas, a closer look at v1 found two
defects, one with severity far greater than the bug v1 aimed to fix:


===== 1. For oversized records, it changed startup FATAL to silent data loss

This affects only the older supported branches (v15 and earlier).  In v16+, commit 8fcb32d
introduced and checked XLogRecordMaxSize.  But in v15 with patch v1, redo of
pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 1000))
just ends at the oversized record:

2523581 2023-08-12 17:17:39.512 GMT LOG:  redo starts at 0/41F155B8
2523581 2023-08-12 17:17:39.512 GMT LOG:  record length 1073740879 at 0/41F155F0 too long
2523581 2023-08-12 17:17:39.512 GMT LOG:  redo done at 0/41F155B8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug.  So far, I'm seeing one way
to clearly fix $SUBJECT without that harm.  When a record header spans a page
boundary, read the next page to reassemble the header.  If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
record.  A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations.  Thoughts?  For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.
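
To make the ordering explicit, here is a rough sketch of that flow for a record
whose reassembled header claims an oversized length (the enum and helper names
are placeholders for illustration, not existing xlogreader code):

===
#include <stdbool.h>

/*
 * Sketch only.  "header_valid" stands for ValidXLogRecordHeader() run on the
 * header reassembled across the page boundary (the xl_rmid / xl_prev checks);
 * "crc_valid" stands for a CRC computed over the whole record in chunks.
 */
typedef enum
{
    TREAT_AS_END_OF_WAL,        /* recycled-page garbage: stop replay quietly */
    FATAL_OVERSIZED_RECORD      /* genuine oversized record: ereport(FATAL) */
} OversizedVerdict;

static OversizedVerdict
oversized_record_verdict(bool header_valid, bool crc_valid)
{
    if (!header_valid)
        return TREAT_AS_END_OF_WAL;     /* bad xl_rmid or xl_prev */
    if (!crc_valid)
        return TREAT_AS_END_OF_WAL;     /* torn or bogus record body */
    return FATAL_OVERSIZED_RECORD;      /* valid but too long to replay */
}
===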

As an alternative, zeroing recycled pages could remove "most of" the spurious
errors without adding silent data loss.  I considered another alternative of
back-patching XLogRecordMaxSize and proceeding with a patch like v1, but that
wouldn't help with records in existing WAL archives.  Of course, we could also
choose to leave $SUBJECT unfixed in v15-.


===== 2. DecodeXLogRecordRequiredSpace() is the wrong ceiling

v1 treats DecodeXLogRecordRequiredSpace(L) (=L+2445 here) as the max needed,
but allocate_recordbuf(state, L) allocates up to L+8192.  If the uninitialized
length happened to contain an integer between about 1<<30-8192 and 1<<30-2445,
one still got e.g. "FATAL:  invalid memory alloc request size 1073741824".
Since v16+ has XLogRecordMaxSize, testing XLogRecordMaxSize avoids the problem
and should suffice:

===
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -640,7 +640,6 @@ restart:
                               (uint32) SizeOfXLogRecord, total_len);
         goto err;
     }
-#ifndef FRONTEND
 
     /*
      * We may be looking at a random uint32 read from a recycled segment.  For
@@ -651,13 +650,12 @@ restart:
      * the allocation may succeed but record checks are going to fail so this
      * would be short-lived.
      */
-    if (!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))
+    if (total_len > XLogRecordMaxSize)
     {
         report_invalid_record(state, "record length %u at %X/%X too long",
                               total_len, LSN_FORMAT_ARGS(RecPtr));
         goto err;
     }
-#endif
 
     /*
      * If the whole record header is on this page, validate it immediately.
===

allocate_recordbuf()'s addend is not critical; it's just to "avoid useless
small increases".  v15- should avoid the problem like this:

===
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 969bcc3..1bac303 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -201,4 +201,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
     newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
     newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+#ifndef FRONTEND
+    if (!AllocSizeIsValid(newSize))
+        newSize = reclength;
+#endif
 
     if (state->readRecordBuf)
===

In v15, one can reach this case with pg_logical_emit_message(false, '_',
repeat('x', 1024 * 1024 * 1024 - 4000)).  On v16+, one might reach this with
unlucky trailing garbage, but XLogRecordMaxSize prevents reaching it through
pg_logical_emit_message().
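
For concreteness, the window in point 2 can be checked with a few lines of
standalone C (MaxAllocSize 0x3fffffff and XLOG_BLCKSZ 8192 are the usual
defaults; 2445 is the DecodeXLogRecordRequiredSpace() overhead quoted above,
which varies by build):

===
#include <stdint.h>
#include <stdio.h>

#define MAX_ALLOC_SIZE  ((uint32_t) 0x3fffffff)  /* MaxAllocSize */
#define XLOG_BLCKSZ     8192
#define DECODE_OVERHEAD 2445    /* DecodeXLogRecordRequiredSpace(L) - L, here */

int
main(void)
{
    uint32_t total_len = (UINT32_C(1) << 30) - 5000;    /* inside the window */

    /* allocate_recordbuf(): newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ) */
    uint32_t newSize = total_len + (XLOG_BLCKSZ - total_len % XLOG_BLCKSZ);

    printf("v1's check passes:       %d\n",
           total_len + DECODE_OVERHEAD <= MAX_ALLOC_SIZE);
    printf("palloc would still fail: %d (newSize = %u)\n",
           newSize > MAX_ALLOC_SIZE, (unsigned) newSize);
    return 0;
}
===

For this length both checks print 1, and newSize comes out to 1073741824,
i.e. the alloc-request failure above despite v1's check passing.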


Thanks,
nm


