Thread: BUG #17928: Standby fails to decode WAL on termination of primary

BUG #17928: Standby fails to decode WAL on termination of primary

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17928
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

When a primary server is stopped, a standby might fail to replay WAL with
a memory allocation error, for example:
2023-05-11 08:42:38.254 MSK [1706654] LOG:  recovery restart point at
7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL:  Last completed transaction
was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG:  restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL:  invalid memory alloc request
size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG:  startup process (PID 1706656)
exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG:  terminating any other active
server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG:  shutting down due to startup
process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG:  database system is shut down

The call stack of the error:
...
#6  0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at
mcxt.c:1143
#7  0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08,
xl_tot_len=2021161080, 
    allow_oversized=true) at xlogreader.c:524
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
#9  0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock
(pgsr_private=94350447959632, lsn=0x55cfacf4bc58)
    at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at
xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08,
lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord
(prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458)
    at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50,
emode=15, fetching_ckpt=false, 
    replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at
auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at
postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at
postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202

(gdb) frame 8
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
689             decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev =
0x7878787878787878, xl_info = 0x0, 
  xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0

hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56  00 00 00 00 00 00 00 00 ...@...V........
00703ff0 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00  00 40 70 30 00 00 00 00 .........@p0....
00704010 db 00 00 00 00 00 00 00  78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00  33 02 00 00 1b 0d 00 00 ........3.......

hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56  00 00 00 00 00 00 00 00 ...@...V........
00703ff0 33 02 00 00 54 2a 00 00  e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00  00 40 70 b1 00 00 00 00 .........@p.....
00704010 23 02 00 00 00 00 00 00  80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00  05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00  02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78  78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78  01 00 00 00 00 00 00 00 xxxxxxxx........

(In this case, rows with 500 'x' characters were inserted when the primary was
stopped.)

`git bisect` for this behavior blames 3f1ce9734 (where
XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).

A reproducer for the anomaly to follow.


Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Alexander Lakhin
Date:
11.05.2023 11:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17928
> ...
> `git bisect` for this behavior blames 3f1ce9734 (where
> XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).
>
> A reproducer for the anomaly to follow.
The TAP test that demonstrates the issue is attached. To catch the failure
faster, I place it in multiple directories src/test/recoveryX/t, add
minimal Makefiles, and run (on tmpfs):
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j7 --linebuffer --tag make -s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 || break; done

iteration 1
recovery1       +++ tap check in src/test/recovery1 +++
recovery2       +++ tap check in src/test/recovery2 +++
recovery3       +++ tap check in src/test/recovery3 +++
recovery4       +++ tap check in src/test/recovery4 +++
recovery5       +++ tap check in src/test/recovery5 +++
recovery6       +++ tap check in src/test/recovery6 +++
recovery7       +++ tap check in src/test/recovery7 +++
...
recovery5       # Restarting primary instance (49)
recovery3       # Restarting primary instance (49)
recovery7       # Restarting primary instance (49)
recovery2       Bailout called.  Further testing stopped:  pg_ctl stop failed
recovery2       FAILED--Further testing stopped: pg_ctl stop failed
recovery2       make: *** [Makefile:6: check] Error 255
parallel: This job failed:
make -s check -C src/test/recovery2

tail src/test/recovery2/tmp_check/log/099_restart_with_stanby_standby.log
2023-05-11 20:19:22.247 MSK [2046385] DETAIL:  End of WAL reached on timeline 1 at 3/64BDFF8.
2023-05-11 20:19:22.247 MSK [2046385] FATAL:  could not send end-of-streaming message to primary: server closed the 
connection unexpectedly
                 This probably means the server terminated abnormally
                 before or while processing the request.
         no COPY in progress
2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1
2023-05-11 20:19:22.248 MSK [2037114] LOG:  terminating any other active server processes
2023-05-11 20:19:22.248 MSK [2037114] LOG:  shutting down due to startup process failure
2023-05-11 20:19:22.249 MSK [2037114] LOG:  database system is shut down

Best regards,
Alexander
Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1

Thanks Alexander.  Looking into this.  I think it is probably
something like: recycled standby pages are not zeroed (something we
already needed to do something about[1]), and when we read a recycled
garbage size (like your "xxxx") at the end of a page at an offset
where we don't have a full record header on one page, we skip the
ValidXLogRecordHeader() call (and always did), but the check in
allocate_recordbuf() which previously handled that "gracefully" (well,
it would try to allocate up to 1GB bogusly, but it wouldn't try to
allocate more than that and ereport) is a bit too late.  I probably
need to add an earlier not-too-big validation.  Thinking.

[1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> > 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1
> 
> Thanks Alexander.  Looking into this.  I think it is probably
> something like: recycled standby pages are not zeroed (something we
> already needed to do something about[1]), and when we read a recycled
> garbage size (like your "xxxx") at the end of a page at an offset
> where we don't have a full record header on one page, we skip the
> ValidXLogRecordHeader() call (and always did), but the check in
> allocate_recordbuf() which previously handled that "gracefully" (well,
> it would try to allocate up to 1GB bogusly, but it wouldn't try to
> allocate more than that and ereport) is a bit too late.  I probably
> need to add an earlier not-too-big validation.  Thinking.

I agree about an earlier not-too-big validation.  Like the attached?  I
haven't tested it with Alexander's recipe or pondered it thoroughly.

> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de

Regarding [1], is it still worth zeroing recycled pages on standbys and/or
reading the whole header before allocating xl_tot_len?  (Are there benefits
other than avoiding a 1G backend allocation or 4G frontend allocation, or is
that benefit worth the cycles?)

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Kyotaro Horiguchi
Date:
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> > > 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1
> > 
> > Thanks Alexander.  Looking into this.  I think it is probably
> > something like: recycled standby pages are not zeroed (something we
> > already needed to do something about[1]), and when we read a recycled
> > garbage size (like your "xxxx") at the end of a page at an offset
> > where we don't have a full record header on one page, we skip the
> > ValidXLogRecordHeader() call (and always did), but the check in
> > allocate_recordbuf() which previously handled that "gracefully" (well,
> > it would try to allocate up to 1GB bogusly, but it wouldn't try to
> > allocate more than that and ereport) is a bit too late.  I probably
> > need to add an earlier not-too-big validation.  Thinking.
> 
> I agree about an earlier not-too-big validation.  Like the attached?  I
> haven't tested it with Alexander's recipe or pondered it thoroughly.

I like the patch for its code clean up, but I'm afraid that it removes
the existing record length check when reading continued
pages. However, I'm unsure about merely adding a check for too-long
records, due to the potential for requesting an excessively large
amount of memory, even if it will be released shortly.

> > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> 
> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> reading the whole header before allocating xl_tot_len?  (Are there benefits
> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> that benefit worth the cycles?)

I believe reading the whole header is the most sensible approach as it
can prevent unnecessary memory requests. Another potential solution
(or hack) for this specific case is to let XLogWalRcvFlush write a
finalizing ((uint32)0) when dying is true. This stabilizes the
behavior to "invalid record length.. got 0" when running the TAP test.

regards.


Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL:  invalid memory alloc request size 2021163525
> > > > 2023-05-11 20:19:22.248 MSK [2037114] LOG:  startup process (PID 2037134) exited with exit code 1

On my system, testing unpatched REL_15_2, the test reached its first failure
at minute 53 and its second failure at minute 189.  Perhaps a faster,
deterministic test would be feasible:

- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from
  pg_current_wal_insert_lsn() such that new message crosses a segment boundary
  and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.
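
In TAP-test form the outline could look roughly like this (an untested
sketch; the filler size, message count, and the 123456-byte payload are
placeholders for values that would really be computed from
pg_current_wal_insert_lsn()):

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;
    use Test::More;

    my $primary = PostgreSQL::Test::Cluster->new('primary');
    $primary->init(allows_streaming => 1);
    $primary->start;
    $primary->backup('bkp');
    my $standby = PostgreSQL::Test::Cluster->new('standby');
    $standby->init_from_backup($primary, 'bkp', has_streaming => 1);
    $standby->start;

    # Fill a few segments so the standby has old segments to recycle.
    $primary->safe_psql('postgres',
        q{SELECT pg_logical_emit_message(false, 'filler', repeat('x', 1000000))
          FROM generate_series(1, 64)});

    # Checkpoint so the standby can recycle those segments (a real test would
    # also make sure a restartpoint actually happens on the standby).
    $primary->safe_psql('postgres', 'CHECKPOINT');
    $primary->wait_for_catchup($standby);

    # One more message, sized so that the record crosses a segment boundary
    # and ends 4 bytes before the end of a page; 123456 is a placeholder for
    # the length computed from pg_current_wal_insert_lsn().
    $primary->safe_psql('postgres',
        q{SELECT pg_logical_emit_message(false, 'edge', repeat('x', 123456))});
    $primary->wait_for_catchup($standby);

    # Stop the primary; a buggy standby then dies with
    # "invalid memory alloc request size ...".
    $primary->stop;
    sleep 5;    # crude: give the standby time to hit the failure, if any
    my $ret = $standby->psql('postgres', 'SELECT 1');
    is($ret, 0, 'standby survives termination of the primary');

    done_testing();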

> > > Thanks Alexander.  Looking into this.  I think it is probably
> > > something like: recycled standby pages are not zeroed (something we
> > > already needed to do something about[1]), and when we read a recycled
> > > garbage size (like your "xxxx") at the end of a page at an offset
> > > where we don't have a full record header on one page, we skip the
> > > ValidXLogRecordHeader() call (and always did), but the check in
> > > allocate_recordbuf() which previously handled that "gracefully" (well,
> > > it would try to allocate up to 1GB bogusly, but it wouldn't try to
> > > allocate more than that and ereport) is a bit too late.  I probably
> > > need to add an earlier not-too-big validation.  Thinking.
> > 
> > I agree about an earlier not-too-big validation.  Like the attached?  I
> > haven't tested it with Alexander's recipe or pondered it thoroughly.
> 
> I like the patch for its code clean up, but I'm afraid that it removes
> the existing record length check when reading continued
> pages.

The removed check would be repetitive in all cases known to me, but I wouldn't
mind keeping the check instead.

> However, I'm unsure about merely adding a check for too-long
> records, due to the potential for requesting an excessively large
> amount of memory, even if it will be released shortly.
> 
> > > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> > 
> > Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> > reading the whole header before allocating xl_tot_len?  (Are there benefits
> > other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> > that benefit worth the cycles?)
> 
> I believe reading the whole header is the most sensible approach as it
> can prevent unnecessary memory requests. Another potential solution
> (or hack) for this specific case is to let XLogWalRcvFlush write a
> finalizing ((uint32)0) when dying is true. This stabilizes the
> behavior to "invalid record length.. got 0" when running the TAP test.

> --- a/src/backend/replication/walreceiver.c
> +++ b/src/backend/replication/walreceiver.c
> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>  {
>      Assert(tli != 0);
>  
> +    if (dying)
> +    {
> +        int buflen = sizeof(uint32);
> +        char buf[sizeof(uint32)];
> +        
> +        memset(buf, 0, buflen);
> +        XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
> +    }
> +    
>      if (LogstreamResult.Flush < LogstreamResult.Write)
>      {
>          WalRcvData *walrcv = WalRcv;

That's inexpensive in cycles, and it should make the failures in question
quite rare.  If we do that instead of pre-zeroing, I think we'd still want to
be ready for invalid lengths, for the case of unclean standby shutdown.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Kyotaro Horiguchi
Date:
At Sun, 16 Jul 2023 17:49:05 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
> > I like the patch for its code clean up, but I'm afraid that it removes
> > the existing record length check when reading continued
> > pages.
> 
> The removed check would be repetitive in all cases known to me, but I wouldn't
> mind keeping the check instead.

Ah. I got it. No need to do that again.

> > I believe reading the whole header is the most sensible approach as it
> > can prevent unnecessary memory requests. Another potential solution
> > (or hack) for this specific case is to let XLogWalRcvFlush write a
> > finalizing ((uint32)0) when dying is true. This stabilizes the
> > behavior to "invalid record length.. got 0" when running the TAP test.
...
> That's inexpensive in cycles, and it should make the failures in question
> quite rare.  If we do that instead of pre-zeroing, I think we'd still want to
> be ready for invalid lengths, for the case of unclean standby shutdown.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
> On my system, testing unpatched REL_15_2, the test reached its first failure
> at minute 53 and its second failure at minute 189.  Perhaps a faster,
> deterministic test would be feasible:

It took less than five minutes for me to reproduce the failure using
the test case sent by Alexander on HEAD:
2023-08-10 15:26:37.401 JST [11239] FATAL:  invalid memory alloc request size 2021163525
2023-08-10 15:26:37.405 JST [11235] LOG:  startup process (PID 11239) exited with exit code 1

I don't see it after two hours of tests with the patch.

> - Use pg_logical_emit_message to fill a few segments with 0xFF.
> - CHECKPOINT the primary, so the standby recycles segments.
> - One more pg_logical_emit_message, computing the length from
>   pg_current_wal_insert_lsn() such that new message crosses a segment boundary
>   and ends 4 bytes before the end of a page.
> - Stop the primary.
> - If the bug is present, the standby will exit.

Good idea to pollute the data with recycled segments.  Using a minimal
WAL segment size would help here as well in keeping a test cheap, and
two segments should be enough.  The alignment calculations and the
header size can be known, but the standby records are an issue for the
predictability of the test when it comes to adjusting the length of the
logical message depending on the 8k WAL page, no?
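
For illustration only, a small segment size could be requested directly at
node initialization in a TAP test (untested sketch; $primary stands for the
primary's PostgreSQL::Test::Cluster object):

    # Hypothetical: 1MB WAL segments (the minimum initdb accepts) keep the
    # number of segments that must be filled and recycled small.
    $primary->init(allows_streaming => 1, extra => ['--wal-segsize=1']);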

>> However, I'm unsure about merely adding a check for too-long
>> records, due to the potential for requesting an excessively large
>> amount of memory, even if it will be released shortly.
>>
>>>> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
>>>
>>> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
>>> reading the whole header before allocating xl_tot_len?  (Are there benefits
>>> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
>>> that benefit worth the cycles?)
>>
>> I believe reading the whole header is the most sensible approach as it
>> can prevent unnecessary memory requests. Another potential solution
>> (or hack) for this specific case is to let XLogWalRcvFlush write a
>> finalizing ((uint32)0) when dying is true. This stabilizes the
>> behavior to "invalid record length.. got 0" when running the TAP test.

FWIW, I came back to this thread while tweaking the error reporting of
xlogreader.c for the sake of this thread and this proposal is an
improvement to be able to make a distinction between an OOM and an
incorrect record:
https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz

Anyway, agreed that it's an improvement to remove this check out of
allocate_recordbuf().  Noah, are you planning to work more on that?

>> --- a/src/backend/replication/walreceiver.c
>> +++ b/src/backend/replication/walreceiver.c
>> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>>  {
>>      Assert(tli != 0);
>>
>> +    if (dying)
>> +    {
>> +        int buflen = sizeof(uint32);
>> +        char buf[sizeof(uint32)];
>> +
>> +        memset(buf, 0, buflen);
>> +        XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
>> +    }
>> +
>>      if (LogstreamResult.Flush < LogstreamResult.Write)
>>      {
>>          WalRcvData *walrcv = WalRcv;
>
> That's inexpensive in cycles, and it should make the failures in question
> quite rare.  If we do that instead of pre-zeroing, I think we'd still want to
> be ready for invalid lengths, for the case of unclean standby shutdown.

Tweaking the WAL receiver to do more zeroing has been discussed in the
past, but as far as I recall we were not completely sure whether it's
completely free, either.  It seems to me that this should have a
discussion on its own, in a new thread.  Not sure if we should worry
about archiving, actually, even if the segments should be padded with
zeros beforehand.

+   gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;
Perhaps this could use more parentheses around the checked expression,
for readability.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
>> - Use pg_logical_emit_message to fill a few segments with 0xFF.
>> - CHECKPOINT the primary, so the standby recycles segments.
>> - One more pg_logical_emit_message, computing the length from
>>   pg_current_wal_insert_lsn() such that new message crosses a segment boundary
>>   and ends 4 bytes before the end of a page.
>> - Stop the primary.
>> - If the bug is present, the standby will exit.
>
> Good idea to pollute the data with recycled segments.  Using a minimal
> WAL segment size whould help here as well in keeping a test cheap, and
> two segments should be enough.  The alignment calculations and the
> header size can be known, but the standby records are an issue for the
> predictability of the test when it comes to adjust the length of the
> logical message depending on the 8k WAL page, no?

Actually, for this one, I think that I have a simpler idea to make it
deterministic.  Once we have inserted a record at the page limit on
the primary, we can:
- Stop the standby
- Stop the primary
- Rewrite a few bytes ourselves in the last segment on the standby to
emulate a recycled segment portion, based on the end LSN of the logical
message record, retrieved either with pg_walinspect or pg_waldump.
- Start the standby, which would replay up to the previous record at
the page limit.
- The standby should then just sit in a state where it waits for the
missing records from the primary and keeps retrying streaming, but it
should not fail at startup.
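
The byte-rewriting step itself could be as small as something like this in
a TAP test (untested sketch; $standby is the standby's node object, and the
segment name and offset are illustrative values that would really come from
pg_waldump/pg_walinspect):

    # Hypothetical: overwrite a few bytes just past the last replayed record
    # on the standby's copy of the segment, emulating recycled contents.
    my $seg    = $standby->data_dir . '/pg_wal/000000010000000000000002';
    my $offset = 0x3FF0;
    open my $fh, '+<:raw', $seg or die "could not open $seg: $!";
    seek($fh, $offset, 0) or die "could not seek: $!";
    print $fh "\x78" x 16;    # the same 'xxxx' garbage as in the report
    close $fh;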
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
> > On my system, testing unpatched REL_15_2, the test reached its first failure
> > at minute 53 and its second failure at minute 189.  Perhaps a faster,
> > deterministic test would be feasible:
> 
> It took less than five minutes for me to reproduce the failure using
> the test case sent by Alexander on HEAD:
> 2023-08-10 15:26:37.401 JST [11239] FATAL:  invalid memory alloc request size 2021163525
> 2023-08-10 15:26:37.405 JST [11235] LOG:  startup process (PID 11239) exited with exit code 1
> 
> I don't see it after two hours of tests with the patch.

That is good.

> > - Use pg_logical_emit_message to fill a few segments with 0xFF.
> > - CHECKPOINT the primary, so the standby recycles segments.
> > - One more pg_logical_emit_message, computing the length from
> >   pg_current_wal_insert_lsn() such that new message crosses a segment boundary
> >   and ends 4 bytes before the end of a page.
> > - Stop the primary.
> > - If the bug is present, the standby will exit.
> 
> Good idea to pollute the data with recycled segments.  Using a minimal
> WAL segment size whould help here as well in keeping a test cheap, and
> two segments should be enough.  The alignment calculations and the
> header size can be known, but the standby records are an issue for the
> predictability of the test when it comes to adjust the length of the
> logical message depending on the 8k WAL page, no?

Could be.  I expect there would be challenges translating that outline into a
real test, but I don't know if that will be a major one.  The test doesn't
need to be 100% deterministic.  If it fails 25% of the time and is not the
slowest test in the recovery suite, I'd find that good enough.

> >>>> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> >>> 
> >>> Regarding [1], is it still worth zeroing recycled pages on standbys and/or
> >>> reading the whole header before allocating xl_tot_len?  (Are there benefits
> >>> other than avoiding a 1G backend allocation or 4G frontend allocation, or is
> >>> that benefit worth the cycles?)
> >> 
> >> I believe reading the whole header is the most sensible approach as it
> >> can prevent unnecessary memory requests. Another potential solution
> >> (or hack) for this specific case is to let XLogWalRcvFlush write a
> >> finalizing ((uint32)0) when dying is true. This stabilizes the
> >> behavior to "invalid record length.. got 0" when running the TAP test.
> 
> FWIW, I came back to this thread while tweaking the error reporting of
> xlogreader.c for the sake of this thread and this proposal is an
> improvement to be able to make a distinction between an OOM and an
> incorrect record:
> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz
> 
> Anyway, agreed that it's an improvement to remove this check out of
> allocate_recordbuf().  Noah, are you planning to work more on that?

I can push xl_tot_len-validate-v1.patch, particularly given the testing result
you reported today.  I'm content for my part to stop there.

> >> --- a/src/backend/replication/walreceiver.c
> >> +++ b/src/backend/replication/walreceiver.c
> >> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
> >>  {
> >>      Assert(tli != 0);
> >>  
> >> +    if (dying)
> >> +    {
> >> +        int buflen = sizeof(uint32);
> >> +        char buf[sizeof(uint32)];
> >> +        
> >> +        memset(buf, 0, buflen);
> >> +        XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
> >> +    }
> >> +    
> >>      if (LogstreamResult.Flush < LogstreamResult.Write)
> >>      {
> >>          WalRcvData *walrcv = WalRcv;
> > 
> > That's inexpensive in cycles, and it should make the failures in question
> > quite rare.  If we do that instead of pre-zeroing, I think we'd still want to
> > be ready for invalid lengths, for the case of unclean standby shutdown.
> 
> Tweaking the WAL receiver to do more zeroing has been discussed in the
> past, but as far as I recall we were not completely sure whether it's
> completely free, either.  It seems to me that this should have a
> discussion on its own, in a new thread.  Not sure if we should worry
> about archiving, actually, even if the segments should be padded with
> zeros beforehand.

Okay.

> +   gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;

Okay.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
> On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
>> Good idea to pollute the data with recycled segments.  Using a minimal
>> WAL segment size whould help here as well in keeping a test cheap, and
>> two segments should be enough.  The alignment calculations and the
>> header size can be known, but the standby records are an issue for the
>> predictability of the test when it comes to adjust the length of the
>> logical message depending on the 8k WAL page, no?
>
> Could be.  I expect there would be challenges translating that outline into a
> real test, but I don't know if that will be a major one.  The test doesn't
> need to be 100% deterministic.  If it fails 25% of the time and is not the
> slowest test in the recovery suite, I'd find that good enough.

FWIW, I'm having a pretty hard time getting something close enough to a
page border in a reliable way.  Perhaps using a larger series of
records and selecting only one would be more reliable...  Need to test
that a bit more.

>> FWIW, I came back to this thread while tweaking the error reporting of
>> xlogreader.c for the sake of this thread and this proposal is an
>> improvement to be able to make a distinction between an OOM and an
>> incorrect record:
>> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz
>>
>> Anyway, agreed that it's an improvement to remove this check out of
>> allocate_recordbuf().  Noah, are you planning to work more on that?
>
> I can push xl_tot_len-validate-v1.patch, particularly given the testing result
> you reported today.  I'm content for my part to stop there.

Okay, fine by me.  That's going to help with what I am doing in the
other thread as I'd need to make a better difference between the OOM
and the invalid cases for the allocation path.

You are planning a backpatch to take care of the inconsistency,
right?  The report is against 15~, where the prefetching was
introduced.  I'd be OK with just doing that and not messing with the stable
branches more than necessary (aka ~14) if nobody complains, especially
since REL_11_STABLE is planned to be EOL'd in the next minor cycle.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
> On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
> > On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> >> Good idea to pollute the data with recycled segments.  Using a minimal
> >> WAL segment size whould help here as well in keeping a test cheap, and
> >> two segments should be enough.  The alignment calculations and the
> >> header size can be known, but the standby records are an issue for the
> >> predictability of the test when it comes to adjust the length of the
> >> logical message depending on the 8k WAL page, no?
> > 
> > Could be.  I expect there would be challenges translating that outline into a
> > real test, but I don't know if that will be a major one.  The test doesn't
> > need to be 100% deterministic.  If it fails 25% of the time and is not the
> > slowest test in the recovery suite, I'd find that good enough.
> 
> FWIW, I'm having a pretty hard time to get something close enough to a
> page border in a reliable way.  Perhaps using a larger series of
> records and select only one would be more reliable..  Need to test
> that a bit more.

Interesting.  So pg_logical_emit_message(false, 'X', repeat('X', n)) doesn't
get close enough, but s/n/n+1/ spills to the next page?  If so, I did not
anticipate that.

> >> FWIW, I came back to this thread while tweaking the error reporting of
> >> xlogreader.c for the sake of this thread and this proposal is an
> >> improvement to be able to make a distinction between an OOM and an
> >> incorrect record:
> >> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz
> >> 
> >> Anyway, agreed that it's an improvement to remove this check out of
> >> allocate_recordbuf().  Noah, are you planning to work more on that?
> > 
> > I can push xl_tot_len-validate-v1.patch, particularly given the testing result
> > you reported today.  I'm content for my part to stop there.
> 
> Okay, fine by me.  That's going to help with what I am doing in the
> other thread as I'd need to make a better difference between the OOM
> and the invalid cases for the allocation path.
> 
> You are planning for a backpatch to take care of the inconsistency,
> right?  The report has been on 15~ where the prefetching was
> introduced.  I'd be OK to just do that and not mess up with the stable
> branches more than necessary (aka ~14) if nobody complains, especially
> REL_11_STABLE planned to be EOL'd in the next minor cycle.

I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15.  I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
> On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
> > You are planning for a backpatch to take care of the inconsistency,
> > right?  The report has been on 15~ where the prefetching was
> > introduced.  I'd be OK to just do that and not mess up with the stable
> > branches more than necessary (aka ~14) if nobody complains, especially
> > REL_11_STABLE planned to be EOL'd in the next minor cycle.
>
> I recall earlier messages theorizing that it was just harder to hit in v14, so
> I'm disinclined to stop at v15.  I think the main choice is whether to stop at
> v11 (normal choice) or v12 (worry about breaking the last v11 point release).
> I don't have a strong opinion between those.

Thanks for working on this.

I wonder if we need a more explicit way to construct pages with the
right bits to reach interesting test cases and get full enough
coverage...  (Cf throwing SQL at the WAL to see if it sticks.)



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
>> On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
>> I recall earlier messages theorizing that it was just harder to hit in v14, so
>> I'm disinclined to stop at v15.  I think the main choice is whether to stop at
>> v11 (normal choice) or v12 (worry about breaking the last v11 point release).
>> I don't have a strong opinion between those.

Okay.  I wouldn't be inclined to patch v11 for that, FWIW, as this
code path is touched by recovery and more.  At least it does not seem
worth taking any risk compared to the potential gain.

> Thanks for working on this.
>
> I wonder if we need a more explicit way to construct pages with the
> right bits to reach interesting test cases and get full enough
> coverage...  (Cf throwing SQL at the WAL to see if it sticks.)

You mean SQL functions that write an arbitrary set of bytes at a given
LSN, to trigger some expected behavior on a follow-up crash recovery?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
>> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
>>> I recall earlier messages theorizing that it was just harder to hit in v14, so
>>> I'm disinclined to stop at v15.  I think the main choice is whether to stop at
>>> v11 (normal choice) or v12 (worry about breaking the last v11 point release).
>>> I don't have a strong opinion between those.

> Okay.  I wouldn't be inclined to patch v11 for that, FWIW, as this
> code path is touched by recovery and more.  At least it does not seem
> worth taking any risk compared to the potential gain.

That was my gut reaction too -- risk/reward seems not great for the
last v11 release.  (I've been burned a couple of times now by putting
can't-fix-it-anymore bugs into the final release of a branch, so maybe
I'm just being overly paranoid.  But it's something to worry about.)

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Fri, Aug 11, 2023 at 08:32:00PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
> >> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
> >>> I recall earlier messages theorizing that it was just harder to hit in v14, so
> >>> I'm disinclined to stop at v15.  I think the main choice is whether to stop at
> >>> v11 (normal choice) or v12 (worry about breaking the last v11 point release).
> >>> I don't have a strong opinion between those.
> 
> > Okay.  I wouldn't be inclined to patch v11 for that, FWIW, as this
> > code path is touched by recovery and more.  At least it does not seem 
> > worth taking any risk compared to the potential gain.
> 
> That was my gut reaction too -- risk/reward seems not great for the
> last v11 release.  (I've been burned a couple of times now by putting
> can't-fix-it-anymore bugs into the final release of a branch, so maybe
> I'm just being overly paranoid.  But it's something to worry about.)

Okay, v12+ would be fine with me.  Alas, a closer look at v1 found two
defects, one with severity far greater than the bug v1 aimed to fix:


===== 1. For oversized records, it changed startup FATAL to silent data loss

This affects only supported branches (v15-).  In v16+, commit 8fcb32d
introduced and checked XLogRecordMaxSize.  But in v15 with patch v1, redo of
pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 1000))
just ends at the oversized record:

2523581 2023-08-12 17:17:39.512 GMT LOG:  redo starts at 0/41F155B8
2523581 2023-08-12 17:17:39.512 GMT LOG:  record length 1073740879 at 0/41F155F0 too long
2523581 2023-08-12 17:17:39.512 GMT LOG:  redo done at 0/41F155B8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

Any user could call pg_logical_emit_message() to silently terminate the WAL
stream, which is far worse than the original bug.  So far, I'm seeing one way
to clearly fix $SUBJECT without that harm.  When a record header spans a page
boundary, read the next page to reassemble the header.  If
!ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
record.  A side benefit would be avoiding useless large allocations (1GB
backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
allocations.  Thoughts?  For invalid-length records in v16+, since every such
record is end-of-wal or corruption, those versions could skip the CRC.

As an alternative, zeroing recycled pages could remove "most of" the spurious
errors without adding silent data loss.  I considered another alternative of
back-patching XLogRecordMaxSize and proceeding with a patch like v1, but that
wouldn't help with records in existing WAL archives.  Of course, we could also
choose to leave $SUBJECT unfixed in v15-.


===== 2. DecodeXLogRecordRequiredSpace() is the wrong ceiling

v1 treats DecodeXLogRecordRequiredSpace(L) (=L+2445 here) as the max needed,
but allocate_recordbuf(state, L) allocates up to L+8192.  If the uninitialized
length happened to contain an integer between about 1<<30-8192 and 1<<30-2445,
one still got e.g. "FATAL:  invalid memory alloc request size 1073741824".
Since v16+ has XLogRecordMaxSize, testing XLogRecordMaxSize avoids the problem
and should suffice:

===
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -640,7 +640,6 @@ restart:
                               (uint32) SizeOfXLogRecord, total_len);
         goto err;
     }
-#ifndef FRONTEND
 
     /*
      * We may be looking at a random uint32 read from a recycled segment.  For
@@ -651,13 +650,12 @@ restart:
      * the allocation may succeed but record checks are going to fail so this
      * would be short-lived.
      */
-    if (!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))
+    if (total_len > XLogRecordMaxSize)
     {
         report_invalid_record(state, "record length %u at %X/%X too long",
                               total_len, LSN_FORMAT_ARGS(RecPtr));
         goto err;
     }
-#endif
 
     /*
      * If the whole record header is on this page, validate it immediately.
===

allocate_recordbuf()'s addend is not critical; it's just to "avoid useless
small increases".  v15- should avoid the problem like this:

   ===
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 969bcc3..1bac303 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -201,4 +201,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
     newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
     newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+#ifndef FRONTEND
+    if (!AllocSizeIsValid(newSize))
+        newSize = reclength;
+#endif
 
     if (state->readRecordBuf)
  ===

In v15, one can reach this case with pg_logical_emit_message(false, '_',
repeat('x', 1024 * 1024 * 1024 - 4000)).  On v16+, one might reach this with
unlucky trailing garbage, but XLogRecordMaxSize prevents reaching it through
pg_logical_emit_message().


Thanks,
nm



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
> Any user could call pg_logical_emit_message() to silently terminate the WAL
> stream, which is far worse than the original bug.  So far, I'm seeing one way
> to clearly fix $SUBJECT without that harm.  When a record header spans a page
> boundary, read the next page to reassemble the header.  If
> !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
> Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
> invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
> record.  A side benefit would be avoiding useless large allocations (1GB
> backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
> xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
> allocations.  Thoughts?  For invalid-length records in v16+, since every such
> record is end-of-wal or corruption, those versions could skip the CRC.

That sounds quite strong.  But... why couldn't the existing
xlp_rem_len cross-check protect us from this failure mode?  If we
could defer the allocation until after that check (and the usual
ValidXLogRecordHeader() check), I think we'd know that we're really
looking at a size that was actually written in both pages (which must
also have survived xlp_pageaddr check), no?



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
> On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
> > Any user could call pg_logical_emit_message() to silently terminate the WAL
> > stream, which is far worse than the original bug.  So far, I'm seeing one way
> > to clearly fix $SUBJECT without that harm.  When a record header spans a page
> > boundary, read the next page to reassemble the header.  If
> > !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
> > Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
> > invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
> > record.  A side benefit would be avoiding useless large allocations (1GB
> > backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
> > xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
> > allocations.  Thoughts?  For invalid-length records in v16+, since every such
> > record is end-of-wal or corruption, those versions could skip the CRC.
> 
> That sounds quite strong.  But... why couldn't the existing
> xlp_rem_len cross-check protect us from this failure mode?  If we
> could defer the allocation until after that check (and the usual
> ValidXLogRecordHeader() check), I think we'd know that we're really
> looking at a size that was actually written in both pages (which must
> also have survived xlp_pageaddr check), no?

Hmm, I think that is right.  A coincidental match of the 32-bit CRC is more
probable than having all those fields appear valid by coincidence, especially
xlp_pageaddr.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Aug 12, 2023 at 08:30:34PM -0700, Noah Misch wrote:
> On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote:
>> On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote:
>>> Any user could call pg_logical_emit_message() to silently terminate the WAL
>>> stream, which is far worse than the original bug.  So far, I'm seeing one way
>>> to clearly fix $SUBJECT without that harm.  When a record header spans a page
>>> boundary, read the next page to reassemble the header.  If
>>> !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL.
>>> Otherwise, read the whole record in chunks, calculating CRC.  If CRC is
>>> invalid, treat as end of WAL.  Otherwise, ereport(FATAL) for the oversized
>>> record.  A side benefit would be avoiding useless large allocations (1GB
>>> backend, 4GB frontend) as discussed upthread.  May as well do the xl_rmid and
>>> xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1
>>> allocations.  Thoughts?  For invalid-length records in v16+, since every such
>>> record is end-of-wal or corruption, those versions could skip the CRC.
>>
>> That sounds quite strong.  But... why couldn't the existing
>> xlp_rem_len cross-check protect us from this failure mode?  If we
>> could defer the allocation until after that check (and the usual
>> ValidXLogRecordHeader() check), I think we'd know that we're really
>> looking at a size that was actually written in both pages (which must
>> also have survived xlp_pageaddr check), no?

+1 for relying on xlp_rem_len for that.

> Hmm, I think that is right.  A coincidental match of the 32-bit CRC is more
> probable than having all those fields appear valid by coincidence, especially
> xlp_pageaddr.

Hmm.  [.. thinks ..]

It seems to me that we should try to also finish the header validation
before attempting XLogReadRecordAlloc() on the total_len as well?  It
looks like the end result would be to move the first ReadPageInternal
done for the header with all its cross-page checks before
XLogReadRecordAlloc().  That should remove the need of having
gotheader from v1.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
> It seems to me that we should try to also finish the header validation
> before attempting XLogReadRecordAlloc() on the total_len as well?  It
> looks like the end result would be to move the first ReadPageInternal
> done for the header with all its cross-page checks before
> XLogReadRecordAlloc().  That should remove the need of having
> gotheader from v1.

Yeah.  Trying out an idea for how to untangle that...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Aug 14, 2023 at 3:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
> > It seems to me that we should try to also finish the header validation
> > before attempting XLogReadRecordAlloc() on the total_len as well?  It
> > looks like the end result would be to move the first ReadPageInternal
> > done for the header with all its cross-page checks before
> > XLogReadRecordAlloc().  That should remove the need of having
> > gotheader from v1.
>
> Yeah.  Trying out an idea for how to untangle that...

Something like this... patches for 12 and master attached.  Needs a
lot more testing.  I really want to figure out how to get
deterministic tests...

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Aug 14, 2023 at 10:28:54PM +1200, Thomas Munro wrote:
> Something like this... patches for 12 and master attached.  Needs a
> lot more testing.  I really want to figure out how to get
> deterministic tests...

Interesting, thanks for the quick patch!

    /*
     * Try to find space to decode this record, if we can do so without
     * calling palloc.  If we can't, we'll try again below after we've
     * validated that total_len isn't garbage bytes from a recycled WAL page.
     */
    decoded = XLogReadRecordAlloc(state,
                                  total_len,
                                  false /* allow_oversized */ );

The patch relies on total_len before the header validation is
completed, meaning that this value of total_len could be invalid
because it comes from the partially-read header...  Oh wait, that's
actually OK because an oversized allocation is not allowed yet and the
decode buffer cannot hold more than its fixed size?
Perhaps this comment should be updated to mention that total_len can be
garbage, but that we don't care because we don't allocate anything that
the decode buffer cannot hold.

 #ifndef FRONTEND

-    /*
-     * Note that in much unlucky circumstances, the random data read from a
-     * recycled segment can cause this routine to be called with a size
-     * causing a hard failure at allocation.  For a standby, this would cause
-     * the instance to stop suddenly with a hard failure, preventing it to
-     * retry fetching WAL from one of its sources which could allow it to move
-     * on with replay without a manual restart. If the data comes from a past
-     * recycled segment and is still valid, then the allocation may succeed
-     * but record checks are going to fail so this would be short-lived.  If
-     * the allocation fails because of a memory shortage, then this is not a
-     * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
-     */
     if (!AllocSizeIsValid(newSize))
         return false;

Wouldn't it be OK to drop this check entirely?  I'm fine with keeping it as
a safety measure, but it should not be necessary now that this gets
called only once the header is validated?

Regarding the tests, I have been using this formula to produce the
number of bytes until the next page boundary:
select setting::int - ((pg_current_wal_lsn() - '0/0') % setting::int)
  from pg_settings where name = 'wal_block_size';

Using pg_logical_emit_message() non-transactionally with empty prefix and
message strings generates records of 56 bytes, so I was thinking about the
following:
- Generate a bunch of small records with pg_logical_emit_message(), or
records based on the threshold with the previous formula.
- Loop until we reach a page limit, at 24 bytes (?).
- Generate one last record to cut through.
- Stop the node in immediate mode.
- Write some garbage bytes on the last page generated to emulate the
recycled contents and an allocation
- Start the node, which should be able to startup.
With wal_level = minimal, autovacuum = off and a large
checkpoint_timeout, any other records are minimized.  That's fancy,
though.
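
For what it's worth, the filling part could be driven from the test roughly
like this (untested sketch; the 56-byte base record size and the 24-byte
threshold above are taken on faith, and $node is the test's node object):

    # Hypothetical loop: emit empty logical messages until the insert
    # position is close to a page boundary, then cut through with one
    # larger record.  Capped so it cannot loop forever.
    my $blcksz = $node->safe_psql('postgres',
        q{SELECT setting FROM pg_settings WHERE name = 'wal_block_size'});
    foreach my $i (1 .. 1000)
    {
        my $to_boundary = $node->safe_psql('postgres',
            qq{SELECT $blcksz - ((pg_current_wal_insert_lsn() - '0/0') % $blcksz)});
        last if $to_boundary <= 24;    # close enough to the page limit
        $node->safe_psql('postgres',
            q{SELECT pg_logical_emit_message(false, '', '')});
    }
    # One last record to cut through the boundary.
    $node->safe_psql('postgres',
        q{SELECT pg_logical_emit_message(false, '', repeat('a', 2048))});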

Do you think that this could work reliably?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Aug 15, 2023 at 2:05 PM Michael Paquier <michael@paquier.xyz> wrote:
>     /*
>      * Try to find space to decode this record, if we can do so without
>      * calling palloc.  If we can't, we'll try again below after we've
>      * validated that total_len isn't garbage bytes from a recycled WAL page.
>      */
>     decoded = XLogReadRecordAlloc(state,
>                                   total_len,
>                                   false /* allow_oversized */ );
>
> The patch relies on total_len before the header validation is
> completed, meaning that this value of total_len could be invalid
> because it comes from the partially-read header..  Oh wait, that's
> actually OK because an oversized allocation is not allowed yet and the
> decode buffer would not be able to store more than what it can?
> Perhaps this comment should be updated to tell something like, adding
> that total_len can be garbage, but we don't care because we don't
> allocate anything that the decode buffer cannot hold.

Yeah.

>  #ifndef FRONTEND
>
> -       /*
> -        * Note that in much unlucky circumstances, the random data read from a
> -        * recycled segment can cause this routine to be called with a size
> -        * causing a hard failure at allocation.  For a standby, this would cause
> -        * the instance to stop suddenly with a hard failure, preventing it to
> -        * retry fetching WAL from one of its sources which could allow it to move
> -        * on with replay without a manual restart. If the data comes from a past
> -        * recycled segment and is still valid, then the allocation may succeed
> -        * but record checks are going to fail so this would be short-lived.  If
> -        * the allocation fails because of a memory shortage, then this is not a
> -        * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
> -        */
>         if (!AllocSizeIsValid(newSize))
>                 return false;
>
> Wouldn't it be OK to drop entirely this check?  I'm fine to keep it as
> a safety measure, but it should not be necessary now that it gets
> called only once the header is validated?

Yeah, now we're in "shouldn't happen" territory, and I'm not sure.

> Regarding the tests, I have been using this formula to produce the
> number of bytes until the next page boundary:
> select setting::int - ((pg_current_wal_lsn() - '0/0') % setting::int)
>   from pg_settings where name = 'wal_block_size';
>
> Using pg_logical_emit_message() non-transactional with an empty set of
> strings generates records of 56 bytes, so I was thinking about the
> following:
> - Generate a bunch of small records with pg_logical_emit_message(), or
> records based on the threshold with the previous formula.
> - Loop until we reach a page limit, at 24 bytes (?).
> - Generate one last record to cut through.
> - Stop the node in immediate mode.
> - Write some garbage bytes on the last page generated to emulate the
> recycled contents and an allocation
> - Start the node, which should be able to startup.
> With wal_level = minimal, autovacuum = off and a large
> checkpoint_timeout, any other records are minimized.  That's fancy,
> though.
>
> Do you think that this could work reliably?

Yeah, I think that sounds quite promising, and funnily enough I was
just now working on some Perl code that appends controlled junk to the
WAL in a test like that so we can try to hit all the error paths...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Aug 15, 2023 at 02:36:10PM +1200, Thomas Munro wrote:
> Yeah, I think that sounds quite promising, and funnily enough I was
> just now working on some Perl code that appends controlled junk to the
> WAL in a test like that so we can try to hit all the error paths...

I am pretty sure that adding some junk in a controlled manner is the
only cheap and rather-reliable way to get something..

Not sure if that will help, but I was playing with some stuff along
the lines of:
-- Store the length up to page boundary.
select setting::int - ((pg_current_wal_insert_lsn() - '0/0') %
  setting::int) as boundary from pg_settings where name = 'wal_block_size'
  \gset
-- Generate a record up to the boundary (56 bytes for the base size of the
-- record), stopping 12 bytes before the end of the page.
select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));

Then by injecting some FF's on the last page written and forcing
replay I am able to force some of the error code paths, so I guess
that's what you were basically doing?
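
In TAP form that would amount to something like this rough sketch
($node being the test's cluster object; untested as written):

# Sketch: pad WAL up to just before the page boundary, stop the server
# immediately, then overwrite the tail of the last page with 0xFF bytes.
my $boundary = $node->safe_psql('postgres',
	"SELECT setting::int - ((pg_current_wal_insert_lsn() - '0/0') % setting::int)
	 FROM pg_settings WHERE name = 'wal_block_size'");
$node->safe_psql('postgres',
	"SELECT pg_logical_emit_message(false, '', repeat('a', $boundary - 56 - 12))");
$node->stop('immediate');
# ... overwrite the remaining bytes of the current page with "\xff",
# then restart the node and check that replay finds the end of WAL.
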
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Aug 15, 2023 at 12:00:30PM +0900, Michael Paquier wrote:
> Not sure if that will help, but I was playing with something along
> these lines:
> -- Store the length up to page boundary.
> select setting::int - ((pg_current_wal_insert_lsn() - '0/0') %
>   setting::int) as boundary from pg_settings where name = 'wal_block_size'
>   \gset
> -- Generate record up to boundary (56 bytes for base size of the record,
> -- stop at 12 bytes before the end of the page.
> select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));
>
> Then by injecting some FF's on the last page written and forcing
> replay I am able to force some of the error code paths, so I guess
> that's what you were basically doing?

I've been spending some extra time on this one and hacked a TAP test
that reliably reproduces the original issue, using a message similar
to what I mentioned in my previous messages.  I guess that we could
use something like that:
2023-08-15 15:07:03.790 JST [8729] LOG:  redo starts at 0/14EA428
2023-08-15 15:07:03.790 JST [8729] FATAL:  invalid memory alloc request size 4294969740
2023-08-15 15:07:03.791 JST [8726] LOG:  startup process (PID 8729) exited with exit code 1

The proposed patches pass the test, HEAD does not.  We may want to do
more with page boundaries, and more error patterns, but the idea looks
worth exploring more.  At least this can be used to validate patches.

I've noticed while hacking the test that we don't do an XLogFlush()
after inserting the message's record, so we may lose it on crash.
That makes the test unstable unless an extra record is added after
the logical messages.  The attached patch forces that for the sake of
the test, but I'm spawning a different thread as losing this data
looks like a bug to me.
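
For the test itself the workaround is simply to chase the
non-transactional message with a tiny transactional one, whose COMMIT
forces the flush, roughly:

# Workaround sketch: a non-transactional message is not flushed on its
# own, so follow it with a small transactional one whose COMMIT forces
# the flush of everything before it (with synchronous_commit = on).
$node->safe_psql('postgres',
	"SELECT pg_logical_emit_message(false, '', repeat('a', 100))");
$node->safe_psql('postgres',
	"SELECT pg_logical_emit_message(true, '', 'flush')");
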
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Aug 15, 2023 at 6:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> I've been spending some extra time on this one and hacked a TAP test
> that reliably reproduces the original issue, using a message similar
> to what I mentioned in my previous messages.

Nice.

I hacked on this idea for quite a long time yesterday and today and
came up with a set of tests for the main end-of-WAL conditions:

▶ 1/1 - xl_tot_len zero                                  OK
▶ 1/1 - xl_tot_len short                                 OK
▶ 1/1 - xl_prev bad                                      OK
▶ 1/1 - xl_crc bad                                       OK
▶ 1/1 - xlp_magic zero                                   OK
▶ 1/1 - xlp_magic bad                                    OK
▶ 1/1 - xlp_pageaddr bad                                 OK
▶ 1/1 - xlp_info bad                                     OK
▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD           OK
▶ 1/1 - xlp_rem_len bad                                  OK
▶ 1/1 - xlp_magic zero (split record header)             OK
▶ 1/1 - xlp_pageaddr bad (split record header)           OK
▶ 1/1 - xlp_rem_len bad (split record header)            OK
1/1 postgresql:recovery / recovery/038_end_of_wal        OK
    5.79s   13 subtests passed

It took me a while to come up with a workable way to get into the
record-header-splitting zone.  Based on some of your clues about
flushing, I eventually realised I needed transactional messages, and I
built a kind of self-calibrating Rube Goldberg function around that.
It's terrible, and I'm sure we can do better.

I wonder what people think about putting internal details of the WAL
format into a Perl test like this.  Obviously it requires maintenance,
since it knows the size and layout of a few things.  I guess it'd be
allowed to fish a couple of those numbers out of the source.

Work in progress...  I'm sure more useful checks could be added.  One
thing that occurred to me while thinking about all this is that the
'treat malloc failure as end of WAL' thing you highlighted in another
thread is indeed completely bananas -- I didn't go digging, but
perhaps it was an earlier solution to the very same garbage xl_tot_len
problem, before 70b4f82a4b5 and now this xl_rem_len-based solution?

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Wed, Aug 16, 2023 at 03:29:49PM +1200, Thomas Munro wrote:
> I hacked on this idea for quite a long time yesterday and today and
> came up with a set of tests for the main end-of-WAL conditions:
>
> ▶ 1/1 - xl_tot_len zero                                  OK
> ▶ 1/1 - xl_tot_len short                                 OK
> ▶ 1/1 - xl_prev bad                                      OK
> ▶ 1/1 - xl_crc bad                                       OK
> ▶ 1/1 - xlp_magic zero                                   OK
> ▶ 1/1 - xlp_magic bad                                    OK
> ▶ 1/1 - xlp_pageaddr bad                                 OK
> ▶ 1/1 - xlp_info bad                                     OK
> ▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD           OK
> ▶ 1/1 - xlp_rem_len bad                                  OK
> ▶ 1/1 - xlp_magic zero (split record header)             OK
> ▶ 1/1 - xlp_pageaddr bad (split record header)           OK
> ▶ 1/1 - xlp_rem_len bad (split record header)            OK
> 1/1 postgresql:recovery / recovery/038_end_of_wal        OK
>     5.79s   13 subtests passed

Nice.

> It took me a while to come up with a workable way to get into the
> record-header-splitting zone.  Based on some of your clues about
> flushing, I eventually realised I needed transactional messages, and I
> built a kind of self-calibrating Rube Goldberg function around that.
> It's terrible, and I'm sure we can do better.

So that's advance_to_record_splitting_zone(), with the idea to loop
across multiple records.  The use of transactional messages ensures
the flush and the stability of the test when the server is stopped,
but it makes the records a bit longer than the non-transactional
messages.

> I wonder what people think about putting internal details of the WAL
> format into a Perl test like this.  Obviously it requires maintenance,
> since it knows the size and layout of a few things.  I guess it'd be
> allowed to fish a couple of those numbers out of the source.

The format of the WAL header does not change a lot across the years,
so I'd be OK with that.  If that's an issue, the scope of the test
could always be restricted a bit.

> Work in progress...  I'm sure more useful checks could be added.  One
> thing that occurred to me while thinking about all this is that the
> 'treat malloc failure as end of WAL' thing you highlighted in another
> thread is indeed completely bananas -- I didn't go digging, but
> perhaps it was an earlier solution to the very same garbage xl_tot_len
> problem, before 70b4f82a4b5 and now this xl_rem_len-based solution?

It is bananas.  Still, the problem on the other thread is similar to
what's here, and different at the same time because it can happen with
a valid record if the host is under memory pressure, and we should
still allow a standby to loop and continue in this case.  Perhaps also
a startup process doing only crash recovery.  That problem depends on
this one, which is why this one needs to happen first: we need to
clarify the boundary between a real OOM and garbage indicating the
end of WAL.

+my $RECORD_HEADER_SIZE = 24;
+my $XLP_PAGE_MAGIC = 0xd113;
+my $XLP_FIRST_IS_CONTRECORD = 0x1;

This could use something like check_pg_config() to find the header
contents for the last two, but there's no easy way to extract the
size of the header struct, is there?  Or perhaps perl has a tool for
that..  Everything I can find on the matter says to use pack() to make
the size estimations predictable.
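
At least the 24 bytes of the record header could be spelled out with a
pack() template rather than a bare constant, something like this
(field layout as I read it from xlogrecord.h; the padding has to be
written by hand):

# Sketch: derive SizeOfXLogRecord from a pack() template instead of
# hardcoding 24.  Note that pack() knows nothing about C struct padding,
# so the two pad bytes must be spelled out, and 'Q' needs a perl built
# with 64-bit integer support.
my $RECORD_TEMPLATE = 'LLQCCxxL';    # xl_tot_len, xl_xid, xl_prev,
                                     # xl_info, xl_rmid, 2-byte pad, xl_crc
my $RECORD_HEADER_SIZE = length(pack($RECORD_TEMPLATE, (0) x 6));    # 24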

+   my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
+   while ($page_offset >= $WAL_BLOCK_SIZE - 2000)
+   {
+       $end_lsn = emit_message($node, 2000)

The hardcoded 2000 feels magic here, but I get that you just want to
get away from the page border, as well.  I'd be OK with more comments.

+sub lsn_to_segment_and_offset
+{
+   my $lsn = shift;
+   return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}

Now the backend also has pg_walfile_name_offset().  I tend to rely on
the backend logic as much as I can for tests.
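
For example, while the server is still running the backend can do the
split for us (rough sketch):

# Sketch: let the backend compute the segment file name and offset,
# instead of reimplementing the LSN arithmetic in perl.
my ($walfile, $offset) = split /\|/,
	$node->safe_psql('postgres',
		'SELECT * FROM pg_walfile_name_offset(pg_current_wal_insert_lsn())');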

+sub start_of_page

Not sure this one is needed, only being called in
start_of_next_page().

+sub get_insert_lsn

Name is confusing with Cluster::lsn('insert') available, perhaps this
should be named get_insert_lsn_in_bytes.

As a whole, I find that pretty cool.  Not sure if I would backpatch
the test, so my opinion would be to use that on HEAD and let it
stabilize there.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Wed, Aug 16, 2023 at 03:17:58PM +0900, Michael Paquier wrote:
> As a whole, I find that pretty cool.  Not sure if I would backpatch
> the test, so my opinion would be to use that on HEAD and let it
> stabilize there.

Thomas, is there anything I can do to help with this patch and make it
move on?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Fri, Aug 18, 2023 at 12:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Aug 16, 2023 at 03:17:58PM +0900, Michael Paquier wrote:
> > As a whole, I find that pretty cool.  Not sure if I would backpatch
> > the test, so my opinion would be to use that on HEAD and let it
> > stabilize there.
>
> Thomas, is there anything I can do to help with this patch and make it
> move on?

Give me a couple of days and I'll look into how back-patchable the
tests can be made, and see what else we can test.  Perhaps it's not
strictly necessary to back-patch the fix further than 15, but I think
we should definitely consider it, and I don't like the idea of not
having the tests accompanying the change.

If you have any ideas about how to write a more efficient version of
advance_to_record_splitting_zone() (or I guess that should really be
advance_to_record_header_splitting_zone()), and generally how to make
the perl better, and how to get those constants we need from the
source or binaries, then I'm all ears.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Fri, Aug 18, 2023 at 02:30:31PM +1200, Thomas Munro wrote:
> Give me a couple of days and I'll look into how back-patchable the
> tests can be made, and see what else we can test.  Perhaps it's not
> strictly necessary to back-patch the fix further than 15, but I think
> we should definitely consider it, and I don't like the idea of not
> having the tests accompanying the change.

Okay, cool.

> If you have any ideas about how to write a more efficient version of
> advance_to_record_splitting_zone() (or I guess that should really be
> advance_to_record_header_splitting_zone()), and generally how to make
> the perl better,

Yeah, I think that there are ways to reduce the number of records
generated.

> and how to get those constants we need from the
> source or binaries, then I'm all ears.

A new perl routine able to run pg_config --includedir and scan a
defined header with a regexp would do the job for the constants.  If
you're interested, I can code that up.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Fri, Aug 18, 2023 at 01:55:32PM +0900, Michael Paquier wrote:
> A new perl routine able to run pg_config --includedir and scan a
> defined header with a regexp would do the job for the constants.  If
> you're interested, I can code that up.

Please find attached a patch, applying on top of the TAP test, that
implements logic to extract the wanted values from header files.
This is Utils::scan_header(); for example:
my @scan_result = scan_header('server/access/xlog_internal.h',
    '#define\s+XLOG_PAGE_MAGIC\s+(\w+)');
my $XLP_PAGE_MAGIC = hex($scan_result[0]);
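
The routine itself boils down to a per-line regexp scan of the
installed headers, roughly like this (a simplified sketch of the idea,
not the attached code verbatim):

# Simplified sketch of scan_header(): run pg_config to locate the
# installed headers, then scan the requested file line by line and
# return the regexp's capture groups from the first matching line.
sub scan_header
{
	my ($header, $regexp) = @_;
	my $incdir = `pg_config --includedir`;
	chomp $incdir;
	open my $fh, '<', "$incdir/$header"
	  or die "could not open $incdir/$header: $!";
	while (my $line = <$fh>)
	{
		if (my @match = $line =~ /^$regexp/)
		{
			close $fh;
			return @match;
		}
	}
	close $fh;
	die "could not find match in header $header";
}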

This uses group capture so that it is possible to get more than one
field.  Guessing the size of the C structure is possible, but I've
found these options to be a bit awkward implementation-wise:
- Config would not help for the Postgres-specific declarations, like
TimeLineId, etc.
- I saw something that was specific to GCC..  cannot put my finger on
it now.

Another possibility would be to store the size in a different variable
in xlog_internal.h, coupled with a StaticAssertDecl() to make sure
that it matches the real size.  While it has the same maintenance
cost as RECORD_HEADER_SIZE in the test itself, that catches a
mismatching size at compilation time rather than at test time.

I am not sure that I'll be able to do more on this topic this week, at
least that's some progress.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Aug 21, 2023 at 08:32:39AM +0900, Michael Paquier wrote:
> I am not sure that I'll be able to do more on this topic this week, at
> least that's some progress.

Some time later..

I have spent more time and thought on the whole test suite, finishing
with the attached 0003 that applies on top of your own patches.  I am
really looking forward to making this whole logic more robust, so that
WAL replay itself can be made more robust for the OOM/end-of-wal
detection on HEAD for standbys and crash recovery.

While looking at the whole, I have considered a few things that may
make the test cleaner, like:
- Calculating the segment name and its offset from the end_lsn of a
record directly from the backend, but it felt inelegant to pass the
($segment, $offset) pair through more subroutine layers rather than
just an LSN, so guessing the segment number and the offset while the
cluster is offline is OK by me.
- The TLI can be queried from the server rather than hardcoded.
- I've been thinking about bundling the tests of each sub-section in
their own subroutine, but that felt a bit awkward, particularly for
the part where we need a correct $prev_lsn in the record header
written to enforce other code paths.
- The test needs better documentation.  One of the things I kept
staring at is cross-checking pack() against the C structures it
depends on, so I have added more details there, explaining the whys
and the hows in more depth.

I have also looked again at the C code for a few hours, and still got
the impression that this is rather solid.  There are two things that
may be better:
- Document at the top of allocate_recordbuf() that this should never
be called with a length coming from a header until it is validated.
- Removing AllocSizeIsValid() for the non-FRONTEND path should be OK.

What do you think?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Sergei Kornilov
Date:
Hello
I get an error from the new test, caused by the new scan_header function in Test/Utils.pm.

I applied all 3 patches and use

./configure --prefix=/home/melkij/tmp/pgdev/inst --enable-cassert --enable-debug \
  CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" --enable-tap-tests

make -s world
make check-world

Then I see:

Test Summary Report
-------------------
t/038_end_of_wal.pl                (Wstat: 512 (exited 2) Tests: 0 Failed: 0)
  Non-zero exit status: 2
  Parse errors: No plan found in TAP output
  
Corresponding tmp_check/log/regress_log_038_end_of_wal show error:

No such file or directory at
/home/melkij/tmp/srcs/postgresql/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 725.
 

pg_config from the scan_header function returns
/home/melkij/tmp/srcs/postgresql/tmp_install/home/melkij/tmp/pgdev/inst/include, then the function tries to concatenate
server/access/xlog_internal.h, while the required file is in postgresql/server/access/xlog_internal.h under this
directory.

If I change the test like this for both scan_header, then it succeeds.

- my @scan_result = scan_header('server/access/xlog_internal.h',
+ my @scan_result = scan_header('postgresql/server/access/xlog_internal.h',

Is there something wrong with my setup?

regards, Sergei



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Sep 02, 2023 at 02:51:48PM +0300, Sergei Kornilov wrote:
> Corresponding tmp_check/log/regress_log_038_end_of_wal show error:
>
> No such file or directory at
> /home/melkij/tmp/srcs/postgresql/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 725.
>
> pg_config from scan_header function returns
> /home/melkij/tmp/srcs/postgresql/tmp_install/home/melkij/tmp/pgdev/inst/include
> , then the function tries to concatenate
> server/access/xlog_internal.h, while the required file is in
> postgresql/server/access/xlog_internal.h from this directory.
>
> If I change the test like this for both scan_header, then it succeeds.
>
> - my @scan_result = scan_header('server/access/xlog_internal.h',
> + my @scan_result = scan_header('postgresql/server/access/xlog_internal.h',
>
> Is there something wrong with my setup?

I think that's just a mistake on my part, as there should not even be
a need to specify "server/" for the header path.  If you
switch scan_header() in Utils.pm to use --includedir-server rather
than --includedir, does it work correctly for you?
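
In other words, something like this in Utils.pm, with callers then
passing 'access/xlog_internal.h' and no prefix (rough sketch):

# Sketch of the suggested change: resolve the server include directory
# directly, so no "server/" (or distribution-specific) prefix is needed.
my $incdir = `pg_config --includedir-server`;
chomp $incdir;
open my $fh, '<', "$incdir/access/xlog_internal.h"
  or die "could not open header: $!";
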
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sun, Sep 03, 2023 at 08:42:55AM +0900, Michael Paquier wrote:
> I think that's just a mistake on my part, as there should not even be
> a need to specify "server/" for the header path.  If you
> switch scan_header() in Utils.pm to use --includedir-server rather
> than --includedir, does it work correctly for you?

In short, I mean something like the v4 attached.  Feel free to use it
for your tests.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Sergei Kornilov
Date:
Hello

Thank you, this version passes check-world for me.

Looks good (but I'm certainly not a big expert on xlogreader), a solid set of tests for different cases.

regards, Sergei



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
Thanks for looking/testing, Sergei.  Thanks for the changes, Michael,
these all look good.  I've squashed them and added you as co-author.

A couple more small comment/text changes:

1.  In the place where we fail to allocate memory for an oversized
record, I copied the comment about treating that as a "bogus data"
condition.  I suspect that we will soon be converting that to a FATAL
error[1], and that'll need to be done in both places.

2.  In this version of the commit message I said we'd only back-patch
to 15 for now.  After sleeping on this for a week, I realised that the
reason I keep vacillating on that point is that I am not sure what
your plan is for the malloc-failure-means-end-of-wal policy ([1],
ancient code from 0ffe11abd3a).  If we're going to fix that in master
only but let sleeping dogs lie in the back-branches, then it becomes
less important to go back further than 15 with THIS patch.  But if you
want to be able to distinguish garbage from out-of-memory, and thereby
end-of-wal from a FATAL please-insert-more-RAM condition, I think
you'd really need this industrial strength validation in all affected
branches, and I'd have more work to do, right?  The weak validation we
are fixing here is the *real* underlying problem going back many
years, right?

I also wondered about strengthening the validation of various things
like redo begin/end LSNs etc in these tests.  But we can always
continue to improve all this later...

Here also is a version for 15 (and a CI run[2]), since we tweaked many
of the error messages between 15 and 16.

[1] https://www.postgresql.org/message-id/flat/ZMh/WV%2BCuknqePQQ%40paquier.xyz
[2] https://cirrus-ci.com/task/4533280897236992 (failed on some
unrelated pgbench test, reported in another thread)

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Sep 04, 2023 at 03:20:31PM +1200, Thomas Munro wrote:
> 1.  In the place where we fail to allocate memory for an oversized
> record, I copied the comment about treating that as a "bogus data"
> condition.  I suspect that we will soon be converting that to a FATAL
> error[1], and that'll need to be done in both places.

You mean for the two callers of XLogReadRecordAlloc(), even for the
case where !allow_oversized?  Using a FATAL on non-FRONTEND would be
the quickest fix, indeed, but there are arguments for standbys where we
could let these continue, as well.  That would be an improvement over
the always-FATAL on OOM, of course.

> 2.  In this version of the commit message I said we'd only back-patch
> to 15 for now.  After sleeping on this for a week, I realised that the
> reason I keep vacillating on that point is that I am not sure what
> your plan is for the malloc-failure-means-end-of-wal policy ([1],
> ancient code from 0ffe11abd3a).  If we're going to fix that in master
> only but let sleeping dogs lie in the back-branches, then it becomes
> less important to go back further than 15 with THIS patch.

For HEAD, my plan was to give more control to the callers of the WAL
reader, which is not something that could happen in back-branches
because it requires a few structural changes in the reporting the
reader is able to do.  That's the standby vs crash-recovery behavior,
where we can make the startup process not FATAL all the time because
it should just loop forever if it cannot process a record.  It does
not fix the root issue discussed here, of course, but it makes the OOM
handling more friendly in some replay cases.

> But if you
> want to be able to distinguish garbage from out-of-memory, and thereby
> end-of-wal from a FATAL please-insert-more-RAM condition, I think
> you'd really need this industrial strength validation in all affected
> branches, and I'd have more work to do, right?  The weak validation we
> are fixing here is the *real* underlying problem going back many
> years, right?

Getting the same validation checks for all the branches would be nice.
FATAL-ing on OOM to force recovery to happen again is a better option
than assuming that it is the end of recovery.  I am OK to provide
patches for all the branches for the sake of this thread, if that
helps.  Switching to a hard FATAL on OOM for the WAL reader in the
backend is backpatchable, but I'd rather consider that on a different
thread once the better checks for the record header are in place.

> I also wondered about strengthening the validation of various things
> like redo begin/end LSNs etc in these tests.  But we can always
> continue to improve all this later...

Agreed.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 4, 2023 at 3:54 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 04, 2023 at 03:20:31PM +1200, Thomas Munro wrote:
> > 1.  In the place where we fail to allocate memory for an oversized
> > record, I copied the comment about treating that as a "bogus data"
> > condition.  I suspect that we will soon be converting that to a FATAL
> > error[1], and that'll need to be done in both places.
>
> You mean for the two callers of XLogReadRecordAlloc(), even for the
> case where !allow_oversized?  Using a FATAL on non-FRONTEND would be
> the quickest fix, indeed, but there are arguments for standbys where we
> could let these continue, as well.  That would be an improvement over
> the always-FATAL on OOM, of course.

I just mean the two places where "bogus data" is mentioned in that v5 patch.

> > But if you
> > want to be able to distinguish garbage from out-of-memory, and thereby
> > end-of-wal from a FATAL please-insert-more-RAM condition, I think
> > you'd really need this industrial strength validation in all affected
> > branches, and I'd have more work to do, right?  The weak validation we
> > are fixing here is the *real* underlying problem going back many
> > years, right?
>
> Getting the same validation checks for all the branches would be nice.
> FATAL-ing on OOM to force recovery to happen again is a better option
> than assuming that it is the end of recovery.  I am OK to provide
> patches for all the branches for the sake of this thread, if that
> helps.  Switching to a hard FATAL on OOM for the WAL reader in the
> backend is backpatchable, but I'd rather consider that on a different
> thread once the better checks for the record header are in place.

OK, so it sounds like you want to go back to 12.  Let me see if I can
get this TAP test to work in 12... more tomorrow.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Sep 04, 2023 at 07:17:21PM +1200, Thomas Munro wrote:
> I just mean the two places where "bogus data" is mentioned in that v5 patch.

Got it, thanks.

> OK, so it sounds like you want to go back to 12.  Let me see if I can
> get this TAP test to work in 12... more tomorrow.

If you're restricted on time, I am OK to deep-dive into these branches
as well ;)
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 4, 2023 at 7:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 04, 2023 at 07:17:21PM +1200, Thomas Munro wrote:
> > OK, so it sounds like you want to go back to 12.  Let me see if I can
> > get this TAP test to work in 12... more tomorrow.
>
> If you're restricted on time, I am OK to deep-dive into these branches
> as well ;)

I am behind on several things, so if you want to, please go ahead!  I
already posted a 15 version and a master version (which should also
apply and work on 16), so we'd need 12, 13, 14.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Sep 04, 2023 at 07:43:59PM +1200, Thomas Munro wrote:
> I am behind on several things, so if you want to, please go ahead!  I
> already posted a 15 version and a master version (which should also
> apply and work on 16), so we'd need 12, 13, 14.

And here you go for all three branches, with all the tests
passing.  This change is complicated enough that it would be better to
get more eyes on the three patches for these specific branches.  Now,
this is much simpler since we don't have to worry about the
prefetching.

For these versions, it struck me that we *require* 6860198 so that the
header validation can happen correctly across page reads when doing
crash recovery.  I have added this change in the patches
attached for simplicity, but 6860198 should be applied first on 12~14
as an independent change.  I can do that as a first step, if you are
OK with that.

I haven't taken the time to rewrite the commit message yet for the
main patch ;p
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Sep 5, 2023 at 3:19 PM Michael Paquier <michael@paquier.xyz> wrote:
> And here you go for all the three branches, with all the tests
> passing.  This change is complicated enough that it would be better to
> get more eyes on the three patches for these specific branches.  Now,
> this is much simpler since we don't have to worry about the
> prefetching.

Thanks!  Yeah.

> For these versions, it struck me that we *require* 6860198 so that the
> header validation can happen correctly across page reads when doing
> crash recovery.  I have added this change in the patches
> attached for simplicity, but 6860198 should be applied first on 12~14
> as an independent change.  I can do that as a first step, if you are
> OK with that.

Ahhhh, that explains why back-patching was unexpectedly unsuccessful
when I tried it.  I have added Fujii-san to the CC list.  I don't see
any problem with that plan but it would be good to hear from him.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Sep 05, 2023 at 03:58:42PM +1200, Thomas Munro wrote:
> Ahhhh, that explains why back-patching was unexpectedly unsuccessful
> when I tried it.  I have added Fujii-san to the CC list.  I don't see
> any problem with that plan but it would be good to hear from him.

Right.  Let's wait for a bit and see if there are any comments on this
point.  With 16.0 in sight next week, I'd like to propose that we move
on with this issue once next week's tag is pushed.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Sep 05, 2023 at 01:13:16PM +0900, Michael Paquier wrote:
> Right.  Let's wait for a bit and see if there are any comments on this
> point.  With 16.0 in sight next week, I'd like to propose that we move
> on with this issue once next week's tag is pushed.

Hearing nothing, I have applied the backpatch of 0668719801 to 12~14.
Please find attached the remaining patches for all the stable
branches.

Thomas, would you prefer handling that yourself once 16.0 is tagged
with a last round of checks?  Or should I?  (Still need to edit the
commit message a bit..)
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Sep 12, 2023 at 1:01 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Sep 05, 2023 at 01:13:16PM +0900, Michael Paquier wrote:
> > Right.  Let's wait for a bit and see if there are any comments on this
> > point.  With 16.0 in sight next week, I'd like to propose that we move
> > on with this issue once next week's tag is pushed.
>
> Hearing nothing, I have applied the backpatch of 0668719801 to 12~14.
> Please find attached the remaining patches for all the stable
> branches.

Thanks!

> Thomas, would you prefer handling that yourself once 16.0 is tagged
> with a last round of checks?  Or should I?  (Still need to edit the
> commit message a bit..)

I'll do that (and update the message).  Thanks for all the help!



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Sep 12, 2023 at 02:08:09PM +1200, Thomas Munro wrote:
> I'll do that (and update the message).  Thanks for all the help!

Cool, thanks!
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Sep 12, 2023 at 2:12 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Sep 12, 2023 at 02:08:09PM +1200, Thomas Munro wrote:
> > I'll do that (and update the message).  Thanks for all the help!
>
> Cool, thanks!

I rebased (bumped the test number 038 -> 039), updated the commit
message[1], tested on a big endian machine (AIX/POWER), but then I
found there is something not quite right or stable with the WAL
positioning logic.  On one of my machines, but not another, the
back-patch to 12 fails in one of the tests unless I hack it to insert
a bit more WAL, and I think that's telling me that the Perl code to
control the insertion point is not quite bulletproof yet.
I'll need to dig into this some more on Monday.

[1] https://github.com/macdice/postgres/commits/fix-12



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Sep 16, 2023 at 10:07:30AM +1200, Thomas Munro wrote:
> I rebased (bumped the test number 038 -> 039), updated the commit
> message[1], tested on a big endian machine (AIX/POWER), but then I
> found there is something not quite right or stable with the WAL
> positioning logic.  On one of my machines, but not another, the
> back-patch to 12 fails in one of the tests unless I hack it to insert
> a bit more WAL, and I think that's telling me that Perl code to
> control the insertion point is not quite bullet proof enough yet.
> I'll need to dig into this some more on Monday.
>
> [1] https://github.com/macdice/postgres/commits/fix-12

Hmm.  What was the test that failed?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 16, 2023 at 12:03 PM Michael Paquier <michael@paquier.xyz> wrote:
> > [1] https://github.com/macdice/postgres/commits/fix-12
>
> Hmm.  What was the test that failed?

$ make -s -C src/test/recovery/ check PROVE_TESTS=t/039*
t/039_end_of_wal.pl .. 4/?
#   Failed test 'xlp_magic zero'
#   at t/039_end_of_wal.pl line 312.

not ok 5 - xlp_magic zero

Where the log should say "invalid magic number 0000" I see:

2023-09-16 12:13:07.331 NZST [156812] LOG:  record with incorrect
prev-link 0/16B60C0 at 0/16B6120

It has to do with initial WAL position after initdb, because I get
this only on Debian, on REL_12_STABLE (with the commit listed above on
my public fix-12 branch) and only with --with-icu, but not without it,
and I can't repro it on my other local OSes.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Sep 16, 2023 at 12:20:36PM +1200, Thomas Munro wrote:
> It has to do with initial WAL position after initdb, because I get
> this only on Debian, on REL_12_STABLE (with the commit listed above on
> my public fix-12 branch) and only with --with-icu, but not without it,
> and I can't repro it on my other local OSes.

That's close to my dev environment, with SID and x64, and the only
difference between your branch and my local branch is a rename of the TAP
file, and I cannot see the problem.

Some random thoughts: perhaps reducing $page_threshold would help with
more records inserted to get to the limit of a page, reducing noise?
Or perhaps skipping a few pages would?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Alexander Lakhin
Date:
Hello,

16.09.2023 03:20, Thomas Munro wrote:
> On Sat, Sep 16, 2023 at 12:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>>> [1] https://github.com/macdice/postgres/commits/fix-12
>> Hmm.  What was the test that failed?
> $ make -s -C src/test/recovery/ check PROVE_TESTS=t/039*
> t/039_end_of_wal.pl .. 4/?
> #   Failed test 'xlp_magic zero'
> #   at t/039_end_of_wal.pl line 312.
>
> not ok 5 - xlp_magic zero
>
> Where the log should say "invalid magic number 0000" I see:
>
> 2023-09-16 12:13:07.331 NZST [156812] LOG:  record with incorrect
> prev-link 0/16B60C0 at 0/16B6120
>
> It has to do with initial WAL position after initdb, because I get
> this only on Debian, on REL_12_STABLE (with the commit listed above on
> my public fix-12 branch) and only with --with-icu, but not without it,
> and I can't repro it on my other local OSes.

I tried to reproduce the failure on Debian 9, 10, and 11, but have not succeeded yet.
Though I got another error on Debian 9:
t/039_end_of_wal.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run
...
cat src/test/recovery/tmp_check/log/regress_log_039_end_of_wal
could not find match in header access/xlog_internal.h

It looks like the construction "@{^CAPTURE}" used in scan_server_header()
is not supported by Perl 5.24, which is included in Debian stretch:
https://perldoc.perl.org/variables/@%7B%5ECAPTURE%7D
I replaced it with
@match = ($1);
and that worked for me.

Also, I observed that "wal_log_hints = on" in extra.config, which I use via
"TEMP_CONFIG=extra.config make check-world" makes the test fail too, though
check-world passes fine without the new test.
Maybe that's not an issue, and probably there are other parameters, which
might affect this test, but I'm somewhat confused by the fact that only this
test breaks with it.

Best regards,
Alexander



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 16, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> It looks like the construction "@{^CAPTURE}" used in scan_server_header()
> is not supported by Perl 5.24, which is included in Debian stretch:
> https://perldoc.perl.org/variables/@%7B%5ECAPTURE%7D
> I replaced it with
> @match = ($1);
> and that worked for me.

Thanks for catching that!  I see that @{^CAPTURE} was added to Perl
5.25.7, but we document Perl 5.14 or later as the requirement
currently, and for REL_12_STABLE it's Perl 5.8.3.  That would
certainly have turned a few build farm fossils red.

Your replacement only returns the first captured group.  I don't know
Perl, but I think we can just assign the whole array to @match
directly, to maintain the semantics that Michael wanted, like so:

-               if ($line =~ /^$regexp/)
+               if (@match = $line =~ /^$regexp/)
                {
-                       # Found match, so store all the results.
-                       @match = @{^CAPTURE};
                        last;
                }

Better Perl welcome.
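
For example, this should work on any perl we support and still return
every capture group:

# List-context match assignment returns all capture groups, with no need
# for @{^CAPTURE} (which only exists in perl >= 5.25.7).
my $line = '#define XLOG_PAGE_MAGIC 0xD113';
my @match = $line =~ /^#define\s+XLOG_PAGE_MAGIC\s+(\w+)/;
print "$match[0]\n";    # prints "0xD113"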

Looking into your observation about wal_log_hints...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 16, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Also, I observed that "wal_log_hints = on" in extra.config, which I use via
> "TEMP_CONFIG=extra.config make check-world" makes the test fail too, though
> check-world passes fine without the new test.
> Maybe that's not an issue, and probably there are other parameters, which
> might affect this test, but I'm somewhat confused by the fact that only this
> test breaks with it.

The problem was that the Perl functions advance_XXX() would sometimes
see an "unflushed" insert position caused by hint records and other
incidental stuff, and then we'd write our contrived header at a
disconnected location after a bunch of (unwritten) zeroes.  The
solution in this version is to start each test with
emit_message($node, 0), which flushes a small record.
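
The gap is easy to observe directly, e.g. with something like this
right after some unflushed activity (sketch):

# Sketch: the insert position can run ahead of the flush position, which
# is exactly what the advance_XXX() functions were tripping over.
my ($insert, $flush) = split /\|/,
	$node->safe_psql('postgres',
		"SELECT pg_current_wal_insert_lsn() - '0/0',
		        pg_current_wal_flush_lsn() - '0/0'");
note "insert=$insert, flush=$flush";    # insert >= flush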

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote:
> The problem was that the Perl functions advance_XXX() would sometimes
> see an "unflushed" insert position caused by hint records and other
> incidental stuff, and then we'd write our contrived header at a
> disconnected location after a bunch of (unwritten) zeroes.  The
> solution in this version is to start each test with
> emit_message($node, 0), which flushes a small record.

Makes sense.  Does this also address the other issue you've noticed on
v12 when ICU is enabled (the one with the initial LSN changed after
initdb)?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Tue, Sep 19, 2023 at 9:53 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote:
> > The problem was that the Perl functions advance_XXX() would sometimes
> > see an "unflushed" insert position caused by hint records and other
> > incidental stuff, and then we'd write our contrived header at a
> > disconnected location after a bunch of (unwritten) zeroes.  The
> > solution in this version is to start each test with
> > emit_message($node, 0), which flushes a small record.
>
> Makes sense.  Does this also address the other issue you've noticed on
> v12 when ICU is enabled (the one with the initial LSN changed after
> initdb)?

Yep.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Alexander Lakhin
Date:
19.09.2023 13:09, Thomas Munro wrote:
> On Tue, Sep 19, 2023 at 9:53 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote:
> > > The problem was that the Perl functions advance_XXX() would sometimes
> > > see an "unflushed" insert position caused by hint records and other
> > > incidental stuff, and then we'd write our contrived header at a
> > > disconnected location after a bunch of (unwritten) zeroes.  The
> > > solution in this version is to start each test with
> > > emit_message($node, 0), which flushes a small record.


Thanks for the fixes!

I've tested this patch version on Debian 9 and it works fine now.
But as to wal_log_hints, I see that v7 patches work for REL_12_STABLE and
REL_13_STABLE, but not for other branches, though v6 patches definitely
failed on REL_12_STABLE and REL_13_STABLE too. So the behavior improved,
but not for all versions, for some reason (I haven't looked deeper yet).
Also I've noticed a few minor discrepancies in the patches.
A commit message in patch.15 contains the sentence "We should fix ...",
not "We could fix ...", as in commit messages for other versions.
Also, patch.12 contains a duplicated "Reviewed-by" string.
And may be "xlp_prev" is a typo?

Best regards,
Alexander

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Wed, Sep 20, 2023 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've tested this patch version on Debian 9 and it works fine now.

Progress...

> But as to wal_log_hints, I see that v7 patches work for REL_12_STABLE and
> REL_13_STABLE, but not for other branches, though v6 patches definitely
> failed on REL_12_STABLE and REL_13_STABLE too. So the behavior improved,
> but not for all versions, for some reason (I haven't looked deeper yet).

Ahh, yeah, there was a second problem screwing up the LSN accounting.
I was trying to write correct xl_prev links, but also emitting
optional padding records to get into the right position (depending on
initial conditions varying between branches etc), and I forgot about
the extra COMMIT records that emit_message() generates.  Which, I
guess, comes back to Michael's observation that this would all be a
bit easier if we had a way to emit and flush a single record...

The solution in this version is to call get_insert_len() instead of
using the result of emit_message() for the values returned by the
advance_XXX() functions.  The result of emit_message() is actually the
LSN of the following COMMIT record so can't be used directly for
building xl_prev chains.
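
Roughly, that means reading the position back from the server instead
of trusting the function's return value (sketch; names approximate):

# Sketch of the idea: after emitting whatever records we need, ask the
# server for the current insert position (in bytes) and use that for
# the xl_prev bookkeeping, rather than emit_message()'s return value.
sub get_insert_lsn
{
	my $node = shift;
	return $node->safe_psql('postgres',
		"SELECT pg_current_wal_insert_lsn() - '0/0'");
}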

> Also I've noticed a few minor discrepancies in the patches.
> A commit message in patch.15 contains the sentence "We should fix ...",
> not "We could fix ...", as in commit messages for other versions.
> Also, patch.12 contains a duplicated "Reviewed-by" string.
> And may be "xlp_prev" is a typo?

Fixed.

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Wed, Sep 20, 2023 at 10:51:12AM +1200, Thomas Munro wrote:
> Ahh, yeah, there was a second problem screwing up the LSN accounting.
> I was trying to write correct xl_prev links, but also emitting
> optional padding records to get into the right position (depending on
> initial conditions varying between branches etc), and I forgot about
> the extra COMMIT records that emit_message() generates.  Which, I
> guess, comes back to Michael's observation that this would all be a
> bit easier if we had a way to emit and flush a single record...

Unfortunately this part is not going to be backpatched based on the
current consensus, so this is goint to need an alternate and/or fluffy
solution to force the flushes at given points:
https://www.postgresql.org/message-id/20230816003333.7hn2rx5m2l7una3d@awork3.anarazel.de

The test could be largely simplified on HEAD once the flush option is
available.  I'm just waiting a bit more on the other thread, keeping
an eye on the temperature, but it looks like nobody would complain
about making that optional, with the default being what we did
previously, at least..

> The solution in this version is to call get_insert_len() instead of
> using the result of emit_message() for the values returned by the
> advance_XXX() functions.  The result of emit_message() is actually the
> LSN of the following COMMIT record so can't be used directly for
> building xl_prev chains.

And by doing so the test paths would re-enter the inner loops
generating the records, so we only get out once we know that we are at
the wanted boundary, which is enough to stabilize the tests.
Smart.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Alexander Lakhin
Date:
20.09.2023 01:51, Thomas Munro wrote:
> The solution in this version is to call get_insert_len() instead of
> using the result of emit_message() for the values returned by the
> advance_XXX() functions.  The result of emit_message() is actually the
> LSN of the following COMMIT record so can't be used directly for
> building xl_prev chains.
>
>> And may be "xlp_prev" is a typo?
> Fixed.

I've retested this version (patches for all branches, plus wal_log_hints,
plus Valgrind, and also retested code changes with my
099_restart_with_stanby.pl) and found no issues. IMHO, this patch is
ready to float off.

Thank you!

Best regards,
Alexander



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
A thought: commit 8fcb32db prevented us from logging messages that are
too big to be decoded, but it wasn't back-patched.  I think that means
that in older branches, there is a behaviour change unrelated to the
"garbage bytes" problem discussed in this thread, and separate also
from the out-of-memory problem.  If someone generates a record too big
to decode, say with pg_logical_emit_message(), we will fail
differently.  Before this patch set, we'd bogusly detect end-of-WAL,
and after this patch we'd fail to palloc and recovery would bogusly
fail.  Which outcome is more bogus is hard to answer, and clearly we
should prevent it upstream, but didn't for technical reasons.  Do you
agree that that is a separate topic that doesn't prevent us from
committing this fix?



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Fri, Sep 22, 2023 at 03:11:54PM +1200, Thomas Munro wrote:
> A thought: commit 8fcb32db prevented us from logging messages that are
> too big to be decoded, but it wasn't back-patched.

Yes, there was one part done in ffd1b6bb6f8 that required ABI
breakage, as well.  There is an argument against a backpatch, as a
set of records in a given range may fail to replay while they were
allowed before (I forgot the exact math, but I recall that this was
for records larger than XLogRecordMaxSize, still lower than the max
allocation mark).

> I think that means
> that in older branches, there is a behaviour change unrelated to the
> "garbage bytes" problem discussed in this thread, and separate also
> from the out-of-memory problem.  If someone generates a record too big
> to decode, say with pg_logical_emit_message(), we will fail
> differently.  Before this patch set, we'd bogusly detect end-of-WAL,
> and after this patch we'd fail to palloc and recovery would bogusly
> fail.  Which outcome is more bogus is hard to answer, and clearly we
> should prevent it upstream, but didn't for technical reasons.  Do you
> agree that that is a separate topic that doesn't prevent us from
> committing this fix?

I don't see why it's a problem on HEAD: a startup process reacts the
same way for the end of WAL or an OOM.  If we were to FATAL hard on
all stable branches for non-FRONTEND on OOM, which is something we'll
have to do anyway, then this patch set improves the situation because
we would fail in what I'd see as better in this case: with an OOM.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Fri, Sep 22, 2023 at 4:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Sep 22, 2023 at 03:11:54PM +1200, Thomas Munro wrote:
> > A thought: commit 8fcb32db prevented us from logging messages that are
> > too big to be decoded, but it wasn't back-patched.
>
> Yes, there was one part done in ffd1b6bb6f8 that has required ABI
> breakages, as well.  There is an argument against a backpatch as a
> set of records in a given range may fail to replay while they were
> allowed before (I forgot the exact math, but I recall that this was
> for records larger than XLogRecordMaxSize, still lower than the max
> allocation mark.).

Hmm, OK, tricky choices.  Anyway, I guess the next thing to straighten
out in this area is the OOM policy.

Pushed.  Thanks Alexander and everyone for the report, testing and
help!  Time to watch the build farm to see if that Perl turns out to
be robust enough...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Sep 23, 2023 at 11:35:05AM +1200, Thomas Munro wrote:
> Hmm, OK, tricky choices.  Anyway, I guess the next thing to straighten
> out in this area is the OOM policy.
>
> Pushed.  Thanks Alexander and everyone for the report, testing and
> help!  Time to watch the build farm to see if that Perl turns out to
> be robust enough...

Looks pretty stable to me.  I can see one thing in wrasse related to
pack():
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2023-09-22%2023%3A23%3A23
Invalid type 'Q' in pack at t/039_end_of_wal.pl line 139.

I don't have the room to look at that now, perhaps a bit later today
depending on the flow of the wind.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 23, 2023 at 12:05 PM Michael Paquier <michael@paquier.xyz> wrote:
> Looks pretty stable to me.  I can see one thing in wrasse related to
> pack():
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2023-09-22%2023%3A23%3A23
> Invalid type 'Q' in pack at t/039_end_of_wal.pl line 139.

Oh, so although we were testing on 32 bit systems via CI (task
"test_world_32" on "Linux - Debian Bullseye - Meson"), in that case
the perl binary is still 64 bit.  Apparently a 32 bit perl binary (at
least of a certain vintage) doesn't like 'Q'.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Oh, so although we were testing on 32 bit systems via CI (task
> "test_world_32" on "Linux - Debian Bullseye - Meson"), in that case
> the perl binary is still 64 bit.  Apparently 32 bit perl binary (at
> least of a certain vintage) doesn't like 'Q'.

We discovered this once before, cf 945d2cb7d.

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 23, 2023 at 12:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Oh, so although we were testing on 32 bit systems via CI (task
> > "test_world_32" on "Linux - Debian Bullseye - Meson"), in that case
> > the perl binary is still 64 bit.  Apparently 32 bit perl binary (at
> > least of a certain vintage) doesn't like 'Q'.
>
> We discovered this once before, cf 945d2cb7d.

Here is my proposed fix.  I couldn't quite use that trick, but we can
safely assume that the higher order bits of any LSN generated by this
test are zero.  Patch tested on x86 and big-endian POWER.  If no one
has a more idiomatic perl trick for this, I'll push this soon.
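
The gist of it is to emit the value as two native 32-bit words instead
of one 'Q', along these lines (a sketch of the approach, not
necessarily the exact patch):

# Sketch: the test's LSNs all fit in 32 bits, so emit a 64-bit value as
# two native 32-bit words, ordered to match what pack('Q') would have
# produced on a 64-bit perl.
sub pack_u64_lo32
{
	my $value = shift;
	die "value too large" if $value > 0xFFFFFFFF;
	my ($lo, $hi) = (pack('L', $value), pack('L', 0));
	# pack('S', 1) starts with "\x01" on little-endian machines.
	return substr(pack('S', 1), 0, 1) eq "\x01" ? $lo . $hi : $hi . $lo;
}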

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 23, 2023 at 12:05 PM Michael Paquier <michael@paquier.xyz> wrote:
> Looks pretty stable to me.  I can see one thing in wrasse related to
> pack():

Hmm, copperhead (riscv) showed an unusual failure, a segfault in
suspiciously nearby code.  I don't immediately know what that's about,
let's see if we get more clues...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 23, 2023 at 1:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here is my proposed fix.  I couldn't quite use that trick, but we can
> safely assume that the higher order bits of any LSN generated by this
> test are zero.  Patch tested on x86 and big-endian POWER.  If no one
> has a more idiomatic perl trick for this, I'll push this soon.

And done, after running through pgperltidy.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sat, Sep 23, 2023 at 02:02:02PM +1200, Thomas Munro wrote:
> Hmm, copperhead (riscv) showed an unusual failure, a segfault in
> suspiciously nearby code.  I don't immediately know what that's about,
> let's see if we get more clues...

Yep, something is going on with the prefetching code:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-09-22%2023%3A16%3A33

Using host libthread_db library "/lib/riscv64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: paris: startup recovering 000000030000000000000003   '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pg_comp_crc32c_sb8 (crc=1613114916, crc@entry=4294967295,
data=data@entry=0x2af9e00d48, len=<optimized out>) at pg_crc32c_sb8.c:56
56              uint32 a = *p4++ ^ crc;
#0  pg_comp_crc32c_sb8 (crc=1613114916, crc@entry=4294967295,
data=data@entry=0x2af9e00d48, len=<optimized out>) at pg_crc32c_sb8.c:56
#1  0x0000002ad59a1536 in ValidXLogRecord (state=0x2af9db1fc0,
record=0x2af9e00d30, recptr=50520048) at xlogreader.c:1195
#2  0x0000002ad59a285a in XLogDecodeNextRecord
(state=state@entry=0x2af9db1fc0, nonblocking=<optimized out>) at
xlogreader.c:842
#3  0x0000002ad59a28c0 in XLogReadAhead (state=0x2af9db1fc0,
nonblocking=nonblocking@entry=false) at xlogreader.c:969
#4  0x0000002ad59a0996 in XLogPrefetcherNextBlock
(pgsr_private=184580836680, lsn=0x2af9e14618) at xlogprefetcher.c:496
#5  0x0000002ad59a11c8 in lrq_prefetch (lrq=<optimized out>) at
xlogprefetcher.c:256
#6  lrq_complete_lsn (lsn=<optimized out>, lrq=0x2af9e145c8) at
xlogprefetcher.c:294
#7  XLogPrefetcherReadRecord
(prefetcher=prefetcher@entry=0x2af9e00d48,
errmsg=errmsg@entry=0x3fec9a2bf0) at xlogprefetcher.c:1041

The stack may point at a different issue, but perhaps this is a
matter where we're now returning XLREAD_SUCCESS where previously we
had XLREAD_FAIL, causing this code to fail because it thinks the block
is valid while it's not?
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sat, Sep 23, 2023 at 4:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> The stack may point out at a different issue, but perhaps this is a
> matter where we're returning now XLREAD_SUCCESS where previously we
> had XLREAD_FAIL, causing this code to fail thinking that the block was
> valid while it's not?

"grison" has a little more detail --  we see
pg_comp_crc32c_sb8(len=4294636456).  I'm wondering how to reproduce
this, but among the questions that jump out I have: why was it ever OK
that we load record->xl_tot_len into total_len, perform header
validation, determine that total_len < len (= this record is all on
one page, no reassembly loop needed, so now we're in the single-page
branch), then call ReadPageInternal() again, then call
ValidXLogRecord() which internally loads record->xl_tot_len *again*?
ReadPageInternal() might have changed xl_tot_len, no?  That seems to
be a possible pathway to reading past the end of the buffer in the CRC
check, no?

If that value didn't change underneath us, I think we'd need an
explanation for how we finished up in the single-page branch at
xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet,
though it might take more coffee.  (Possibly supporting the re-read
theory is the fact that it's only happening on a few very slow
computers, though I have no idea why it would only happen on master
[so far at least].)



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sun, Sep 24, 2023 at 09:48:42AM +1300, Thomas Munro wrote:
> "grison" has a little more detail --  we see
> pg_comp_crc32c_sb8(len=4294636456).  I'm wondering how to reproduce
> this, but among the questions that jump out I have: why was it ever OK
> that we load record->xl_tot_len into total_len, perform header
> validation, determine that total_len < len (= this record is all on
> one page, no reassembly loop needed, so now we're in the single-page
> branch), then call ReadPageInternal() again, then call
> ValidXLogRecord() which internally loads record->xl_tot_len *again*?
> ReadPageInternal() might have changed xl_tot_len, no?  That seems to
> be a possible pathway to reading past the end of the buffer in the CRC
> check, no?
>
> If that value didn't change underneath us, I think we'd need an
> explanation for how we finished up in the single-page branch at
> xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet,
> though it might take more coffee.  (Possibly supporting the re-read
> theory is the fact that it's only happening on a few very slow
> computers, though I have no idea why it would only happen on master
> [so far at least].)

Hmm, it looks pretty clear that this is a HEAD-only thing, as the
buildfarm shows and as you say, and my primary suspect here would be
71e4cc6b8ec6, I think.  Any race condition underneath it would be
easier to see on slower machines.  So it seems quite possible that
this has messed up the page insertion logic.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Sun, Sep 24, 2023 at 12:23 PM Michael Paquier <michael@paquier.xyz> wrote:
> Hmm, it looks pretty clear that this is a HEAD-only thing, as the
> buildfarm shows and as you say, and my primary suspect here would be
> 71e4cc6b8ec6, I think.  Any race condition underneath it would be
> easier to see on slower machines.  So it seems quite possible that
> this has messed up the page insertion logic.

I see there was a failure on 16 on the very slow AIX box, and I have
access so looking into that...



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> I see there was a failure on 16 on the very slow AIX box, and I have
> access so looking into that...

Lucky you, if I may say ;)

A bunch of architectures that are not Intel are failing.  Here is a
summary based on the buildfarm reports:
topminnow, mips64el with gcc 4.9.2
mereswine, ARMv7 with gcc 10.2.1
sungazer, ppc64 with gcc 8.3.0
frogfish, mips64el with gcc 4.6.3
mamba, macppc with gcc 10.4.0
gull, ARMv7 with clang 13.0.0
grison, ARMv7 with gcc 4.6.3
copperhead, riscv64 with gcc 10.X

The only machine like that I have close by is tanager on Armv7 (it
has not reported to the buildfarm for a few weeks as it overheated
because of the summer here, but I've put it back online now).
However, it passed a few hundred cycles with both gcc and clang
yesterday, on top of having a clean buildfarm run.

With sungazer now failing on REL_16_STABLE, it feels to me that we are
actually looking at two bugs?  One on HEAD, and one in stable
branches?  For HEAD and the 2PC failure, the records up to PREPARE
TRANSACTION should be replayed by the standby getting promoted, but
I'd rather dig into that with a host that's able to report the
failure.

copperhead seems to be one of the failing hosts that's able to compile
things quickly.  Tom, Noah, or copperhead's owner, would it be
possible to get access to one of the failing hosts for more
investigation?  I would not do more than compile the code and check
the state of the 2PC test for this promotion failure.
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> > I see there was a failure on 16 on the very slow AIX box, and I have
> > access so looking into that...
>
> Lucky you, if I may say ;)

FTR anyone involved with an open source project can get an account on
the GCC compile farm machines.  That particular machine is so
overloaded that it's practically unusable (~8 hours to run the test,
hard to run vi etc).

> A bunch of architectures that are not Intel are failing.  Here is a
> summary based on the buildfarm reports:
> topminnow, mips64el with gcc 4.9.2
> mereswine, ARMv7 with gcc 10.2.1
> sungazer, ppc64 with gcc 8.3.0
> frogfish, mips64el with gcc 4.6.3
> mamba, macppc with gcc 10.4.0
> gull, ARMv7 with clang 13.0.0
> grison, ARMv7 with gcc 4.6.3
> copperhead, riscv64 with gcc 10.X
>
> The only machine like that I have close by is tanager on Armv7 (it
> has not reported to the buildfarm for a few weeks as it overheated
> because of the summer here, but I've put it back online now).
> However, it passed a few hundred cycles with both gcc and clang
> yesterday, on top of having a clean buildfarm run.

One thing that the failing systems have in common is that they are
extremely slow: 3 to 8 hours to complete the tests.  turaco is an
armv7 system that doesn't fail, but it's much faster; at a guess, it's
probably something like an armv8 CPU that is just running 32-bit armv7
software, not a real old-school armv7 chip.

Which gives me the idea to try these tests under qemu...

> With sungazer now failing on REL_16_STABLE, it feels to me that we are
> actually looking at two bugs?  One on HEAD, and one in stable
> branches?  For HEAD and the 2PC failure, the records up to PREPARE
> TRANSACTION should be replayed by the standby getting promoted, but
> I'd rather dig into that with a host that's able to report the
> failure.

Oh, right yeah that is quite different and could even be unrelated.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Noah Misch
Date:
On Mon, Sep 25, 2023 at 01:18:56PM +1300, Thomas Munro wrote:
> On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> > > I see there was a failure on 16 on the very slow AIX box, and I have
> > > access so looking into that...
> >
> > Lucky you, if I may say ;)
> 
> FTR anyone involved with an open source project can get an account on
> the GCC compile farm machines.  That particular machine is so
> overloaded that it's practically unusable (~8 hours to run the test,
> hard to run vi etc).

topminnow (gcc23) and frogfish (gcc22) are also part of the compile farm.
They're faster than sungazer (gcc111) for src/test/recovery.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
>> A bunch of architectures that are not Intel are failing.  Here is a
>> summary based on the buildfarm reports:
>> mamba, macppc with gcc 10.4.0

That one is mine, so let me know if there's something particular
you'd like me to check.

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Sun, Sep 24, 2023 at 11:07:08PM -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
>>> A bunch of architectures that are not Intel are failing.  Here is a
>>> summary based on the buildfarm reports:
>>> mamba, macppc with gcc 10.4.0
>
> That one is mine, so let me know if there's something particular
> you'd like me to check.

That would help, thanks!  Would it be possible to check whether the
attached patch, applied on HEAD, changes the fate of the recovery
test 009_twophase.pl?  That's one thing I am really suspicious of,
and we've not done much since REL_16_STABLE was forked...

(I'm still trying to get an environment where this can be reproduced,
but qemu takes time.)
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 25, 2023 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> >> A bunch of architectures that are not Intel are failing.  Here is a
> >> summary based on the buildfarm reports:
> >> mamba, macppc with gcc 10.4.0
>
> That one is mine, so let me know if there's something particular
> you'd like me to check.

Thanks!  But after spending the best part of a day installing and
compiling stuff at snail's pace, I have just this moment managed to
reproduce this reliably on an emulator running Debian armhf (= armv7).
So hopefully I can track the issue down from here without bothering
build farm owners.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Sep 25, 2023 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That one is mine, so let me know if there's something particular
>> you'd like me to check.

> Thanks!  But after spending the best part of a day installing and
> compiling stuff at snail's pace, I have just this moment managed to
> reproduce this reliably on an emulator running Debian armhf (= armv7).
> So hopefully I can track the issue down from here without bothering
> build farm owners.

FWIW, here's what I see on mamba:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  pg_comp_crc32c_sb8 (crc=3197562372, crc@entry=4294967295,
    data=data@entry=0xfda6e030, len=<optimized out>) at pg_crc32c_sb8.c:56
56                      uint32          a = *p4++ ^ crc;
(gdb) bt
#0  pg_comp_crc32c_sb8 (crc=3197562372, crc@entry=4294967295,
    data=data@entry=0xfda6e030, len=<optimized out>) at pg_crc32c_sb8.c:56
#1  0x0195ae5c in ValidXLogRecord (state=0xfde5cc88, record=0xfda6e018,
    recptr=<optimized out>) at xlogreader.c:1195
#2  0x0195cf18 in XLogDecodeNextRecord (state=state@entry=0xfde5cc88,
    nonblocking=<optimized out>) at xlogreader.c:842
#3  0x0195d0c8 in XLogReadAhead (state=0xfde5cc88,
    nonblocking=nonblocking@entry=false) at xlogreader.c:969
#4  0x01959760 in XLogPrefetcherNextBlock (pgsr_private=4259700408,
    lsn=0xfda6b8a0) at xlogprefetcher.c:496
#5  0x0195a72c in lrq_prefetch (lrq=<optimized out>) at xlogprefetcher.c:256
#6  lrq_complete_lsn (lsn=50520048, lrq=0xfda6b858) at xlogprefetcher.c:294
#7  XLogPrefetcherReadRecord (prefetcher=prefetcher@entry=0xfde5deb8,
    errmsg=errmsg@entry=0xffffdd88) at xlogprefetcher.c:1041
#8  0x01960054 in ReadRecord (xlogprefetcher=0xfde5deb8, emode=emode@entry=15,
    fetching_ckpt=fetching_ckpt@entry=false, replayTLI=replayTLI@entry=3)
    at xlogrecovery.c:3067
#9  0x01963188 in PerformWalRecovery () at xlogrecovery.c:1756
#10 0x01952594 in StartupXLOG () at xlog.c:5470
#11 0x01c26e3c in StartupProcessMain () at startup.c:267
#12 0x01c1b504 in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess)
    at auxprocess.c:141
#13 0x01c2276c in StartChildProcess (type=StartupProcess) at postmaster.c:5347
#14 0x01c260ec in PostmasterMain (argc=argc@entry=4,
    argv=argv@entry=0xffffe1e4) at postmaster.c:1457
#15 0x01ee39d8 in main (argc=4, argv=0xffffe1e4) at main.c:198

(gdb) f 1
#1  0x0195ae5c in ValidXLogRecord (state=0xfde5cc88, record=0xfda6e018,
    recptr=<optimized out>) at xlogreader.c:1195
1195            COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
(gdb) p *record
$3 = {xl_tot_len = 0, xl_xid = 0, xl_prev = 0, xl_info = 0 '\000',
  xl_rmid = 0 '\000', xl_crc = 0}

Sure looks like ValidXLogRecord is assuming that record->xl_tot_len
can be trusted without reservation.

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 25, 2023 at 5:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sure looks like ValidXLogRecord is assuming that record->xl_tot_len
> can be trusted without reservation.

I think there is something at least theoretically a bit fishy about
the ancient code that re-reads the page, doesn't consider short reads,
and assumes that record->xl_tot_len hasn't changed since last time it
loaded and validated it.  But I think that's also very unlikely to go
wrong, and it's not the problem here.

The problem is that bae868ca removed something we still need:

-               /* XXX: more validation should be done here */
-               if (total_len < SizeOfXLogRecord)
-               {
-                       report_invalid_record(state,
-                                             "invalid record length at %X/%X: expected at least %u, got %u",
-                                             LSN_FORMAT_ARGS(RecPtr),
-                                             (uint32) SizeOfXLogRecord, total_len);
-                       goto err;
-               }
+               /* We'll validate the header once we have the next page. */
                gotheader = false;

If you happen to run into zeroes where an xl_tot_len is wanted right
at the end of a page (or any value not big enough to get you to the
next page), we fall through to the single-page branch and then go
directly to the CRC check, but ValidXLogRecord() subtracts
SizeOfXLogRecord and gets a crazy big length.  The CRC
implementations on modern computers happen to use pointer arithmetic
that terminates immediately without accessing any memory, which is
why nothing was obviously wrong on most systems.  The _sb8.c
implementation used on older ARM, MIPS etc. uses a length-based loop
and reads off into deep space.
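
Concretely, the CRC call in ValidXLogRecord() is

    COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
                record->xl_tot_len - SizeOfXLogRecord);

so with xl_tot_len == 0 the unsigned subtraction wraps around to a
huge length.  What happens next depends on the shape of the CRC loop;
very roughly (simplified, not the real code):

    /* hardware-CRC style: the huge len makes pend wrap to below p, so
     * the loop condition is false immediately and no memory is read */
    const unsigned char *pend = p + len;
    while (p + 8 <= pend)
        ...

    /* pg_crc32c_sb8.c style: counts len down, dereferencing as it
     * goes, so it marches straight off the end of the buffer */
    while (len >= 8)
    {
        uint32      a = *p4++ ^ crc;
        ...
        len -= 8;
    }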

Draft patch attached, including a new test for 039_end_of_wal.pl that
fails on all systems without the above code.

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Sun, Sep 24, 2023 at 11:07:08PM -0400, Tom Lane wrote:
>> That one is mine, so let me know if there's something particular
>> you'd like me to check.

> That would help, thanks!  Would it be possible to check whether the
> attached patch, applied on HEAD, changes the fate of the recovery
> test 009_twophase.pl?  That's one thing I am really suspicious of,
> and we've not done much since REL_16_STABLE was forked...

Applying this does *not* fix it.  Stack trace looks the same
as in my previous message.

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> If you happen to run into zeroes where an xl_tot_len is wanted right
> at the end of a page (or any value not big enough to get you to the
> next page), we fall through to the single-page branch and then go
> directly to the CRC check, but ValidXLogRecord() subtracts
> SizeOfXLogRecord and gets a crazy big length.  The CRC
> implementations on modern computers happen to use pointer arithmetic
> that terminates immediately without accessing any memory, which is
> why nothing was obviously wrong on most systems.  The _sb8.c
> implementation used on older ARM, MIPS etc. uses a length-based loop
> and reads off into deep space.

Ah-hah.  Maybe there should be an Assert in the modern CRC code path
that detects the bogus arguments?
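
Something along these lines, perhaps -- just a sketch of the idea,
the right place and the exact bound are debatable:

    /* hypothetical guard at the top of the hardware CRC routines;
     * any legitimate caller passes far less than this */
    Assert(len < (size_t) 1024 * 1024 * 1024);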

> Draft patch attached, including a new test for 039_end_of_wal.pl that
> fails on all systems without the above code.

Both the 009 and 039 tests pass on mamba with this patch.  I did
not read the code, but testing-wise it seems good.

            regards, tom lane



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Mon, Sep 25, 2023 at 01:13:37AM -0400, Tom Lane wrote:
> Applying this does *not* fix it.  Stack trace looks the same
> as in my previous message.

Okay.  Thanks anyway for the test!
--
Michael

Attachment

Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Thomas Munro
Date:
On Mon, Sep 25, 2023 at 6:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ah-hah.  Maybe there should be an Assert in the modern CRC code path
> that detects the bogus arguments?

Yeah.  Assertion added, and pushed.  Thanks for confirming on mamba,
and thanks also to Alexander for an off-list test report.



Re: BUG #17928: Standby fails to decode WAL on termination of primary

From
Michael Paquier
Date:
On Tue, Sep 26, 2023 at 11:14:52AM +1300, Thomas Munro wrote:
> Yeah.  Assertion added, and pushed.  Thanks for confirming on mamba,
> and thanks also to Alexander for an off-list test report.

Yeah++, thanks!
--
Michael

Attachment