BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17928: Standby fails to decode WAL on termination of primary
Date
Msg-id 17928-aa92416a70ff44a2@postgresql.org
Whole thread Raw
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17928
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

When a primary server is stopped, standby might fail to replay WAL with
the memory allocation error, for example:
2023-05-11 08:42:38.254 MSK [1706654] LOG:  recovery restart point at
7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL:  Last completed transaction
was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG:  restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL:  could not receive data from
WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL:  invalid memory alloc request
size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG:  startup process (PID 1706656)
exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG:  terminating any other active
server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG:  shutting down due to startup
process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG:  database system is shut down

The call stack of the error:
...
#6  0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at
mcxt.c:1143
#7  0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08,
xl_tot_len=2021161080, 
    allow_oversized=true) at xlogreader.c:524
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
#9  0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock
(pgsr_private=94350447959632, lsn=0x55cfacf4bc58)
    at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at
xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08,
lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord
(prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458)
    at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50,
emode=15, fetching_ckpt=false, 
    replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at
auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at
postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at
postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202

(gdb) frame 8
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
689             decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev =
0x7878787878787878, xl_info = 0x0, 
  xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0

hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56  00 00 00 00 00 00 00 00 ...@...V........
00703ff0 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00  00 40 70 30 00 00 00 00 .........@p0....
00704010 db 00 00 00 00 00 00 00  78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00  33 02 00 00 1b 0d 00 00 ........3.......

hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56  00 00 00 00 00 00 00 00 ...@...V........
00703ff0 33 02 00 00 54 2a 00 00  e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00  00 40 70 b1 00 00 00 00 .........@p.....
00704010 23 02 00 00 00 00 00 00  80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00  05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00  02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78  78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78  78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78  01 00 00 00 00 00 00 00 xxxxxxxx........

(In this case, rows with 500 'x' were inserted when the primary was
stopped.)

`git bisect` for this behavior blames 3f1ce9734 (where
XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).

A reproducer for the anomaly to follow.


pgsql-bugs by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: BUG #17927: Postgres does not compile with LLVM-16
Next
From: Alvaro Herrera
Date:
Subject: Re: BUG #17926: Segfault in SELECT