BUG #18426: Canceling vacuum while truncating a relation leads to standby PANIC - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18426: Canceling vacuum while truncating a relation leads to standby PANIC
Date
Msg-id 18426-2d18da6586f152d6@postgresql.org
Whole thread Raw
Responses Re: BUG #18426: Canceling vacuum while truncating a relation leads to standby PANIC
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18426
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

When running test 027_stream_regress on a rather slow ARMv7 device, I
came upon the following test failure:

[21:31:30.789](489.962s) # poll_query_until timed out executing this
query:
# SELECT '0/16952DE0' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# 
# with stderr:
timed out waiting for catchup at t/027_stream_regress.pl line 100.

log/027_stream_regress_standby_1.log ends with:
...
2024-04-04 21:04:01.946 UTC [17903] LOG:  recovery restart point at
0/11C71430
2024-04-04 21:04:01.946 UTC [17903] DETAIL:  Last completed transaction was
at log time 2024-04-04 21:02:19.456471+03.
2024-04-04 21:06:16.014 UTC [17905] WARNING:  page 7 of relation
base/16384/2606 does not exist
2024-04-04 21:06:16.014 UTC [17905] CONTEXT:  WAL redo at 0/13841DE8 for
Heap2/PRUNE: snapshotConflictHorizon: 13050, nredirected: 0, ndead: 14,
nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5, 6, 28, 31, 32, 33, 34, 35,
36, 37], unused: [7]; blkref #0: rel 1663/16384/2606, blk 7
2024-04-04 21:06:16.014 UTC [17905] PANIC:  WAL contains references to
invalid pages
2024-04-04 21:06:16.014 UTC [17905] CONTEXT:  WAL redo at 0/13841DE8 for
Heap2/PRUNE: snapshotConflictHorizon: 13050, nredirected: 0, ndead: 14,
nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5, 6, 28, 31, 32, 33, 34, 35,
36, 37], unused: [7]; blkref #0: rel 1663/16384/2606, blk 7
2024-04-04 21:06:16.927 UTC [17888] LOG:  startup process (PID 17905) was
terminated by signal 6: Aborted

t_027_stream_regress_standby_1_data/pgdata/pg_wal/000000010000000000000013
contains:
...
rmgr: XLOG        len (rec/tot):     49/  8241, tx:      13103, lsn:
0/137C3538, prev 0/137C14E8, desc: FPI , blkref #0: rel 1663/16384/2606 fork
vm blk 0 FPW
rmgr: Storage     len (rec/tot):     46/    46, tx:      13103, lsn:
0/137C5588, prev 0/137C3538, desc: TRUNCATE base/16384/2606 to 7 blocks
flags 7
...
rmgr: Transaction len (rec/tot):     38/    38, tx:      13103, lsn:
0/137C7588, prev 0/137C7540, desc: ABORT 2024-04-04 21:05:31.176869 UTC
### That is, the relation 2606 (pg_constraint) was truncated in
### transaction 13103, and then the transaction was aborted.
...
rmgr: Heap2       len (rec/tot):     85/    85, tx:          0, lsn:
0/13841DE8, prev 0/13841DA8, desc: PRUNE snapshotConflictHorizon: 13050,
nredirected: 0, ndead: 14, nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5,
6, 28, 31, 32, 33, 34, 35, 36, 37], unused: [7], blkref #0: rel
1663/16384/2606 blk 7
### The record that standby could not replay because it did truncate
### pg_constraint when replaying 0/137C5588.

027_stream_regress_primary.log  contains:
2024-04-04 21:05:31.176 UTC [29329] ERROR:  canceling autovacuum task
2024-04-04 21:05:31.176 UTC [29329] CONTEXT:  while truncating relation
"pg_catalog.pg_constraint" to 7 blocks
        automatic vacuum of table "regression.pg_catalog.pg_constraint"

As far as I can see, the TRUNCATE record was WAL-logged by the autovacuum
process in table_relation_vacuum() -> heap_vacuum_rel() ->
lazy_truncate_heap() -> RelationTruncate(), and then that process was
interrupted inside RelationTruncate() -> smgrtruncate() ->
DropRelationBuffers() -> InvalidateBuffer() -> WaitIO().

A demo test for the issue to follow...


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18425: KB5036892 Microsoft Update Package Issue
Next
From: Tom Lane
Date:
Subject: Re: BUG #18422: Assert in expandTupleDesc() fails on row mismatch with additional SRF