Thread: BUG #18426: Canceling vacuum while truncating a relation leads to standby PANIC

BUG #18426: Canceling vacuum while truncating a relation leads to standby PANIC

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18426
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

When running test 027_stream_regress on a rather slow ARMv7 device, I
came upon the following test failure:

[21:31:30.789](489.962s) # poll_query_until timed out executing this
query:
# SELECT '0/16952DE0' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# 
# with stderr:
timed out waiting for catchup at t/027_stream_regress.pl line 100.

log/027_stream_regress_standby_1.log ends with:
...
2024-04-04 21:04:01.946 UTC [17903] LOG:  recovery restart point at
0/11C71430
2024-04-04 21:04:01.946 UTC [17903] DETAIL:  Last completed transaction was
at log time 2024-04-04 21:02:19.456471+03.
2024-04-04 21:06:16.014 UTC [17905] WARNING:  page 7 of relation
base/16384/2606 does not exist
2024-04-04 21:06:16.014 UTC [17905] CONTEXT:  WAL redo at 0/13841DE8 for
Heap2/PRUNE: snapshotConflictHorizon: 13050, nredirected: 0, ndead: 14,
nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5, 6, 28, 31, 32, 33, 34, 35,
36, 37], unused: [7]; blkref #0: rel 1663/16384/2606, blk 7
2024-04-04 21:06:16.014 UTC [17905] PANIC:  WAL contains references to
invalid pages
2024-04-04 21:06:16.014 UTC [17905] CONTEXT:  WAL redo at 0/13841DE8 for
Heap2/PRUNE: snapshotConflictHorizon: 13050, nredirected: 0, ndead: 14,
nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5, 6, 28, 31, 32, 33, 34, 35,
36, 37], unused: [7]; blkref #0: rel 1663/16384/2606, blk 7
2024-04-04 21:06:16.927 UTC [17888] LOG:  startup process (PID 17905) was
terminated by signal 6: Aborted

t_027_stream_regress_standby_1_data/pgdata/pg_wal/000000010000000000000013
contains:
...
rmgr: XLOG        len (rec/tot):     49/  8241, tx:      13103, lsn:
0/137C3538, prev 0/137C14E8, desc: FPI , blkref #0: rel 1663/16384/2606 fork
vm blk 0 FPW
rmgr: Storage     len (rec/tot):     46/    46, tx:      13103, lsn:
0/137C5588, prev 0/137C3538, desc: TRUNCATE base/16384/2606 to 7 blocks
flags 7
...
rmgr: Transaction len (rec/tot):     38/    38, tx:      13103, lsn:
0/137C7588, prev 0/137C7540, desc: ABORT 2024-04-04 21:05:31.176869 UTC
### That is, the relation 2606 (pg_constraint) was truncated in
### transaction 13103, and then the transaction was aborted.
...
rmgr: Heap2       len (rec/tot):     85/    85, tx:          0, lsn:
0/13841DE8, prev 0/13841DA8, desc: PRUNE snapshotConflictHorizon: 13050,
nredirected: 0, ndead: 14, nunused: 1, redirected: [], dead: [1, 2, 3, 4, 5,
6, 28, 31, 32, 33, 34, 35, 36, 37], unused: [7], blkref #0: rel
1663/16384/2606 blk 7
### The record the standby could not replay: having already replayed the
### TRUNCATE at 0/137C5588, it had truncated pg_constraint to 7 blocks,
### so block 7 no longer exists there.

027_stream_regress_primary.log  contains:
2024-04-04 21:05:31.176 UTC [29329] ERROR:  canceling autovacuum task
2024-04-04 21:05:31.176 UTC [29329] CONTEXT:  while truncating relation
"pg_catalog.pg_constraint" to 7 blocks
        automatic vacuum of table "regression.pg_catalog.pg_constraint"

As far as I can see, the TRUNCATE record was WAL-logged by the autovacuum
process in table_relation_vacuum() -> heap_vacuum_rel() ->
lazy_truncate_heap() -> RelationTruncate(), and then that process was
interrupted inside RelationTruncate() -> smgrtruncate() ->
DropRelationBuffers() -> InvalidateBuffer() -> WaitIO().
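
To make the ordering concrete, here is a rough sketch of the sequence
described above, simplified from RelationTruncate() in
src/backend/catalog/storage.c (REL_16_STABLE); the local variable names are
illustrative only:

/* Step 1: WAL-log the truncation.  From this point on, a standby that
 * replays the record will shrink its copy of the relation. */
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
(void) XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);

/* Step 2: drop buffers and only then shrink the files on the primary.
 * This is where smgrtruncate() -> DropRelationBuffers() ->
 * InvalidateBuffer() -> WaitIO() runs.  A query cancel accepted while
 * waiting here aborts the (auto)vacuum with the WAL record from step 1
 * already written but the primary's files still at their old length. */
smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);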

A demo test for the issue to follow...


10.04.2024 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18426
> ...
> A demo test for the issue to follow...

Please try the attached patch (for REL_16_STABLE), which:
1) adds a delay inside InvalidateBuffer() to widen the window for canceling
   the operation at that point;
2) invokes ProcessInterrupts() there -- the call that would otherwise be
   reached, via ConditionVariableSleep() -> ConditionVariableTimedSleep(),
   only when (buf_state & BM_IO_IN_PROGRESS) (see the sketch after this
   list);
(If you find these changes illegitimate, I'll look for another approach.)

3) contains test 099_stream_vacuum, based on 027_stream_regress.
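
(The patch is attached rather than inlined; purely for illustration, the
kind of injection meant in items 1 and 2 might look roughly like the
fragment below -- the delay length and the exact spot inside
InvalidateBuffer() are assumptions, not the attached code:)

/* Hypothetical fault injection inside InvalidateBuffer() in bufmgr.c;
 * placement and delay chosen only to make the race easy to hit in a test. */
pg_usleep(10000L);      /* widen the window for a pending cancel to arrive */
ProcessInterrupts();    /* the call ConditionVariableTimedSleep() would reach
                         * for us if (buf_state & BM_IO_IN_PROGRESS) had sent
                         * this backend into WaitIO() */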

make -s check -C src/test/recovery/ PROVE_TESTS="t/099*"
fails for me roughly one run in two, just like the failure observed in
the wild:
...
2024-04-10 17:20:54.521 UTC|||6616ca5a.26a393|WARNING:  page 90 of relation base/16402/16929 does not exist
2024-04-10 17:20:54.521 UTC|||6616ca5a.26a393|CONTEXT:  WAL redo at 0/DCD12AC0 for Heap2/VISIBLE: 
snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel 1663/16402/16929, fork 2, blk 0; blkref #1: rel 
1663/16402/16929, blk 90
2024-04-10 17:20:54.521 UTC|||6616ca5a.26a393|PANIC:  WAL contains references to invalid pages
2024-04-10 17:20:54.521 UTC|||6616ca5a.26a393|CONTEXT:  WAL redo at 0/DCD12AC0 for Heap2/VISIBLE: 
snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel 1663/16402/16929, fork 2, blk 0; blkref #1: rel 
1663/16402/16929, blk 90
2024-04-10 17:20:54.532 UTC|||6616ca5a.26a390|LOG:  startup process (PID 2532243) was terminated by signal 6: Aborted
...

Best regards,
Alexander
Attachment
On Thu, Apr 11, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 10.04.2024 14:00, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      18426
> > ...
> > A demo test for the issue to follow...

I didn't try your test but your explanation seems clear.
RelationTruncate() logs first, then calls smgrtruncate() which drops
buffers and then truncates files.  The dropping-the-buffers phase is
now interruptible, since commit d87251048a0f.  If you interrupt it
there, the situation is bad: you have logged the truncation, but left
(1) buffers and (2) untruncated files on the primary.  Relation size
being out of sync is a recipe for that PANIC next time the WAL
mentions blocks past the (primary's) end.  First thought is that that
particular wait might need to hold interrupts.  Hmm.  The comments for
RelationTruncate() contemplate but reject a critical section.
Presumably it's waiting for another backend to flush data, and that
other backend will eventually finish doing that or fail/crash.



On Fri, Apr 12, 2024 at 6:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> mentions blocks past the (primary's) end.  First thought is that that

s/primary/standby/



On Fri, Apr 12, 2024 at 6:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Apr 11, 2024 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > 10.04.2024 14:00, PG Bug reporting form wrote:
> > > The following bug has been logged on the website:
> > >
> > > Bug reference:      18426
> > > ...
> > > A demo test for the issue to follow...
>
> I didn't try your test but your explanation seems clear.
> RelationTruncate() logs first, then calls smgrtruncate() which drops
> buffers and then truncates files.  The dropping-the-buffers phase is
> now interruptible, since commit d87251048a0f.  If you interrupt it
> there, the situation is bad: you have logged the truncation, but left
> (1) buffers and (2) untruncated files on the primary.  Relation size
> being out of sync is a recipe for that PANIC next time the WAL
> mentions blocks past the (primary's) end.  First thought is that that
> particular wait might need to hold interrupts.  Hmm.  The comments for
> RelationTruncate() contemplate but reject a critical section.
> Presumably it's waiting for another backend to flush data, and that
> other backend will eventually finish doing that or fail/crash.

That surely needs fixing, but while thinking about the difference
between holding interrupts and declaring a critical section, I'm
wondering if the lack of the latter has other pre-existing nasty
failure modes:

1.  We throw away potentially dirty buffers, and then we ereport while
trying to truncate a file: now what stops some old ghost block
contents from coming back to life (read from disk in the untruncated
file)?
2.  We already told downstream servers to truncate.  Now the sizes are
out of sync, so what stops us logging more references to the ghost
pages and panicking replicas?  (Same as this interruption issue).



After looking into that a bit more and rediscovering/re-remembering
all the previous threads about RelationTruncate()'s failures, which
d87251048a0f has made worse, I now think we should do one of two
things:

1.  Fix bug #18146[1].  Then this WaitIO() problem will be fixed as a
side-effect.  I posted a new patch over there.

2.  If somehow we can't agree on that critical section in back
branches (for example if we finish up deciding to do the big
re-engineering project for truncation where the buffer pool has new
states, ie fix in HEAD only), then at least wrap the truncation in
HOLD_INTERRUPTS() for releases 14-17.
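
For illustration, a minimal sketch of the shape option 2 would take (the
exact span of RelationTruncate() to cover is an assumption here; the point
is only that a cancel must not be able to abort the transaction between
WAL-logging the truncation and actually shrinking the files):

HOLD_INTERRUPTS();

/* WAL-log the truncation, as today ... */

/* ... then drop buffers and truncate the files; a cancel arriving while
 * DropRelationBuffers() waits in WaitIO() is now deferred rather than
 * thrown as an ERROR. */
smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);

RESUME_INTERRUPTS();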

[1] https://www.postgresql.org/message-id/flat/18146-04e908c662113ad5%40postgresql.org