Melanie Plageman <melanieplageman@gmail.com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]) . I haven't looked into why, but I suspect something is wrong.
It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:
# Continue the checkpoint.
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');
That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing. On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:
2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG: statement: select
pg_replication_slot_advance('slot_physical',pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: statement: select
injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
file(s)added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s,
average=0.000s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request
But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.
If this is an accurate diagnosis then it indicates both a test bug
(it should delay here, or else the comment needs fixed to explain
what we're actually testing) and a backend bug, because an immediate
stop a/k/a crash before completing the checkpoint should not lead to
failure to function after the next restart.
regards, tom lane