Hello hackers,
Investigating a recent buildfarm failure [1] with the following
diagnostics:
[12:27:41.437](0.024s) ok 18 - have walreceiver pid 637143
[12:30:42.564](181.127s) not ok 19 - walsender termination logged
[12:30:42.564](0.000s)
[12:30:42.564](0.000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.
019_replslot_limit_primary3.log:
2024-12-13 12:27:40.912 ACDT [637093:7] LOG: checkpoint starting: wal
...
2024-12-13 12:27:41.461 ACDT [637182:4] 019_replslot_limit.pl LOG: statement: SELECT pg_logical_emit_message(false, '',
'foo');
2024-12-13 12:27:41.462 ACDT [637182:5] 019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2024-12-13 12:27:41.463 ACDT [637182:6] 019_replslot_limit.pl LOG: disconnection: session time: 0:00:00.003
user=postgres database=postgres host=[local]
2024-12-13 12:27:41.668 ACDT [637093:8] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1
removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.756 s; sync files=0, longest=0.000 s, average=0.000 s;
distance=1024 kB, estimate=1024 kB; lsn=0/900060, redo lsn=0/700028
2024-12-13 12:27:41.668 ACDT [637093:9] LOG: checkpoints are occurring too frequently (1 second apart)
2024-12-13 12:27:41.668 ACDT [637093:10] HINT: Consider increasing the configuration parameter "max_wal_size".
2024-12-13 12:30:42.565 ACDT [637144:10] standby_3 LOG: terminating walsender process due to replication timeout
### waiting for 'walsender termination logged' timed out after 3 minutes
2024-12-13 12:30:42.565 ACDT [637144:11] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
(note that the checkpoint started at 40.912 and completed at 41.668, i.e.,
it took roughly 0.75 seconds)
I've found that such a failure can occur when the checkpoint that should
invalidate the replication slot (through the call chain: CreateCheckPoint() ->
InvalidateObsoleteReplicationSlots() -> InvalidatePossiblyObsoleteSlot())
is skipped.
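For context: slot invalidation is only reached when the checkpoint actually
proceeds to the WAL-removal step. Roughly, from memory and with argument lists
elided (they differ between branches), CreateCheckPoint() does:

	/*
	 * Invalidate slots that would otherwise retain WAL we are about to
	 * remove, then delete old log files; this code is only reached when
	 * the checkpoint was not skipped earlier.
	 */
	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
	KeepLogSeg(recptr, &_logSegNo);
	if (InvalidateObsoleteReplicationSlots(...))
	{
		/* some slots were invalidated; recalculate the removal horizon */
		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
		KeepLogSeg(recptr, &_logSegNo);
	}
	_logSegNo--;
	RemoveOldXlogFiles(_logSegNo, ...);

So if CreateCheckPoint() returns before reaching this point, the slot is never
invalidated and the termination message the test waits for is never logged.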
I've managed to reproduce this with the following sleep() added inside
src/backend/access/transam/xlog.c:
@@ -7126,6 +7126,11 @@ CreateCheckPoint(int flags)
* panic. Accordingly, exit critical section while doing it.
*/
END_CRIT_SECTION();
+	if (rand() % 2 == 0)
+	{
+		elog(LOG, "!!!CreateCheckPoint| sleeping...");
+		pg_usleep(1000000);
+	}
When running just
make -s check -C src/test/recovery/ PROVE_TESTS="t/019*"
in a loop, I got failures on iterations 3, 4, and 13. With debug logging
enabled, I can see the following:
[12:57:41.899](0,009s) ok 18 - have walreceiver pid 509480
[13:00:44.862](182,963s) not ok 19 - walsender termination logged
[13:00:44.862](0,000s)
[13:00:44.862](0,000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.
019_replslot_limit_primary3.log:
2024-12-17 12:57:41.695 UTC [][509460:11][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds
apart)
2024-12-17 12:57:41.695 UTC [][509460:12][checkpointer] HINT: Consider increasing the configuration parameter
"max_wal_size".
2024-12-17 12:57:41.695 UTC [][509460:13][checkpointer] LOG: checkpoint starting: wal
2024-12-17 12:57:41.695 UTC [][509460:14][checkpointer] LOG: !!!CreateCheckPoint| sleeping...
...
2024-12-17 12:57:41.903 UTC [][509460:15][checkpointer] DEBUG: performing replication slot checkpoint
...
2024-12-17 12:57:41.907 UTC [postgres][509495:5][client backend] [019_replslot_limit.pl] LOG: statement: SELECT
pg_logical_emit_message(false, '', 'foo');
2024-12-17 12:57:41.908 UTC [postgres][509495:6][client backend] [019_replslot_limit.pl] LOG: statement: SELECT
pg_switch_wal();
...
2024-12-17 12:57:41.947 UTC [][509460:16][checkpointer] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/C00060
oldest xid 742 latest complete 741 next xid 742)
2024-12-17 12:57:41.947 UTC [][509460:17][checkpointer] DEBUG: attempting to remove WAL segments older than log file
000000000000000000000009
2024-12-17 12:57:41.947 UTC [][509460:18][checkpointer] DEBUG: removing write-ahead log file
"000000010000000000000009"
2024-12-17 12:57:41.948 UTC [][509460:19][checkpointer] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2024-12-17 12:57:41.948 UTC [][509460:20][checkpointer] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU
buffers; 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.253 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB; lsn=0/C00060, redo lsn=0/A00028
### the checkpoint that started before pg_switch_wal() has completed
2024-12-17 12:57:41.948 UTC [][509460:21][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds
apart)
2024-12-17 12:57:41.948 UTC [][509460:22][checkpointer] HINT: Consider increasing the configuration parameter
"max_wal_size".
2024-12-17 12:57:41.948 UTC [][509460:23][checkpointer] DEBUG: checkpoint skipped because system is idle
### the checkpoint that was expected to invalidate the replication slot after pg_switch_wal() is skipped
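The "checkpoint skipped because system is idle" message comes from the
early-return check near the top of CreateCheckPoint(); a simplified excerpt,
quoted from memory, so details may differ slightly:

	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
				  CHECKPOINT_FORCE)) == 0)
	{
		/* no important WAL activity since the last checkpoint record */
		if (last_important_lsn == ControlFile->checkPoint)
		{
			END_CRIT_SECTION();
			ereport(DEBUG1,
					(errmsg_internal("checkpoint skipped because system is idle")));
			return;
		}
	}

Once this return is taken, the checkpoint never reaches
InvalidateObsoleteReplicationSlots(), so the slot stays valid and the
termination message the test is waiting for does not appear.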
The failure can be reproduced starting from commit 24043c27b (dated
2021-07-23), which added this test case.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-13%2001%3A24%3A58
Best regards,
Alexander