From:    Alexander Lakhin
Subject: 019_replslot_limit.pl might fail due to checkpoint skipped
Date:
Msg-id:  0b07ead5-a5da-445e-9698-a7d340708bdf@gmail.com
List:    pgsql-hackers
Hello hackers,

While investigating a recent buildfarm failure [1] with the following diagnostics:

[12:27:41.437](0.024s) ok 18 - have walreceiver pid 637143
[12:30:42.564](181.127s) not ok 19 - walsender termination logged
[12:30:42.564](0.000s)
[12:30:42.564](0.000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.

019_replslot_limit_primary3.log:
2024-12-13 12:27:40.912 ACDT [637093:7] LOG: checkpoint starting: wal
...
2024-12-13 12:27:41.461 ACDT [637182:4] 019_replslot_limit.pl LOG: statement: SELECT pg_logical_emit_message(false, '', 'foo');
2024-12-13 12:27:41.462 ACDT [637182:5] 019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2024-12-13 12:27:41.463 ACDT [637182:6] 019_replslot_limit.pl LOG: disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local]
2024-12-13 12:27:41.668 ACDT [637093:8] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.756 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB; lsn=0/900060, redo lsn=0/700028
2024-12-13 12:27:41.668 ACDT [637093:9] LOG: checkpoints are occurring too frequently (1 second apart)
2024-12-13 12:27:41.668 ACDT [637093:10] HINT: Consider increasing the configuration parameter "max_wal_size".
2024-12-13 12:30:42.565 ACDT [637144:10] standby_3 LOG: terminating walsender process due to replication timeout
### waiting for "walsender termination logged" timed out after 3 minutes
2024-12-13 12:30:42.565 ACDT [637144:11] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1

(Note that the checkpoint started at 12:27:40.912 and completed at 12:27:41.668, i.e., it took about 0.75 seconds.)

I've found that such a failure can occur when the checkpoint that is supposed to invalidate the replication slot (through the call chain CreateCheckPoint() -> InvalidateObsoleteReplicationSlots() -> InvalidatePossiblyObsoleteSlot()) is skipped.
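To illustrate the consequence of that skip, here is a minimal standalone model of the decision (all names, such as model_create_checkpoint(), are invented for this sketch; it is not the actual xlog.c code): as far as I can tell, a checkpoint requested without a force flag returns early when no important WAL record has been written since the previous checkpoint record, and the slot-invalidation step is reached only when the checkpoint actually runs.

/*
 * Minimal standalone model, for illustration only: names and types are
 * simplified and do not come from the PostgreSQL sources.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ModelLSN;

#define MODEL_CHECKPOINT_FORCE	0x01

/* models the position of the last completed checkpoint record */
static ModelLSN last_checkpoint_lsn = 0;

/* stand-in for the step that invalidates slots lagging behind the WAL horizon */
static void
model_invalidate_obsolete_slots(void)
{
	printf("    checking for obsolete replication slots to invalidate\n");
}

/*
 * Models the relevant part of a checkpoint: a non-forced checkpoint is
 * skipped when nothing important was written to WAL since the previous
 * checkpoint record, and on that path slot invalidation never happens.
 */
static bool
model_create_checkpoint(int flags, ModelLSN last_important_lsn,
						ModelLSN this_checkpoint_lsn)
{
	if ((flags & MODEL_CHECKPOINT_FORCE) == 0 &&
		last_important_lsn == last_checkpoint_lsn)
	{
		printf("    checkpoint skipped because system is idle\n");
		return false;
	}

	model_invalidate_obsolete_slots();
	last_checkpoint_lsn = this_checkpoint_lsn;
	printf("    checkpoint complete\n");
	return true;
}

int
main(void)
{
	/*
	 * Checkpoint #1 overlaps the test's pg_logical_emit_message() and
	 * pg_switch_wal(), sees their WAL, and completes normally.
	 */
	printf("checkpoint #1 (overlaps pg_switch_wal):\n");
	model_create_checkpoint(0, /* last important LSN */ 100,
							/* this checkpoint's record */ 200);

	/*
	 * Checkpoint #2 is the one the test waits for, but by now the last
	 * important WAL record is checkpoint #1's own record, so the attempt
	 * is skipped and the slot never gets invalidated.
	 */
	printf("checkpoint #2 (expected to invalidate the slot):\n");
	model_create_checkpoint(0, /* last important LSN */ 200,
							/* this checkpoint's record */ 300);
	return 0;
}

The second call in this model corresponds to the "checkpoint skipped because system is idle" DEBUG line in the reproduced log below.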
I've managed to reproduce this with the following sleep() added inside src/backend/access/transam/xlog.c:

@@ -7126,6 +7126,11 @@ CreateCheckPoint(int flags)
 	 * panic. Accordingly, exit critical section while doing it.
 	 */
 	END_CRIT_SECTION();
+	if (rand() % 2 == 0)
+	{
+		elog(LOG, "!!!CreateCheckPoint| sleeping...");
+		pg_usleep(1000000);
+	}

When running just
    make -s check -C src/test/recovery/ PROVE_TESTS="t/019*"
in a loop, I got failures on iterations 3, 4, and 13. With debug logging enabled, I can see the following:

[12:57:41.899](0,009s) ok 18 - have walreceiver pid 509480
[13:00:44.862](182,963s) not ok 19 - walsender termination logged
[13:00:44.862](0,000s)
[13:00:44.862](0,000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.

019_replslot_limit_primary3.log:
2024-12-17 12:57:41.695 UTC [][509460:11][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds apart)
2024-12-17 12:57:41.695 UTC [][509460:12][checkpointer] HINT: Consider increasing the configuration parameter "max_wal_size".
2024-12-17 12:57:41.695 UTC [][509460:13][checkpointer] LOG: checkpoint starting: wal
2024-12-17 12:57:41.695 UTC [][509460:14][checkpointer] LOG: !!!CreateCheckPoint| sleeping...
...
2024-12-17 12:57:41.903 UTC [][509460:15][checkpointer] DEBUG: performing replication slot checkpoint
...
2024-12-17 12:57:41.907 UTC [postgres][509495:5][client backend] [019_replslot_limit.pl] LOG: statement: SELECT pg_logical_emit_message(false, '', 'foo');
2024-12-17 12:57:41.908 UTC [postgres][509495:6][client backend] [019_replslot_limit.pl] LOG: statement: SELECT pg_switch_wal();
...
2024-12-17 12:57:41.947 UTC [][509460:16][checkpointer] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/C00060 oldest xid 742 latest complete 741 next xid 742)
2024-12-17 12:57:41.947 UTC [][509460:17][checkpointer] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000009
2024-12-17 12:57:41.947 UTC [][509460:18][checkpointer] DEBUG: removing write-ahead log file "000000010000000000000009"
2024-12-17 12:57:41.948 UTC [][509460:19][checkpointer] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2024-12-17 12:57:41.948 UTC [][509460:20][checkpointer] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.253 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB; lsn=0/C00060, redo lsn=0/A00028
### the checkpoint that started before pg_switch_wal() has completed
2024-12-17 12:57:41.948 UTC [][509460:21][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds apart)
2024-12-17 12:57:41.948 UTC [][509460:22][checkpointer] HINT: Consider increasing the configuration parameter "max_wal_size".
2024-12-17 12:57:41.948 UTC [][509460:23][checkpointer] DEBUG: checkpoint skipped because system is idle
### the checkpoint that was expected to invalidate the replication slot after pg_switch_wal() is skipped

This is reproducible starting from commit 24043c27b (dated 2021-07-23), which added this test case.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-13%2001%3A24%3A58

Best regards,
Alexander