Hello hackers,
Investigating a recent buildfarm failure [1] with the following
diagnostics:
[12:27:41.437](0.024s) ok 18 - have walreceiver pid 637143
[12:30:42.564](181.127s) not ok 19 - walsender termination logged
[12:30:42.564](0.000s)
[12:30:42.564](0.000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.
019_replslot_limit_primary3.log:
2024-12-13 12:27:40.912 ACDT [637093:7] LOG: checkpoint starting: wal
...
2024-12-13 12:27:41.461 ACDT [637182:4] 019_replslot_limit.pl LOG: statement: SELECT pg_logical_emit_message(false, '',
'foo');
2024-12-13 12:27:41.462 ACDT [637182:5] 019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2024-12-13 12:27:41.463 ACDT [637182:6] 019_replslot_limit.pl LOG: disconnection: session time: 0:00:00.003
user=postgres database=postgres host=[local]
2024-12-13 12:27:41.668 ACDT [637093:8] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1
removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.756 s; sync files=0, longest=0.000 s, average=0.000 s;
distance=1024 kB, estimate=1024 kB; lsn=0/900060, redo lsn=0/700028
2024-12-13 12:27:41.668 ACDT [637093:9] LOG: checkpoints are occurring too frequently (1 second apart)
2024-12-13 12:27:41.668 ACDT [637093:10] HINT: Consider increasing the configuration parameter "max_wal_size".
2024-12-13 12:30:42.565 ACDT [637144:10] standby_3 LOG: terminating walsender process due to replication timeout
### waiting for 'walsender termination logged' timed out after 3 minutes
2024-12-13 12:30:42.565 ACDT [637144:11] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
(note that the checkpoint started at 40.912 and completed at 41.668, i.e.,
it took roughly 0.75 seconds)
I've found that such a failure can occur when the checkpoint that should
invalidate the replication slot (through the call chain: CreateCheckPoint() ->
InvalidateObsoleteReplicationSlots() -> InvalidatePossiblyObsoleteSlot())
is skipped.
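For context: slot invalidation is only reached when the checkpoint actually
proceeds to the WAL-removal step. Roughly, from memory and with argument lists
elided (they differ between branches), CreateCheckPoint() does:

	/*
	 * Invalidate slots that would otherwise retain WAL we are about to
	 * remove, then delete old log files; this code is only reached when
	 * the checkpoint was not skipped earlier.
	 */
	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
	KeepLogSeg(recptr, &_logSegNo);
	if (InvalidateObsoleteReplicationSlots(...))
	{
		/* some slots were invalidated; recalculate the removal horizon */
		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
		KeepLogSeg(recptr, &_logSegNo);
	}
	_logSegNo--;
	RemoveOldXlogFiles(_logSegNo, ...);

So if CreateCheckPoint() returns before reaching this point, the slot is never
invalidated and the termination message the test waits for is never logged.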
I've managed to reproduce this with the following sleep() added inside
src/backend/access/transam/xlog.c:
@@ -7126,6 +7126,11 @@ CreateCheckPoint(int flags)
* panic. Accordingly, exit critical section while doing it.
*/
END_CRIT_SECTION();
+	if (rand() % 2 == 0)
+	{
+		elog(LOG, "!!!CreateCheckPoint| sleeping...");
+		pg_usleep(1000000);
+	}
When running just
make -s check -C src/test/recovery/ PROVE_TESTS="t/019*"
in a loop, I got failures on iterations 3, 4, and 13. With debug logging
enabled, I can see the following:
[12:57:41.899](0,009s) ok 18 - have walreceiver pid 509480
[13:00:44.862](182,963s) not ok 19 - walsender termination logged
[13:00:44.862](0,000s)
[13:00:44.862](0,000s) # Failed test 'walsender termination logged'
# at t/019_replslot_limit.pl line 382.
019_replslot_limit_primary3.log:
2024-12-17 12:57:41.695 UTC [][509460:11][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds
apart)
2024-12-17 12:57:41.695 UTC [][509460:12][checkpointer] HINT: Consider increasing the configuration parameter
"max_wal_size".
2024-12-17 12:57:41.695 UTC [][509460:13][checkpointer] LOG: checkpoint starting: wal
2024-12-17 12:57:41.695 UTC [][509460:14][checkpointer] LOG: !!!CreateCheckPoint| sleeping...
...
2024-12-17 12:57:41.903 UTC [][509460:15][checkpointer] DEBUG: performing replication slot checkpoint
...
2024-12-17 12:57:41.907 UTC [postgres][509495:5][client backend] [019_replslot_limit.pl] LOG: statement: SELECT
pg_logical_emit_message(false, '', 'foo');
2024-12-17 12:57:41.908 UTC [postgres][509495:6][client backend] [019_replslot_limit.pl] LOG: statement: SELECT
pg_switch_wal();
...
2024-12-17 12:57:41.947 UTC [][509460:16][checkpointer] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/C00060
oldest xid 742 latest complete 741 next xid 742)
2024-12-17 12:57:41.947 UTC [][509460:17][checkpointer] DEBUG: attempting to remove WAL segments older than log file
000000000000000000000009
2024-12-17 12:57:41.947 UTC [][509460:18][checkpointer] DEBUG: removing write-ahead log file
"000000010000000000000009"
2024-12-17 12:57:41.948 UTC [][509460:19][checkpointer] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2024-12-17 12:57:41.948 UTC [][509460:20][checkpointer] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU
buffers; 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.253 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB; lsn=0/C00060, redo lsn=0/A00028
### the checkpoint that started before pg_switch_wal() has completed
2024-12-17 12:57:41.948 UTC [][509460:21][checkpointer] LOG: checkpoints are occurring too frequently (0 seconds
apart)
2024-12-17 12:57:41.948 UTC [][509460:22][checkpointer] HINT: Consider increasing the configuration parameter
"max_wal_size".
2024-12-17 12:57:41.948 UTC [][509460:23][checkpointer] DEBUG: checkpoint skipped because system is idle
### the checkpoint that was expected to invalidate the replication slot after pg_switch_wal() is skipped
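The "checkpoint skipped because system is idle" message comes from the
early-return check near the top of CreateCheckPoint(); a simplified excerpt,
quoted from memory, so details may differ slightly:

	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
				  CHECKPOINT_FORCE)) == 0)
	{
		/* no important WAL activity since the last checkpoint record */
		if (last_important_lsn == ControlFile->checkPoint)
		{
			END_CRIT_SECTION();
			ereport(DEBUG1,
					(errmsg_internal("checkpoint skipped because system is idle")));
			return;
		}
	}

Once this return is taken, the checkpoint never reaches
InvalidateObsoleteReplicationSlots(), so the slot stays valid and the
termination message the test is waiting for does not appear.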
The failure can be reproduced starting from commit 24043c27b (dated
2021-07-23), which added this test case.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-13%2001%3A24%3A58
Best regards,
Alexander