Hi, hackers!
I've stumbled into an interesting problem. Currently, if Postgres has nothing to write, it skips the checkpoint
creation defined by the checkpoint_timeout setting. However, we might face a temporary archiving problem (for example,
some network issues) that leads to a pile of WAL files stuck in pg_wal. After this temporary issue has gone, the
files get archived but are still not removed from pg_wal, since we effectively skip the checkpoint because we have
nothing to write.
That might lead to a problem: suppose you've run out of disk space because of the temporary failure of the archiver.
After this temporary failure has gone, Postgres would be unable to recover from it automatically and would require human
attention to initiate a CHECKPOINT call.
I suggest changing this behavior by trying to clean up the old WAL even if we skip the main checkpoint routine. I've
attached the patch that does exactly that.
What do you think?
To reproduce the issue, you might repeat the following steps:
1. Init Postgres:
pg_ctl initdb -D /Users/usernamedt/test_archiver
2. Add the archiver script to simulate failure:
➜ ~ cat /Users/usernamedt/command.sh
#!/bin/bash
false
3. Then edit postgresql.conf:
archive_mode = on
checkpoint_timeout = 30s
archive_command = '/Users/usernamedt/command.sh'
log_min_messages = debug1
4. Then start Postgres:
/usr/local/pgsql/bin/pg_ctl -D /Users/usernamedt/test_archiver -l logfile start
5. Insert some data:
pgbench -i -s 30 -d postgres
6. Trigger checkpoint to flush all data:
psql -c "checkpoint;"
7. Alter the archiver script to simulate the end of archiver issues:
➜ ~ cat /Users/usernamedt/command.sh
#!/bin/bash
true
8. Check that the WAL files are actually archived but not removed:
➜ ~ ls -lha /Users/usernamedt/test_archiver/pg_wal/archive_status | head
total 0
drwx------@ 48 usernamedt LD\Domain Users 1.5K Oct 17 17:44 .
drwx------@ 50 usernamedt LD\Domain Users 1.6K Oct 17 17:43 ..
-rw-------@ 1 usernamedt LD\Domain Users 0B Oct 17 17:42 000000010000000000000040.done
...
-rw-------@ 1 usernamedt LD\Domain Users 0B Oct 17 17:43 00000001000000000000006D.done
At the same time, the log file shows that the checkpoint is skipped:
2023-10-17 18:03:44.621 +04 [71737] DEBUG: checkpoint skipped because system is idle
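To make the check in step 8 less manual, here is a rough sketch of a helper that counts segments which already have a .done status file but are still present in pg_wal. The function name count_archived_segments is made up for this example, and it assumes the usual layout of pg_wal/archive_status inside the data directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): count WAL segments that are
# marked as archived (<segment>.done exists in archive_status) but whose
# segment file is still sitting in pg_wal.
count_archived_segments() {
    pgdata=$1
    count=0
    for status in "$pgdata"/pg_wal/archive_status/*.done; do
        # If the glob matched nothing, the literal pattern is returned; skip it.
        [ -e "$status" ] || continue
        segment=$(basename "$status" .done)
        # Count only segments that have not been removed or recycled yet.
        if [ -e "$pgdata/pg_wal/$segment" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling count_archived_segments /Users/usernamedt/test_archiver a few times, more than checkpoint_timeout apart, should show whether the number ever goes down. If it stays the same while the log keeps saying the checkpoint was skipped, the issue is reproduced.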
Thanks,
Daniil Zakhlystov