On Tue, Oct 17, 2023 at 02:09:21PM +0000, Zakhlystov, Daniil (Nebius) wrote:
> I've stumbled into an interesting problem. Currently, if Postgres
> has nothing to write, it would skip the checkpoint creation defined
> by the checkpoint timeout setting. However, we might face a
> temporary archiving problem (for example, some network issues) that
> might lead to a pile of wal files stuck in pg_wal. After this
> temporary issue has gone, we would still be unable to archive them
> since we effectively skip the checkpoint because we have nothing to
> write.

I am not sure I understand your last sentence here. Once the
archiver is back up, do you mean that the WAL segments that were not
previously archived are still not archived? Or do you mean that,
because of a succession of skipped checkpoints, we are just unable to
remove them from pg_wal?

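
To tell these two cases apart, one could look at the status files in
pg_wal/archive_status: an entry ending in .ready is still waiting for
the archiver, while one ending in .done has been archived and is only
waiting for a checkpoint to remove or recycle it. Here is a rough
sketch of such a check. The function name summarize_archive_status
and the way the data directory is passed in are made up for
illustration, not something that exists in core:

```python
from pathlib import Path


def summarize_archive_status(pg_wal_dir):
    """Count entries in pg_wal/archive_status by archive state.

    Hypothetical helper: pg_wal_dir is the path to the pg_wal
    directory of the cluster being inspected.
    """
    status_dir = Path(pg_wal_dir) / "archive_status"
    counts = {"ready": 0, "done": 0}
    for entry in status_dir.iterdir():
        # ".ready" -> "ready", ".done" -> "done"; anything else is ignored.
        state = entry.suffix.lstrip(".")
        if state in counts:
            counts[state] += 1
    return counts
```

A large "ready" count after the archiver is back would point at the
first case, and a large "done" count with the segments still sitting
in pg_wal would point at the second.
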
> That might lead to a problem - suppose you've run out of disk space
> because of the temporary failure of the archiver. After this
> temporary failure has gone, Postgres would be unable to recover from
> it automatically and will require human attention to initiate a
> CHECKPOINT call.
>
> I suggest changing this behavior by trying to clean up the old WAL
> even if we skip the main checkpoint routine. I've attached the patch
> that does exactly that.
>
> What do you think?

I am not convinced that this is worth the addition in the skipped
path. If your system is idle and a set of checkpoints is skipped, the
data folder is not going to be under extra space pressure because of
database activity (okay, except for unlogged tables, even if these
would generate some WAL for init pages), because there is nothing
happening in it with no "important" WAL generated. Note that the
backend is very unlikely to generate WAL only marked with
XLOG_MARK_UNIMPORTANT.

More to the point: what's the origin of the disk space issues? System
logs, unlogged tables or something else? It is usually a good
practice to move logs to a different partition. In the end, it sounds
to me like removing segments more aggressively is just kicking the can
elsewhere, without taking care of the origin of the disk issues.
--
Michael