Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary
Msg-id 20211229135710.GH15820@tamriel.snowman.net
In response to Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary  ("Euler Taveira" <euler@eulerto.com>)
Responses Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary
List pgsql-hackers
Greetings,

* Euler Taveira (euler@eulerto.com) wrote:
> On Thu, Dec 23, 2021, at 9:58 AM, Bharath Rupireddy wrote:
> > pg_archivecleanup currently takes a WAL file name as input to delete
> > the WAL files prior to it [1]. As suggested by Satya (cc-ed) in
> > pg_replslotdata thread [2], can we enhance the pg_archivecleanup to
> > automatically detect the last checkpoint (from control file) LSN,
> > calculate the lowest restart_lsn required by the replication slots, if
> > any (by reading the replication slot info from pg_logical directory),
> > archive the unneeded (an archive_command similar to that of the one
> > provided in the server config can be provided as an input) WAL files
> > before finally deleting them? Making pg_archivecleanup tool as an
> > end-to-end solution will help greatly in disk full situations because
> > of WAL files growth (inactive replication slots, archive command
> > failures, infrequent checkpoint etc.).

Having a tool for this isn't a bad idea overall, but ...

> pg_archivecleanup is a tool to remove WAL files from the *archive*. Are you
> suggesting to use it for removing files from pg_wal directory too? No, thanks.

We definitely shouldn't make it part of pg_archivecleanup, for the
simple reason that it would be really confusing and would almost
certainly be misused.  For my 2c, we should just remove
pg_archivecleanup entirely.

> WAL files are a key component for backup and replication. Hence, you cannot
> deliberately allow a tool to remove WAL files from PGDATA. IMO this issue
> wouldn't occur if you have a monitoring system and alerts and someone to keep
> an eye on it. If the disk full situation was caused by a failed archive command
> or a disconnected standby, it is easy to figure out; the fix is simple.

This is perhaps going a bit far: PG does, in fact, remove WAL files
from PGDATA.  A tool that does this safely when the server can't be
brought online for lack of disk space would certainly be helpful, and
rather frequently.  I agree that monitoring and alerting are things
everyone should implement and pay attention to, but that doesn't
happen, and instead people end up just blowing away pg_wal and
corrupting their database.  Had such a tool existed, they could have
avoided that and brought the system back online in relatively short
order without any data loss.
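
As a rough illustration of the bookkeeping such a tool would need (a
sketch only, not something proposed in this thread and not safe to run
against a real cluster): given the redo LSN of the last checkpoint and
the restart_lsn of each replication slot, find the oldest WAL segment
still needed and list the older ones.  The function names, the 16MB
segment size, and the use of the checkpoint's redo LSN as the cutoff
are all assumptions made for the example.  It ignores archive status,
wal_keep_size, and timeline switches, all of which a real tool would
have to handle.

```python
import re

# Assumed default; the real value comes from the cluster's control file.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN such as '0/3000060' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def segment_file_name(timeline, lsn, seg_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file containing the given LSN."""
    segs_per_xlogid = 0x100000000 // seg_size
    segno = lsn // seg_size
    return "%08X%08X%08X" % (
        timeline, segno // segs_per_xlogid, segno % segs_per_xlogid)


def removable_segments(file_names, checkpoint_redo_lsn, slot_restart_lsns,
                       timeline=1):
    """Hypothetical helper: segments strictly older than the oldest
    segment still needed by the checkpoint or by any slot."""
    needed = [parse_lsn(checkpoint_redo_lsn)]
    needed += [parse_lsn(lsn) for lsn in slot_restart_lsns]
    cutoff = segment_file_name(timeline, min(needed))
    wal_name = re.compile(r"^[0-9A-F]{24}$")  # skips .history, .partial
    # Compare without the 8-character timeline prefix.
    return sorted(name for name in file_names
                  if wal_name.match(name) and name[8:] < cutoff[8:])
```

With a checkpoint redo LSN of 0/4000028 and one slot at 0/3000060, the
slot is the limiting factor, so only segments before
000000010000000000000003 would be candidates for removal.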

Thanks,

Stephen
