On Wed, Dec 29, 2021 at 7:27 PM Stephen Frost <sfrost@snowman.net> wrote:
> > On Thu, Dec 23, 2021, at 9:58 AM, Bharath Rupireddy wrote:
> > > pg_archivecleanup currently takes a WAL file name as input to delete
> > > the WAL files prior to it [1]. As suggested by Satya (cc-ed) in
> > > pg_replslotdata thread [2], can we enhance the pg_archivecleanup to
> > > automatically detect the last checkpoint (from control file) LSN,
> > > calculate the lowest restart_lsn required by the replication slots, if
> > > any (by reading the replication slot info from pg_logical directory),
> > > archive the unneeded (an archive_command similar to that of the one
> > > provided in the server config can be provided as an input) WAL files
> > > before finally deleting them? Making pg_archivecleanup tool as an
> > > end-to-end solution will help greatly in disk full situations because
> > > of WAL files growth (inactive replication slots, archive command
> > > failures, infrequent checkpoint etc.).
>
> The overall idea of having a tool for this isn't a bad idea, but ..
>
> > pg_archivecleanup is a tool to remove WAL files from the *archive*. Are you
> > suggesting to use it for removing files from pg_wal directory too? No, thanks.
>
> We definitely shouldn't have it be part of pg_archivecleanup for the
> simple reason that it'll be really confusing and almost certainly will
> be mis-used.
+1
> > WAL files are a key component for backup and replication. Hence, you cannot
> > deliberately allow a tool to remove WAL files from PGDATA. IMO this issue
> > wouldn't occur if you have a monitoring system and alerts and someone to keep
> > an eye on it. If the disk full situation was caused by a failed archive command
> > or a disconnected standby, it is easy to figure out; the fix is simple.
>
> This is perhaps a bit far- PG does, in fact, remove WAL files from
> PGDATA. Having a tool which will do this safely when the server isn't
> able to be brought online due to lack of disk space would certainly be
> helpful rather frequently. I agree that monitoring and alerting are
> things that everyone should implement and pay attention to, but that
> doesn't happen and instead people end up just blowing away pg_wal and
> corrupting their database when, had a tool existed, they could have
> avoided that happening and brought the system back online in relatively
> short order without any data loss.
Thanks. Yes, the end-to-end tool is helpful in rather eventual
situations and having it in the core is more helpful instead of every
postgres vendor developing their own solution and many times it's hard
to get it right. Also, I agree to not club this idea with
pg_archviecleanup. How about having a new tool like
pg_walcleanup/pg_xlogcleanup helping the developers/admins/users in
eventual situations?
Regards,
Bharath Rupireddy.