Thread: Force the old transactions logs cleanup even if checkpoint is skipped
Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi, hackers!

I've stumbled into an interesting problem. Currently, if Postgres has nothing
to write, it would skip the checkpoint creation defined by the checkpoint
timeout setting. However, we might face a temporary archiving problem (for
example, some network issues) that might lead to a pile of wal files stuck in
pg_wal. After this temporary issue has gone, we would still be unable to
archive them since we effectively skip the checkpoint because we have nothing
to write.

That might lead to a problem - suppose you've run out of disk space because of
the temporary failure of the archiver. After this temporary failure has gone,
Postgres would be unable to recover from it automatically and will require
human attention to initiate a CHECKPOINT call.

I suggest changing this behavior by trying to clean up the old WAL even if we
skip the main checkpoint routine. I've attached the patch that does exactly
that.

What do you think?

To reproduce the issue, you might repeat the following steps:

1. Init Postgres:

pg_ctl initdb -D /Users/usernamedt/test_archiver

2. Add the archiver script to simulate failure:

➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash
false

3. Then alter the PostgreSQL conf:

archive_mode = on
checkpoint_timeout = 30s
archive_command = /Users/usernamedt/command.sh
log_min_messages = debug1

4. Then start Postgres:

/usr/local/pgsql/bin/pg_ctl -D /Users/usernamedt/test_archiver -l logfile start

5. Insert some data:

pgbench -i -s 30 -d postgres

6. Trigger checkpoint to flush all data:

psql -c "checkpoint;"

7. Alter the archiver script to simulate the end of archiver issues:

➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash
true

8. Check that the WAL files are actually archived but not removed:

➜  ~ ls -lha /Users/usernamedt/test_archiver/pg_wal/archive_status | head
total 0
drwx------@ 48 usernamedt  LD\Domain Users   1.5K Oct 17 17:44 .
drwx------@ 50 usernamedt  LD\Domain Users   1.6K Oct 17 17:43 ..
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:42 000000010000000000000040.done
...
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:43 00000001000000000000006D.done

2023-10-17 18:03:44.621 +04 [71737] DEBUG:  checkpoint skipped because system is idle

Thanks,

Daniil Zakhlystov
Attachment
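For context on why nothing gets removed in that state: the skip decision
happens early in CreateCheckPoint(), before execution ever reaches the
RemoveOldXlogFiles() call that recycles already-archived segments. Roughly
(paraphrased from src/backend/access/transam/xlog.c; exact code varies by
version):

    /*
     * Paraphrased skip check from CreateCheckPoint().  When no "important"
     * WAL has been written since the previous checkpoint, the function
     * returns early, so the RemoveOldXlogFiles() call further down is never
     * reached and already-archived segments stay in pg_wal.
     */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (last_important_lsn == ControlFile->checkPoint)
        {
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg_internal("checkpoint skipped because system is idle")));
            return;
        }
    }

This is the path that produces the "checkpoint skipped because system is idle"
DEBUG line shown in step 8 above.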
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Shlok Kyal
Date:
Hi,

I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):

test: postgresql:recovery / recovery/019_replslot_limit

# test failed
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.

I tried to test it locally and this test is timing out on my local machine as
well.

Thanks
Shlok Kumar Kyal
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi!

Thanks for your review. I've actually missed the logic to retain some WALs
required for replication slots or the wal_keep_size setting. I've attached the
updated version of the patch with an additional call to KeepLogSeg(). Now it
passed all the tests, at least in my fork
(https://cirrus-ci.com/build/4770244019879936).

Daniil Zakhlystov

________________________________________
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Sent: Thursday, November 2, 2023 1:25 PM
To: Zakhlystov, Daniil (Nebius)
Cc: amborodin@acm.org; pgsql-hackers@lists.postgresql.org; Mokrushin, Mikhail (Nebius)
Subject: Re: Force the old transactions logs cleanup even if checkpoint is skipped

CAUTION: This email originated from outside mail organization. Do not click
links or open attachments unless you recognize the sender.

Hi,

I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):

test: postgresql:recovery / recovery/019_replslot_limit

# test failed
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.

I tried to test it locally and this test is timing out on my local machine as
well.

Thanks
Shlok Kumar Kyal
Attachment
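To make the v2 approach concrete, here is a minimal sketch of the shape of the
logic. This is not the attached patch itself; it assumes the helpers in xlog.c
(KeepLogSeg() and RemoveOldXlogFiles() are static there, so the code would
have to live in that file), and the signatures shown approximate recent
PostgreSQL versions:

    /*
     * Sketch only: on a skipped checkpoint, still try to recycle or remove
     * segments that are already archived and no longer needed.  The horizon
     * starts at the last "important" LSN and is clamped by KeepLogSeg() so
     * that replication slots and wal_keep_size are honoured.
     */
    static void
    CleanupOldWalOnSkippedCheckpoint(XLogRecPtr last_important_lsn,
                                     TimeLineID insertTLI)
    {
        XLogSegNo   segno;

        /* Segment that still contains WAL we must keep */
        XLByteToSeg(last_important_lsn, segno, wal_segment_size);

        /* Pull the horizon back for replication slots and wal_keep_size */
        KeepLogSeg(last_important_lsn, &segno);

        /* Never remove the segment holding last_important_lsn itself */
        if (segno > 0)
            segno--;

        /*
         * With archiving enabled, RemoveOldXlogFiles() skips segments that
         * have not been archived yet (no .done status file), so unarchived
         * WAL is left alone.
         */
        RemoveOldXlogFiles(segno, last_important_lsn, last_important_lsn,
                           insertTLI);
    }

The caller would pass GetLastImportantRecPtr() as last_important_lsn, the same
value CreateCheckPoint() already uses when deciding whether to skip.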
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 02:09:21PM +0000, Zakhlystov, Daniil (Nebius) wrote:
> I've stumbled into an interesting problem. Currently, if Postgres
> has nothing to write, it would skip the checkpoint creation defined
> by the checkpoint timeout setting. However, we might face a
> temporary archiving problem (for example, some network issues) that
> might lead to a pile of wal files stuck in pg_wal. After this
> temporary issue has gone, we would still be unable to archive them
> since we effectively skip the checkpoint because we have nothing to
> write.

I am not sure to understand your last sentence here. Once the
archiver is back up, do you mean that the WAL segments that were not
previously archived are still not archived? Or do you mean that
because of a succession of skipped checkpoints we are just unable to
remove them from pg_wal?

> That might lead to a problem - suppose you've run out of disk space
> because of the temporary failure of the archiver. After this
> temporary failure has gone, Postgres would be unable to recover from
> it automatically and will require human attention to initiate a
> CHECKPOINT call.
>
> I suggest changing this behavior by trying to clean up the old WAL
> even if we skip the main checkpoint routine. I've attached the patch
> that does exactly that.
>
> What do you think?

I am not convinced that this is worth the addition in the skipped
path. If your system is idle and a set of checkpoints is skipped, the
data folder is not going to be under extra space pressure because of
database activity (okay, unlogged tables, even if these would generate
some WAL for init pages), because there is nothing happening in it
with no "important" WAL generated. Note that the backend is very
unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.

More to the point: what's the origin of the disk space issues? System
logs, unlogged tables or something else? It is usually a good
practice to move logs to a different partition. At the end, it sounds
to me that removing segments more aggressively is just kicking the can
elsewhere, without taking care of the origin of the disk issues.
--
Michael
Attachment
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi! Thanks for your review.

> I am not sure to understand your last sentence here. Once the
> archiver is back up, do you mean that the WAL segments that were not
> previously archived are still not archived? Or do you mean that
> because of a succession of skipped checkpoints we are just unable to
> remove them from pg_wal?

Yes, the latter is correct - we are unable to clean up the already archived
WALs due to the checkpoint being skipped.

> I am not convinced that this is worth the addition in the skipped
> path. If your system is idle and a set of checkpoints is skipped, the
> data folder is not going to be under extra space pressure because of
> database activity (okay, unlogged tables, even if these would generate
> some WAL for init pages), because there is nothing happening in it
> with no "important" WAL generated. Note that the backend is very
> unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>
> More to the point: what's the origin of the disk space issues? System
> logs, unlogged tables or something else? It is usually a good
> practice to move logs to a different partition. At the end, it sounds
> to me that removing segments more aggressively is just kicking the can
> elsewhere, without taking care of the origin of the disk issues.

This problem arises when disk space issues are caused by temporarily failed
archiving. As a result, pg_wal becomes filled with WALs. This situation leads
to Postgres being unable to perform any write operations since there is no
free disk space left. Usually, cloud providers switch the cluster to a
Read-Only mode if there is less than 3-4% of the available disk space left,
but this also does not resolve the problem.

The actual problem is that after archiving starts working normally again,
Postgres is unable to free the accumulated WAL and switch back to Read-Write
mode due to the checkpoint being skipped, leading to a vicious circle.
However, nothing prevents Postgres from exiting such situations on its own.
This patch addresses this specific behavior, enabling Postgres to resolve such
situations autonomously.

Thank you,
Daniil Zakhlystov
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Michael Paquier
Date:
On Wed, Nov 08, 2023 at 12:44:09PM +0000, Zakhlystov, Daniil (Nebius) wrote:
>> I am not sure to understand your last sentence here. Once the
>> archiver is back up, do you mean that the WAL segments that were not
>> previously archived are still not archived? Or do you mean that
>> because of a succession of skipped checkpoints we are just unable to
>> remove them from pg_wal?
>
> Yes, the latter is correct - we are unable to clean up the already archived
> WALs due to the checkpoint being skipped.

Yes, theoretically you could face this situation if you have irregular
WAL activity with cycles where nothing happens, and an archive command
that keeps failing while there is WAL generated but works while WAL is
not generated.

>> I am not convinced that this is worth the addition in the skipped
>> path. If your system is idle and a set of checkpoints is skipped, the
>> data folder is not going to be under extra space pressure because of
>> database activity (okay, unlogged tables, even if these would generate
>> some WAL for init pages), because there is nothing happening in it
>> with no "important" WAL generated. Note that the backend is very
>> unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>
>> More to the point: what's the origin of the disk space issues? System
>> logs, unlogged tables or something else? It is usually a good
>> practice to move logs to a different partition. At the end, it sounds
>> to me that removing segments more aggressively is just kicking the can
>> elsewhere, without taking care of the origin of the disk issues.
>
> This problem arises when disk space issues are caused by temporarily failed
> archiving. As a result, pg_wal becomes filled with WALs. This situation
> leads to Postgres being unable to perform any write operations since there
> is no free disk space left. Usually, cloud providers switch the cluster to
> a Read-Only mode if there is less than 3-4% of the available disk space
> left, but this also does not resolve the problem.
>
> The actual problem is that after archiving starts working normally again,
> Postgres is unable to free the accumulated WAL and switch back to
> Read-Write mode due to the checkpoint being skipped, leading to a vicious
> circle. However, nothing prevents Postgres from exiting such situations on
> its own. This patch addresses this specific behavior, enabling Postgres to
> resolve such situations autonomously.

Yep, but it does not really solve your disk space issues in a reliable
way. I am not really convinced that this is worth complicating the
skipped path for this goal. In my experience, I've seen complaints
where WAL archiving bloat was coming from the archive command not able
to keep up with the amount generated by the backend, particularly
because the command invocation was taking longer than it takes to
generate a new segment. Even if there is a hole of activity in the
server, if too much WAL has been generated it may not be enough to
catch up, depending on the number of segments that need to be
processed. Others are free to chime in with extra opinions, of course.

While on it, I think that your patch would cause incorrect and early
removal of segments. It computes the name of the last segment to
remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
that it ignores any WAL retention required by replication slots or
wal_keep_size. And this causes the calculation of an incorrect segno
horizon.
--
Michael
Attachment
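For reference, this is the horizon computation a non-skipped checkpoint
performs near the end of CreateCheckPoint(), paraphrased from xlog.c (details
vary by version); the KeepLogSeg() clamp is the step the reviewed patch was
said to be missing:

    /*
     * Approximate shape of the normal removal path: RedoRecPtr is the
     * checkpoint's redo pointer and recptr is the LSN of the checkpoint
     * record just written.  KeepLogSeg() pulls the horizon back to honour
     * replication slots and wal_keep_size before anything is removed.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                       checkPoint.ThisTimeLineID);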
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hello,

> On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
>
> I am not really convinced that this is worth complicating the skipped
> path for this goal. In my experience, I've seen complaints where WAL
> archiving bloat was coming from the archive command not able to keep
> up with the amount generated by the backend, particularly because the
> command invocation was taking longer than it takes to generate a new
> segment. Even if there is a hole of activity in the server, if too
> much WAL has been generated it may not be enough to catch up, depending
> on the number of segments that need to be processed. Others are free
> to chime in with extra opinions, of course.

I agree that there might be multiple reasons for pg_wal bloat. Please note
that I am not addressing the WAL archiving issue at all. My proposal is to add
a small improvement to the WAL cleanup routine for WALs that have already been
archived successfully, to free the disk space.

Yes, it might not be a common case, but it is a fairly realistic one. It
occurred multiple times in our production when we had temporary issues with
archiving. This small complication of the skipped path will help Postgres
return to a normal operational state without any human operator / external
control routine intervention.

> On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
>
> While on it, I think that your patch would cause incorrect and early
> removal of segments. It computes the name of the last segment to
> remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
> that it ignores any WAL retention required by replication slots or
> wal_keep_size. And this causes the calculation of an incorrect segno
> horizon.

Please check the latest patch version, I believe that it has already been
fixed there.

Thanks,
Daniil Zakhlystov
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Andres Freund
Date:
Hi,

On 2023-11-09 11:50:10 +0000, Zakhlystov, Daniil (Nebius) wrote:
> > On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > I am not really convinced that this is worth complicating the skipped
> > path for this goal. In my experience, I've seen complaints where WAL
> > archiving bloat was coming from the archive command not able to keep
> > up with the amount generated by the backend, particularly because the
> > command invocation was taking longer than it takes to generate a new
> > segment. Even if there is a hole of activity in the server, if too
> > much WAL has been generated it may not be enough to catch up, depending
> > on the number of segments that need to be processed. Others are free
> > to chime in with extra opinions, of course.
>
> I agree that there might be multiple reasons for pg_wal bloat. Please note
> that I am not addressing the WAL archiving issue at all. My proposal is to
> add a small improvement to the WAL cleanup routine for WALs that have
> already been archived successfully, to free the disk space.
>
> Yes, it might not be a common case, but it is a fairly realistic one. It
> occurred multiple times in our production when we had temporary issues
> with archiving. This small complication of the skipped path will help
> Postgres return to a normal operational state without any human operator /
> external control routine intervention.

I agree that the scenario is worth addressing - it's quite a nasty situation.
But I'm not sure this is the way to address it.

If a checkpoint does have to happen, we might not get to the point of removing
the old segments, because we might fail to emit the WAL record due to running
out of space. And if that doesn't happen - do we really want to wait till a
checkpoint finishes to free up space?

What if we instead made the archiver delete WAL files after archiving, if
they're old enough? Some care would be needed to avoid checkpointer and
archiver trampling on each other, but that doesn't seem too hard.

Greetings,

Andres Freund
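To make the alternative concrete, here is a purely hypothetical sketch of what
such a check could look like on the archiver side. None of this exists in
pgarch.c today, and WalSegIsSafeToRemove() is a made-up placeholder for the
checkpointer/slot coordination mentioned above; XLogFromFileName(),
XLogFilePath() and durable_unlink() are existing helpers:

    /*
     * Hypothetical: after a segment has been archived successfully, remove
     * it directly from pg_wal if it is old enough that neither checkpoints,
     * replication slots nor wal_keep_size still need it.
     */
    static void
    pgarch_maybe_remove_segment(const char *xlogfname)
    {
        TimeLineID  tli;
        XLogSegNo   segno;
        char        path[MAXPGPATH];

        XLogFromFileName(xlogfname, &tli, &segno, wal_segment_size);

        /* Made-up retention check standing in for the needed coordination */
        if (!WalSegIsSafeToRemove(segno))
            return;

        XLogFilePath(path, tli, segno, wal_segment_size);
        if (durable_unlink(path, LOG) == 0)
            elog(DEBUG1, "removed archived WAL segment \"%s\"", xlogfname);
    }

The tricky part, as noted, is defining WalSegIsSafeToRemove() without racing
against the checkpointer's own RemoveOldXlogFiles() pass.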