Thread: Force the old transactions logs cleanup even if checkpoint is skipped
Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi, hackers!

I've stumbled into an interesting problem. Currently, if Postgres has nothing
to write, it would skip the checkpoint creation defined by the checkpoint
timeout setting. However, we might face a temporary archiving problem (for
example, some network issues) that might lead to a pile of wal files stuck in
pg_wal. After this temporary issue has gone, we would still be unable to
archive them since we effectively skip the checkpoint because we have nothing
to write.

That might lead to a problem - suppose you've run out of disk space because of
the temporary failure of the archiver. After this temporary failure has gone,
Postgres would be unable to recover from it automatically and will require
human attention to initiate a CHECKPOINT call.

I suggest changing this behavior by trying to clean up the old WAL even if we
skip the main checkpoint routine. I've attached the patch that does exactly
that.

What do you think?

To reproduce the issue, you might repeat the following steps:

1. Init Postgres:

pg_ctl initdb -D /Users/usernamedt/test_archiver

2. Add the archiver script to simulate failure:

➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash
false

3. Then alter the PostgreSQL conf:

archive_mode = on
checkpoint_timeout = 30s
archive_command = /Users/usernamedt/command.sh
log_min_messages = debug1

4. Then start Postgres:

/usr/local/pgsql/bin/pg_ctl -D /Users/usernamedt/test_archiver -l logfile start

5. Insert some data:

pgbench -i -s 30 -d postgres

6. Trigger checkpoint to flush all data:

psql -c "checkpoint;"

7. Alter the archiver script to simulate the end of archiver issues:

➜  ~ cat /Users/usernamedt/command.sh
#!/bin/bash
true

8. Check that the WAL files are actually archived but not removed:

➜  ~ ls -lha /Users/usernamedt/test_archiver/pg_wal/archive_status | head
total 0
drwx------@ 48 usernamedt  LD\Domain Users   1.5K Oct 17 17:44 .
drwx------@ 50 usernamedt  LD\Domain Users   1.6K Oct 17 17:43 ..
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:42 000000010000000000000040.done
...
-rw-------@  1 usernamedt  LD\Domain Users     0B Oct 17 17:43 00000001000000000000006D.done

2023-10-17 18:03:44.621 +04 [71737] DEBUG:  checkpoint skipped because system is idle

Thanks,

Daniil Zakhlystov
Attachment
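For context on why nothing gets removed in that state: the skip decision
happens early in CreateCheckPoint(), before execution ever reaches the
RemoveOldXlogFiles() call that recycles already-archived segments. Roughly
(paraphrased from src/backend/access/transam/xlog.c; exact code varies by
version):

    /*
     * Paraphrased skip check from CreateCheckPoint().  When no "important"
     * WAL has been written since the previous checkpoint, the function
     * returns early, so the RemoveOldXlogFiles() call further down is never
     * reached and already-archived segments stay in pg_wal.
     */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (last_important_lsn == ControlFile->checkPoint)
        {
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg_internal("checkpoint skipped because system is idle")));
            return;
        }
    }

This is the path that produces the "checkpoint skipped because system is idle"
DEBUG line shown in step 8 above.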
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Shlok Kyal
Date:
Hi,

I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):

test: postgresql:recovery / recovery/019_replslot_limit

# test failed
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.

I tried to test it locally and this test is timing out on my local machine as
well.

Thanks
Shlok Kumar Kyal
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi!

Thanks for your review. I've actually missed the logic to retain some WALs
required for replication slots or the wal_keep_size setting. I've attached the
updated version of the patch with an additional call to KeepLogSeg(). Now it
passed all the tests, at least in my fork
(https://cirrus-ci.com/build/4770244019879936).

Daniil Zakhlystov

________________________________________
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Sent: Thursday, November 2, 2023 1:25 PM
To: Zakhlystov, Daniil (Nebius)
Cc: amborodin@acm.org; pgsql-hackers@lists.postgresql.org; Mokrushin, Mikhail (Nebius)
Subject: Re: Force the old transactions logs cleanup even if checkpoint is skipped

CAUTION: This email originated from outside mail organization. Do not click
links or open attachments unless you recognize the sender.

Hi,

I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):

test: postgresql:recovery / recovery/019_replslot_limit

# test failed
----------------------------------- stderr -----------------------------------
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.

I tried to test it locally and this test is timing out on my local machine as
well.

Thanks
Shlok Kumar Kyal
Attachment
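To make the v2 approach concrete, here is a minimal sketch of the shape of the
logic. This is not the attached patch itself; it assumes the helpers in xlog.c
(KeepLogSeg() and RemoveOldXlogFiles() are static there, so the code would
have to live in that file), and the signatures shown approximate recent
PostgreSQL versions:

    /*
     * Sketch only: on a skipped checkpoint, still try to recycle or remove
     * segments that are already archived and no longer needed.  The horizon
     * starts at the last "important" LSN and is clamped by KeepLogSeg() so
     * that replication slots and wal_keep_size are honoured.
     */
    static void
    CleanupOldWalOnSkippedCheckpoint(XLogRecPtr last_important_lsn,
                                     TimeLineID insertTLI)
    {
        XLogSegNo   segno;

        /* Segment that still contains WAL we must keep */
        XLByteToSeg(last_important_lsn, segno, wal_segment_size);

        /* Pull the horizon back for replication slots and wal_keep_size */
        KeepLogSeg(last_important_lsn, &segno);

        /* Never remove the segment holding last_important_lsn itself */
        if (segno > 0)
            segno--;

        /*
         * With archiving enabled, RemoveOldXlogFiles() skips segments that
         * have not been archived yet (no .done status file), so unarchived
         * WAL is left alone.
         */
        RemoveOldXlogFiles(segno, last_important_lsn, last_important_lsn,
                           insertTLI);
    }

The caller would pass GetLastImportantRecPtr() as last_important_lsn, the same
value CreateCheckPoint() already uses when deciding whether to skip.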
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Michael Paquier
Date:
On Tue, Oct 17, 2023 at 02:09:21PM +0000, Zakhlystov, Daniil (Nebius) wrote:
> I've stumbled into an interesting problem. Currently, if Postgres
> has nothing to write, it would skip the checkpoint creation defined
> by the checkpoint timeout setting. However, we might face a
> temporary archiving problem (for example, some network issues) that
> might lead to a pile of wal files stuck in pg_wal. After this
> temporary issue has gone, we would still be unable to archive them
> since we effectively skip the checkpoint because we have nothing to
> write.

I am not sure to understand your last sentence here. Once the
archiver is back up, do you mean that the WAL segments that were not
previously archived are still not archived? Or do you mean that
because of a succession of skipped checkpoints we are just unable to
remove them from pg_wal?

> That might lead to a problem - suppose you've run out of disk space
> because of the temporary failure of the archiver. After this
> temporary failure has gone, Postgres would be unable to recover from
> it automatically and will require human attention to initiate a
> CHECKPOINT call.
>
> I suggest changing this behavior by trying to clean up the old WAL
> even if we skip the main checkpoint routine. I've attached the patch
> that does exactly that.
>
> What do you think?

I am not convinced that this is worth the addition in the skipped
path. If your system is idle and a set of checkpoints is skipped, the
data folder is not going to be under extra space pressure because of
database activity (okay, unlogged tables, even if these would generate
some WAL for init pages), because there is nothing happening in it
with no "important" WAL generated. Note that the backend is very
unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.

More to the point: what's the origin of the disk space issues? System
logs, unlogged tables or something else? It is usually a good
practice to move logs to a different partition. At the end, it sounds
to me that removing segments more aggressively is just kicking the can
elsewhere, without taking care of the origin of the disk issues.
--
Michael
Attachment
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hi! Thanks for your review.

> I am not sure to understand your last sentence here. Once the
> archiver is back up, do you mean that the WAL segments that were not
> previously archived are still not archived? Or do you mean that
> because of a succession of skipped checkpoints we are just unable to
> remove them from pg_wal?

Yes, the latter is correct - we are unable to clean up the already archived
WALs due to the checkpoint being skipped.

> I am not convinced that this is worth the addition in the skipped
> path. If your system is idle and a set of checkpoints is skipped, the
> data folder is not going to be under extra space pressure because of
> database activity (okay, unlogged tables, even if these would generate
> some WAL for init pages), because there is nothing happening in it
> with no "important" WAL generated. Note that the backend is very
> unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>
> More to the point: what's the origin of the disk space issues? System
> logs, unlogged tables or something else? It is usually a good
> practice to move logs to a different partition. At the end, it sounds
> to me that removing segments more aggressively is just kicking the can
> elsewhere, without taking care of the origin of the disk issues.

This problem arises when disk space issues are caused by temporarily failed
archiving. As a result, pg_wal becomes filled with WALs. This situation leads
to Postgres being unable to perform any write operations since there is no
free disk space left. Usually, cloud providers switch the cluster to a
Read-Only mode if there is less than 3-4% of the available disk space left,
but this also does not resolve the problem.

The actual problem is that after archiving starts working normally again,
Postgres is unable to free the accumulated WAL and switch back to Read-Write
mode due to the checkpoint being skipped, leading to a vicious circle.
However, nothing prevents Postgres from exiting such situations on its own.
This patch addresses this specific behavior, enabling Postgres to resolve such
situations autonomously.

Thank you,
Daniil Zakhlystov
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Michael Paquier
Date:
On Wed, Nov 08, 2023 at 12:44:09PM +0000, Zakhlystov, Daniil (Nebius) wrote:
>> I am not sure to understand your last sentence here. Once the
>> archiver is back up, do you mean that the WAL segments that were not
>> previously archived are still not archived? Or do you mean that
>> because of a succession of skipped checkpoints we are just unable to
>> remove them from pg_wal?
>
> Yes, the latter is correct - we are unable to clean up the already archived
> WALs due to the checkpoint being skipped.

Yes, theoretically you could face this situation if you have irregular
WAL activity with cycles where nothing happens, and an archive command
that keeps failing while there is WAL generated but works while WAL is
not generated.

>> I am not convinced that this is worth the addition in the skipped
>> path. If your system is idle and a set of checkpoints is skipped, the
>> data folder is not going to be under extra space pressure because of
>> database activity (okay, unlogged tables, even if these would generate
>> some WAL for init pages), because there is nothing happening in it
>> with no "important" WAL generated. Note that the backend is very
>> unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT.
>
>> More to the point: what's the origin of the disk space issues? System
>> logs, unlogged tables or something else? It is usually a good
>> practice to move logs to a different partition. At the end, it sounds
>> to me that removing segments more aggressively is just kicking the can
>> elsewhere, without taking care of the origin of the disk issues.
>
> This problem arises when disk space issues are caused by temporarily failed
> archiving. As a result, pg_wal becomes filled with WALs. This situation
> leads to Postgres being unable to perform any write operations since there
> is no free disk space left. Usually, cloud providers switch the cluster to
> a Read-Only mode if there is less than 3-4% of the available disk space
> left, but this also does not resolve the problem.
>
> The actual problem is that after archiving starts working normally again,
> Postgres is unable to free the accumulated WAL and switch back to
> Read-Write mode due to the checkpoint being skipped, leading to a vicious
> circle. However, nothing prevents Postgres from exiting such situations on
> its own. This patch addresses this specific behavior, enabling Postgres to
> resolve such situations autonomously.

Yep, but it does not really solve your disk space issues in a reliable
way. I am not really convinced that this is worth complicating the
skipped path for this goal. In my experience, I've seen complaints
where WAL archiving bloat was coming from the archive command not able
to keep up with the amount generated by the backend, particularly
because the command invocation was taking longer than it takes to
generate a new segment. Even if there is a hole of activity in the
server, if too much WAL has been generated it may not be enough to
catch up, depending on the number of segments that need to be
processed. Others are free to chime in with extra opinions, of course.

While on it, I think that your patch would cause incorrect and early
removal of segments. It computes the name of the last segment to
remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
that it ignores any WAL retention required by replication slots or
wal_keep_size. And this causes the calculation of an incorrect segno
horizon.
--
Michael
Attachment
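For reference, this is the horizon computation a non-skipped checkpoint
performs near the end of CreateCheckPoint(), paraphrased from xlog.c (details
vary by version); the KeepLogSeg() clamp is the step the reviewed patch was
said to be missing:

    /*
     * Approximate shape of the normal removal path: RedoRecPtr is the
     * checkpoint's redo pointer and recptr is the LSN of the checkpoint
     * record just written.  KeepLogSeg() pulls the horizon back to honour
     * replication slots and wal_keep_size before anything is removed.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                       checkPoint.ThisTimeLineID);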
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
"Zakhlystov, Daniil (Nebius)"
Date:
Hello,

> On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
>
> I am not really convinced that this is worth complicating the skipped
> path for this goal. In my experience, I've seen complaints where WAL
> archiving bloat was coming from the archive command not able to keep
> up with the amount generated by the backend, particularly because the
> command invocation was taking longer than it takes to generate a new
> segment. Even if there is a hole of activity in the server, if too
> much WAL has been generated it may not be enough to catch up, depending
> on the number of segments that need to be processed. Others are free
> to chime in with extra opinions, of course.

I agree that there might be multiple reasons for pg_wal bloat. Please note
that I am not addressing the WAL archiving issue at all. My proposal is to add
a small improvement to the WAL cleanup routine for WALs that have already been
archived successfully, to free the disk space.

Yes, it might not be a common case, but it is a fairly realistic one. It
occurred multiple times in our production when we had temporary issues with
archiving. This small complication of the skipped path will help Postgres
return to a normal operational state without any human operator / external
control routine intervention.

> On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
>
> While on it, I think that your patch would cause incorrect and early
> removal of segments. It computes the name of the last segment to
> remove based on last_important_lsn, ignoring KeepLogSeg(), meaning
> that it ignores any WAL retention required by replication slots or
> wal_keep_size. And this causes the calculation of an incorrect segno
> horizon.

Please check the latest patch version, I believe that it has already been
fixed there.

Thanks,
Daniil Zakhlystov
Re: Force the old transactions logs cleanup even if checkpoint is skipped
From
Andres Freund
Date:
Hi,

On 2023-11-09 11:50:10 +0000, Zakhlystov, Daniil (Nebius) wrote:
> > On 9 Nov 2023, at 01:30, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > I am not really convinced that this is worth complicating the skipped
> > path for this goal. In my experience, I've seen complaints where WAL
> > archiving bloat was coming from the archive command not able to keep
> > up with the amount generated by the backend, particularly because the
> > command invocation was taking longer than it takes to generate a new
> > segment. Even if there is a hole of activity in the server, if too
> > much WAL has been generated it may not be enough to catch up, depending
> > on the number of segments that need to be processed. Others are free
> > to chime in with extra opinions, of course.
>
> I agree that there might be multiple reasons for pg_wal bloat. Please note
> that I am not addressing the WAL archiving issue at all. My proposal is to
> add a small improvement to the WAL cleanup routine for WALs that have
> already been archived successfully, to free the disk space.
>
> Yes, it might not be a common case, but it is a fairly realistic one. It
> occurred multiple times in our production when we had temporary issues
> with archiving. This small complication of the skipped path will help
> Postgres return to a normal operational state without any human operator /
> external control routine intervention.

I agree that the scenario is worth addressing - it's quite a nasty situation.
But I'm not sure this is the way to address it.

If a checkpoint does have to happen, we might not get to the point of removing
the old segments, because we might fail to emit the WAL record due to running
out of space. And if that doesn't happen - do we really want to wait till a
checkpoint finishes to free up space?

What if we instead made the archiver delete WAL files after archiving, if
they're old enough? Some care would be needed to avoid checkpointer and
archiver trampling on each other, but that doesn't seem too hard.

Greetings,

Andres Freund
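To make the alternative concrete, here is a purely hypothetical sketch of what
such a check could look like on the archiver side. None of this exists in
pgarch.c today, and WalSegIsSafeToRemove() is a made-up placeholder for the
checkpointer/slot coordination mentioned above; XLogFromFileName(),
XLogFilePath() and durable_unlink() are existing helpers:

    /*
     * Hypothetical: after a segment has been archived successfully, remove
     * it directly from pg_wal if it is old enough that neither checkpoints,
     * replication slots nor wal_keep_size still need it.
     */
    static void
    pgarch_maybe_remove_segment(const char *xlogfname)
    {
        TimeLineID  tli;
        XLogSegNo   segno;
        char        path[MAXPGPATH];

        XLogFromFileName(xlogfname, &tli, &segno, wal_segment_size);

        /* Made-up retention check standing in for the needed coordination */
        if (!WalSegIsSafeToRemove(segno))
            return;

        XLogFilePath(path, tli, segno, wal_segment_size);
        if (durable_unlink(path, LOG) == 0)
            elog(DEBUG1, "removed archived WAL segment \"%s\"", xlogfname);
    }

The tricky part, as noted, is defining WalSegIsSafeToRemove() without racing
against the checkpointer's own RemoveOldXlogFiles() pass.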