Thread: A few new options for CHECKPOINT
Hi hackers,

I've attached a patch to add a few new options to CHECKPOINT that might be useful. Specifically, it adds the FORCE, IMMEDIATE, and WAIT options. All of these options are already implemented internally, but it is not possible to adjust them for manually triggered checkpoints. The bulk of this change is dedicated to expanding the syntax of CHECKPOINT and adding detailed documentation. I've mostly followed the pattern set forth by the options recently added to VACUUM. With this patch, CHECKPOINT takes an optional set of parameters surrounded by parentheses. The new parameters are enabled by default but can be disabled by specifying FALSE, OFF, or 0.

The main purpose of this patch is to give users more control over their manually requested checkpoints or restartpoints. I suspect the most useful option is IMMEDIATE, which can help avoid checkpoint-related IO spikes. However, I didn't see any strong reason to prevent users from also adjusting FORCE and WAIT.

Nathan
Attachment
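As a rough illustration of the syntax described above (a sketch based on the description in the proposal; the exact grammar and option spellings are defined by the attached patch):

    -- Default behavior: all options enabled.
    CHECKPOINT;

    -- Equivalent to the default, with the options spelled out.
    CHECKPOINT (FORCE, IMMEDIATE, WAIT);

    -- Request a spread (non-immediate) checkpoint and return without waiting.
    CHECKPOINT (IMMEDIATE FALSE, WAIT OFF);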
From: Bossart, Nathan <bossartn@amazon.com>
> The main purpose of this patch is to give users more control over their manually requested checkpoints or restartpoints.  I suspect the most useful option is IMMEDIATE, which can help avoid checkpoint-related IO spikes.  However, I didn't see any strong reason to prevent users from also adjusting FORCE and WAIT.

I think just IMMEDIATE would suffice, too.  But could you tell us what led you to want to give users more control?  Could you give us concrete example situations where users would want to perform CHECKPOINT with options?

Regards
Takayuki Tsunakawa
On 11/24/20, 4:03 PM, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote:
> From: Bossart, Nathan <bossartn@amazon.com>
>> The main purpose of this patch is to give users more control over their manually requested checkpoints or restartpoints.  I suspect the most useful option is IMMEDIATE, which can help avoid checkpoint-related IO spikes.  However, I didn't see any strong reason to prevent users from also adjusting FORCE and WAIT.
>
> I think just IMMEDIATE would suffice, too.  But could you tell us what led you to want to give users more control?  Could you give us concrete example situations where users would want to perform CHECKPOINT with options?

It may be useful for backups taken with the "consistent snapshot" approach.  As noted in the documentation [0], running CHECKPOINT before taking the snapshot can reduce recovery time.  However, users might wish to avoid the IO spike caused by an immediate checkpoint.

Nathan

[0] https://www.postgresql.org/docs/devel/backup-file.html
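As a rough illustration of that use case (a sketch only; the snapshot command is hypothetical and depends entirely on the storage layer):

    # Reduce the WAL that a restore of the snapshot would have to replay,
    # ideally without the IO spike of an immediate checkpoint.
    psql -c "CHECKPOINT;"    # today this is always an immediate checkpoint

    # Take a storage-level snapshot that captures the data directory and WAL
    # atomically (hypothetical command, storage-specific).
    storage-snapshot create --volume pgdata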
From: Bossart, Nathan <bossartn@amazon.com>
> It may be useful for backups taken with the "consistent snapshot" approach.  As noted in the documentation [0], running CHECKPOINT before taking the snapshot can reduce recovery time.  However, users might wish to avoid the IO spike caused by an immediate checkpoint.
>
> [0] https://www.postgresql.org/docs/devel/backup-file.html

Ah, understood.  I agree that the slow or spread manual checkpoint is good to have.

Regards
Takayuki Tsunakawa
On Wed, Nov 25, 2020 at 01:07:47AM +0000, tsunakawa.takay@fujitsu.com wrote:
> From: Bossart, Nathan <bossartn@amazon.com>
>> It may be useful for backups taken with the "consistent snapshot" approach.  As noted in the documentation [0], running CHECKPOINT before taking the snapshot can reduce recovery time.  However, users might wish to avoid the IO spike caused by an immediate checkpoint.
>>
>> [0] https://www.postgresql.org/docs/devel/backup-file.html
>
> Ah, understood.  I agree that the slow or spread manual checkpoint is good to have.

I can see the use case for IMMEDIATE, but I fail to see the use cases for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous thing if disabled, as clients could pile up requests to the checkpointer for no real purpose.
--
Michael
Attachment
On 2020/11/25 13:47, Michael Paquier wrote:
> On Wed, Nov 25, 2020 at 01:07:47AM +0000, tsunakawa.takay@fujitsu.com wrote:
>> From: Bossart, Nathan <bossartn@amazon.com>
>>> It may be useful for backups taken with the "consistent snapshot" approach.  As noted in the documentation [0], running CHECKPOINT before taking the snapshot can reduce recovery time.  However, users might wish to avoid the IO spike caused by an immediate checkpoint.
>>>
>>> [0] https://www.postgresql.org/docs/devel/backup-file.html
>>
>> Ah, understood.  I agree that the slow or spread manual checkpoint is good to have.
>
> I can see the use case for IMMEDIATE, but I fail to see the use cases for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous thing if disabled, as clients could pile up requests to the checkpointer for no real purpose.

We may want to disable WAIT (or specify a wait timeout?) when doing a checkpoint with IMMEDIATE disabled, to avoid a long-running command.  OTOH, if we support disabling WAIT, I'd like to have a feature to see whether the checkpoint has completed or not.  We can do that by using log_checkpoints, but that's not convenient.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
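For reference, the log_checkpoints workaround mentioned above looks roughly like this (a sketch; the log lines indicate the general shape of the messages, not verbatim output):

    -- Make the checkpointer log start and completion messages.
    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();

    -- The server log then has to be watched for lines such as:
    --   LOG:  checkpoint starting: force wait
    --   LOG:  checkpoint complete ...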
On Wednesday, 25.11.2020, at 13:47 +0900, Michael Paquier wrote:
> I can see the use case for IMMEDIATE, but I fail to see the use cases for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous thing if disabled, as clients could pile up requests to the checkpointer for no real purpose.

Wouldn't it be more convenient to use "FAST" for an immediate checkpoint, defaulting to "FAST ON"?  That would be along the lines of the parameter used in the streaming protocol command BASE_BACKUP, where "FAST" disables lazy checkpointing.

I agree that the other options don't seem reasonable to expose to SQL.

--
Thanks, Bernd
On Wed, 2020-11-25 at 11:41 +0100, Bernd Helmle wrote:
> On Wednesday, 25.11.2020, at 13:47 +0900, Michael Paquier wrote:
>
>> I can see the use case for IMMEDIATE, but I fail to see the use cases for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous thing if disabled, as clients could pile up requests to the checkpointer for no real purpose.
>
> Wouldn't it be more convenient to use "FAST" for an immediate checkpoint, defaulting to "FAST ON"?  That would be along the lines of the parameter used in the streaming protocol command BASE_BACKUP, where "FAST" disables lazy checkpointing.

+1

That would also match pg_basebackup's "-c fast|spread".

Yours,
Laurenz Albe
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> On 11/24/20, 4:03 PM, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote:
>> From: Bossart, Nathan <bossartn@amazon.com>
>>> The main purpose of this patch is to give users more control over their manually requested checkpoints or restartpoints.  I suspect the most useful option is IMMEDIATE, which can help avoid checkpoint-related IO spikes.  However, I didn't see any strong reason to prevent users from also adjusting FORCE and WAIT.
>>
>> I think just IMMEDIATE would suffice, too.  But could you tell us what led you to want to give users more control?  Could you give us concrete example situations where users would want to perform CHECKPOINT with options?
>
> It may be useful for backups taken with the "consistent snapshot" approach.  As noted in the documentation [0], running CHECKPOINT before taking the snapshot can reduce recovery time.  However, users might wish to avoid the IO spike caused by an immediate checkpoint.

I'm a bit confused by the idea here.  The whole point of running a CHECKPOINT is to get the immediate behavior, with the IO spike, so that things get flushed out to disk and there's less outstanding WAL to replay during crash recovery.

Avoiding the IO spike implies that you're waiting for a regular checkpoint, and additional WAL builds up after that checkpoint starts.  You're going to have to replay that WAL during crash recovery, so you won't end up reducing your recovery time, and I'm failing to see the point.  I don't think you really get to have both: pay the price at backup time, or pay it during crash recovery.

Thanks,

Stephen
Attachment
Thanks to all for the feedback.  I've attached v2 of the patch.  I've removed the WAIT and FORCE options and renamed IMMEDIATE to FAST.

On 11/25/20, 7:52 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> I'm a bit confused by the idea here.  The whole point of running a CHECKPOINT is to get the immediate behavior, with the IO spike, so that things get flushed out to disk and there's less outstanding WAL to replay during crash recovery.
>
> Avoiding the IO spike implies that you're waiting for a regular checkpoint, and additional WAL builds up after that checkpoint starts.  You're going to have to replay that WAL during crash recovery, so you won't end up reducing your recovery time, and I'm failing to see the point.  I don't think you really get to have both: pay the price at backup time, or pay it during crash recovery.

It's true that you might not lower recovery time as much as if you did an immediate checkpoint, but presumably there can still be some benefit from doing a non-immediate checkpoint.  I think a similar argument can be made for pg_start_backup(), which AFAICT is presently the only way to manually request a non-immediate checkpoint.

Nathan
Attachment
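For context, the pg_start_backup() route mentioned above looks roughly like this (a sketch against the backup API as it existed at the time of this thread; the second argument is the "fast" flag, so passing false requests a spread checkpoint):

    -- Start a backup with a spread (non-immediate) checkpoint.
    SELECT pg_start_backup('snapshot-backup', false);

    -- ... take the storage-level snapshot here ...

    SELECT pg_stop_backup();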
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> Thanks to all for the feedback.  I've attached v2 of the patch.  I've removed the WAIT and FORCE options and renamed IMMEDIATE to FAST.
>
> On 11/25/20, 7:52 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>> I'm a bit confused by the idea here.  The whole point of running a CHECKPOINT is to get the immediate behavior, with the IO spike, so that things get flushed out to disk and there's less outstanding WAL to replay during crash recovery.
>>
>> Avoiding the IO spike implies that you're waiting for a regular checkpoint, and additional WAL builds up after that checkpoint starts.  You're going to have to replay that WAL during crash recovery, so you won't end up reducing your recovery time, and I'm failing to see the point.  I don't think you really get to have both: pay the price at backup time, or pay it during crash recovery.
>
> It's true that you might not lower recovery time as much as if you did an immediate checkpoint, but presumably there can still be some benefit from doing a non-immediate checkpoint.  I think a similar argument can be made for pg_start_backup(), which AFAICT is presently the only way to manually request a non-immediate checkpoint.

If there isn't any outstanding WAL since the last checkpoint (the only case in which the fact that pg_start_backup includes CHECKPOINT_FORCE is relevant), then performing a checkpoint isn't going to actually reduce your crash recovery time.

Also note that, in all other cases (that is, when there *is* outstanding WAL since the last checkpoint), pg_start_backup actually just waits for the existing checkpoint to complete, and while it's waiting for that to happen, there'll be additional WAL building up since that checkpoint started that will have to be replayed as part of crash recovery, just as if you took a snapshot of the system at any other time.

So, either there won't be any WAL outstanding, in which case running a CHECKPOINT FORCE just ends up creating more WAL without actually being useful, or there's WAL outstanding and the only thing this does is delay the snapshot being taken but doesn't actually reduce the amount of WAL that's going to end up being outstanding and which will have to be replayed during crash recovery.

Maybe there's a useful reason to have these options, but at least the stated one isn't it, and I wouldn't want to encourage people who are using snapshot-based backups to use these options since they aren't going to work the way that this thread is implying they would.

Thanks,

Stephen
Attachment
On 11/27/20, 8:29 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> Also note that, in all other cases (that is, when there *is* outstanding WAL since the last checkpoint), pg_start_backup actually just waits for the existing checkpoint to complete, and while it's waiting for that to happen, there'll be additional WAL building up since that checkpoint started that will have to be replayed as part of crash recovery, just as if you took a snapshot of the system at any other time.
>
> So, either there won't be any WAL outstanding, in which case running a CHECKPOINT FORCE just ends up creating more WAL without actually being useful, or there's WAL outstanding and the only thing this does is delay the snapshot being taken but doesn't actually reduce the amount of WAL that's going to end up being outstanding and which will have to be replayed during crash recovery.
>
> Maybe there's a useful reason to have these options, but at least the stated one isn't it, and I wouldn't want to encourage people who are using snapshot-based backups to use these options since they aren't going to work the way that this thread is implying they would.

I don't think it's true that pg_start_backup() just waits for the existing checkpoint to complete.  It calls RequestCheckpoint() with CHECKPOINT_WAIT, which should wait for a new checkpoint to start.

    /* Wait for a new checkpoint to start. */
    ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
    for (;;)
    {

I also tried running pg_start_backup() while an automatic checkpoint was ongoing, and it seemed to create a new one.

psql session:

    postgres=# SELECT now(); SELECT pg_start_backup('test'); SELECT now();
                  now
    -------------------------------
     2020-11-27 16:52:31.958124+00
    (1 row)

     pg_start_backup
    -----------------
     0/D3D24F0
    (1 row)

                  now
    -------------------------------
     2020-11-27 16:52:50.113372+00
    (1 row)

logs:

    2020-11-27 16:52:20.129 UTC [16029] LOG:  checkpoint starting: time
    2020-11-27 16:52:35.121 UTC [16029] LOG:  checkpoint complete...
    2020-11-27 16:52:35.122 UTC [16029] LOG:  checkpoint starting: force wait
    2020-11-27 16:52:50.110 UTC [16029] LOG:  checkpoint complete...

The patch I've submitted does the same thing.

psql session:

    postgres=# SELECT now(); CHECKPOINT (FAST FALSE); SELECT now();
                  now
    -------------------------------
     2020-11-27 16:46:39.346131+00
    (1 row)

    CHECKPOINT

                  now
    -------------------------------
     2020-11-27 16:47:05.083944+00
    (1 row)

logs:

    2020-11-27 16:46:35.056 UTC [16029] LOG:  checkpoint starting: time
    2020-11-27 16:46:50.099 UTC [16029] LOG:  checkpoint complete...
    2020-11-27 16:46:50.099 UTC [16029] LOG:  checkpoint starting: force wait
    2020-11-27 16:47:05.083 UTC [16029] LOG:  checkpoint complete...

Even if it did simply wait for the existing checkpoint to complete, isn't it still preferable to take a snapshot right after a checkpoint completes, even if it is non-immediate?  You'll need to replay WAL in either case, and it's true that you could need to replay less WAL if you take an immediate checkpoint versus a non-immediate checkpoint.  However, if you take a snapshot without a checkpoint, you might need to replay up to checkpoint_timeout + (time it takes for a non-immediate checkpoint to complete) worth of WAL.  For the logs just above this paragraph, if I take a snapshot at 16:47:04, I'd need to replay 29 seconds of WAL.  However, if I take the snapshot at 16:47:06, I only need to replay 16 seconds of WAL.

I apologize if I'm missing something obvious here.

Nathan
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> On 11/27/20, 8:29 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>> Also note that, in all other cases (that is, when there *is* outstanding WAL since the last checkpoint), pg_start_backup actually just waits for the existing checkpoint to complete, and while it's waiting for that to happen, there'll be additional WAL building up since that checkpoint started that will have to be replayed as part of crash recovery, just as if you took a snapshot of the system at any other time.
>>
>> So, either there won't be any WAL outstanding, in which case running a CHECKPOINT FORCE just ends up creating more WAL without actually being useful, or there's WAL outstanding and the only thing this does is delay the snapshot being taken but doesn't actually reduce the amount of WAL that's going to end up being outstanding and which will have to be replayed during crash recovery.
>>
>> Maybe there's a useful reason to have these options, but at least the stated one isn't it, and I wouldn't want to encourage people who are using snapshot-based backups to use these options since they aren't going to work the way that this thread is implying they would.
>
> I don't think it's true that pg_start_backup() just waits for the existing checkpoint to complete.  It calls RequestCheckpoint() with CHECKPOINT_WAIT, which should wait for a new checkpoint to start.

erm... right?  pg_start_backup waits for a new checkpoint to start.  I'm confused how that's different from "waits for the existing checkpoint to complete".  The entire point is that it *doesn't* force a checkpoint to happen immediately.

>     /* Wait for a new checkpoint to start. */
>     ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
>     for (;;)
>     {
>
> I also tried running pg_start_backup() while an automatic checkpoint was ongoing, and it seemed to create a new one.
>
> psql session:
>     postgres=# SELECT now(); SELECT pg_start_backup('test'); SELECT now();
>                   now
>     -------------------------------
>      2020-11-27 16:52:31.958124+00
>     (1 row)
>
>      pg_start_backup
>     -----------------
>      0/D3D24F0
>     (1 row)
>
>                   now
>     -------------------------------
>      2020-11-27 16:52:50.113372+00
>     (1 row)
>
> logs:
>     2020-11-27 16:52:20.129 UTC [16029] LOG:  checkpoint starting: time
>     2020-11-27 16:52:35.121 UTC [16029] LOG:  checkpoint complete...
>     2020-11-27 16:52:35.122 UTC [16029] LOG:  checkpoint starting: force wait
>     2020-11-27 16:52:50.110 UTC [16029] LOG:  checkpoint complete...

Yes- pg_start_backup specifies 'force' because it wants a checkpoint to happen even if there isn't any outstanding WAL, but it doesn't make the existing checkpoint go faster, which is the point that I'm trying to make here.

If you'd really like to test and see what happens, run a pgbench that loads the system up while doing pg_start_backup, see how long it takes before pg_start_backup returns, and see how much outstanding WAL there is between the start of the checkpoint that pg_start_backup requested and the time it returns.  To make it really clear, you should also set checkpoint_completion_target to 0.9, make sure your max_wal_size is set to a pretty large value, and make sure that the pgbench is generating enough to have pretty large checkpoints while still allowing them to happen due to 'time'.

> The patch I've submitted does the same thing.

Yes, I'm not surprised by that.  That doesn't change anything about the point I'm trying to make.

> Even if it did simply wait for the existing checkpoint to complete, isn't it still preferable to take a snapshot right after a checkpoint completes, even if it is non-immediate?  You'll need to replay WAL in either case, and it's true that you could need to replay less WAL if you take an immediate checkpoint versus a non-immediate checkpoint.

Why is it preferable?  Your argument here is that it's preferable because there'll be less outstanding WAL, but what I'm pointing out is that that's not the case, so that isn't a reason for it to be preferable, and so I'm asking: for what other reason is it preferable?  I'm not aware of one.

> However, if you take a snapshot without a checkpoint, you might need to replay up to checkpoint_timeout + (time it takes for a non-immediate checkpoint to complete) worth of WAL.  For the logs just above this paragraph, if I take a snapshot at 16:47:04, I'd need to replay 29 seconds of WAL.  However, if I take the snapshot at 16:47:06, I only need to replay 16 seconds of WAL.

This is exactly the point I'm making: if you're using a deferred checkpoint then you're still going to have up to checkpoint_timeout worth of WAL to replay.  Again, these tests aren't really worth much because the system clearly isn't under any load.

> I apologize if I'm missing something obvious here.

If you'd like to show that I'm wrong, and it's entirely possible that I am, then retry the above with actual load on the system, and also actually look at how much outstanding WAL you end up with given the different scenarios which has to be replayed during crash recovery.

Thanks,

Stephen
Attachment
On 11/27/20, 10:58 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> If you'd like to show that I'm wrong, and it's entirely possible that I am, then retry the above with actual load on the system, and also actually look at how much outstanding WAL you end up with given the different scenarios which has to be replayed during crash recovery.

I did a little experiment to show the behavior I'm referring to.  I used these settings:

    checkpoint_completion_target = 0.9
    checkpoint_timeout = 30s
    max_wal_size = 20GB
    WAL segment size is 64MB

I ran the following pgbench command for a few minutes before each test:

    pgbench postgres -T 3600 -c 64 -j 64 -N

For the first test, I killed Postgres just before an automatic, non-immediate checkpoint completed.

    2020-11-28 00:31:57 UTC::@:[51770]:LOG: checkpoint complete...
    2020-11-28 00:32:00 UTC::@:[51770]:LOG: checkpoint starting: time

Killed Postgres at 00:32:26 UTC, 29 seconds after the latest checkpoint completed.

    2020-11-28 00:32:42 UTC::@:[77256]:LOG: redo starts at 3CF/FD6B8BD0
    2020-11-28 00:32:56 UTC::@:[77256]:LOG: redo done at 3D0/C94D1D00

Recovery took 14 seconds and replayed ~3.2 GB of WAL.

    postgres=> SELECT pg_wal_lsn_diff('3D0/C94D1D00', '3CF/FD6B8BD0');
     pg_wal_lsn_diff
    -----------------
          3420557616
    (1 row)

For the second test, I killed Postgres just after an automatic, non-immediate checkpoint completed.

    2020-11-28 00:41:26 UTC::@:[77475]:LOG: checkpoint complete...

Killed Postgres at 00:41:26 UTC, just after the latest checkpoint completed.

    2020-11-28 00:41:42 UTC::@:[8599]:LOG: redo starts at 3D3/152EDD78
    2020-11-28 00:41:49 UTC::@:[8599]:LOG: redo done at 3D3/78358A40

Recovery took 7 seconds and replayed ~1.5 GB of WAL.

    postgres=> SELECT pg_wal_lsn_diff('3D3/78358A40', '3D3/152EDD78');
     pg_wal_lsn_diff
    -----------------
          1661381832
    (1 row)

Granted, I used a rather aggressive checkpoint_timeout, but I think this demonstrates that waiting for a non-immediate checkpoint to complete can lower the amount of WAL needed for recovery, even though it might not lower it as much as waiting for an immediate checkpoint would.

Nathan
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> On 11/27/20, 10:58 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>> If you'd like to show that I'm wrong, and it's entirely possible that I am, then retry the above with actual load on the system, and also actually look at how much outstanding WAL you end up with given the different scenarios which has to be replayed during crash recovery.
>
> I did a little experiment to show the behavior I'm referring to.  I

I'm rather confused why you didn't use your patch to show the actual behavior that you'll get as a result of this change..?  That seems like what would be meaningful here.  I appreciate that the analysis you did might correlate, but I don't really get why we'd try to use a proxy for this.

> used these settings:
>
>     checkpoint_completion_target = 0.9
>     checkpoint_timeout = 30s
>     max_wal_size = 20GB
>     WAL segment size is 64MB

That's an exceedingly short and very uncommon checkpoint timeout in my experience.  If anything, folks increase checkpoint timeout from the default (in order to reduce WAL traffic and because they have a replica they can flip to in the event of a crash, avoiding having to go through WAL replay on recovery), and so I'm not really sure that it's a sensible thing to look at?  Even so though...

> Recovery took 14 seconds and replayed ~3.2 GB of WAL.

[ ... ]

> Recovery took 7 seconds and replayed ~1.5 GB of WAL.

This is showing more-or-less what I expected: there's still a large amount of outstanding WAL, even if you use a very low and unusual timeout and attempt to time it perfectly.  A question that is still not clear is what happens when you actually do an immediate checkpoint; there would likely still be some outstanding WAL even in that case, but I'd expect it to be a whole lot less, which is why that comment exists in the documentation in the first place.

> Granted, I used a rather aggressive checkpoint_timeout, but I think this demonstrates that waiting for a non-immediate checkpoint to complete can lower the amount of WAL needed for recovery, even though it might not lower it as much as waiting for an immediate checkpoint would.

The difference here feels like orders of magnitude to me, between an immediate checkpoint and a non-immediate one, vs. a much smaller difference as you've shown here (though, still, kill'ing the postmaster isn't exactly the same as what your patch would be doing, so I don't really like using this particular analysis to answer this question...).

If the use-case here was just that you wanted to add more options to the CHECKPOINT command because we have them internally and maybe they'd be useful to expose, then these things probably wouldn't matter so much, but to imply that this is really going to cut down on the amount of WAL replay required a lot isn't really coming through even with these results.

Thanks,

Stephen
Attachment
On 11/28/20, 9:50 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>> Granted, I used a rather aggressive checkpoint_timeout, but I think this demonstrates that waiting for a non-immediate checkpoint to complete can lower the amount of WAL needed for recovery, even though it might not lower it as much as waiting for an immediate checkpoint would.
>
> The difference here feels like orders of magnitude to me, between an immediate checkpoint and a non-immediate one, vs. a much smaller difference as you've shown here (though, still, kill'ing the postmaster isn't exactly the same as what your patch would be doing, so I don't really like using this particular analysis to answer this question...).

I agree that using an immediate checkpoint is the best for reducing recovery time.  IMO reducing the amount of WAL to recover by up to 50% from doing no checkpoint at all is also a reasonable use case, especially if avoiding an IO spike is important.

> If the use-case here was just that you wanted to add more options to the CHECKPOINT command because we have them internally and maybe they'd be useful to expose, then these things probably wouldn't matter so much, but to imply that this is really going to cut down on the amount of WAL replay required a lot isn't really coming through even with these results.

This is how I initially presented this patch.  I only included this use case because I was asked for concrete examples of how it might be useful.

Nathan
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> On 11/28/20, 9:50 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>>> Granted, I used a rather aggressive checkpoint_timeout, but I think this demonstrates that waiting for a non-immediate checkpoint to complete can lower the amount of WAL needed for recovery, even though it might not lower it as much as waiting for an immediate checkpoint would.
>>
>> The difference here feels like orders of magnitude to me, between an immediate checkpoint and a non-immediate one, vs. a much smaller difference as you've shown here (though, still, kill'ing the postmaster isn't exactly the same as what your patch would be doing, so I don't really like using this particular analysis to answer this question...).
>
> I agree that using an immediate checkpoint is the best for reducing recovery time.  IMO reducing the amount of WAL to recover by up to 50% from doing no checkpoint at all is also a reasonable use case, especially if avoiding an IO spike is important.

Checkpoints are always happening though, that's kind of my point..?  Sure, you get lucky sometimes that the time you snapshot might have less outstanding WAL than at some other time, but I'm not convinced that this patch is really going to give a given user that much reduced amount of WAL that has to be replayed more than just randomly timing the snapshot, and if it's not clearly better, always, then I don't think we can reasonably document it as such or imply that this is how folks should implement snapshot-based backups.

>> If the use-case here was just that you wanted to add more options to the CHECKPOINT command because we have them internally and maybe they'd be useful to expose, then these things probably wouldn't matter so much, but to imply that this is really going to cut down on the amount of WAL replay required a lot isn't really coming through even with these results.
>
> This is how I initially presented this patch.  I only included this use case because I was asked for concrete examples of how it might be useful.

Did you have other concrete examples that we could reference as to when it would be useful to use these options?

Thanks,

Stephen
Attachment
On 11/29/20, 7:21 PM, "Stephen Frost" <sfrost@snowman.net> wrote:
> Checkpoints are always happening though, that's kind of my point..?  Sure, you get lucky sometimes that the time you snapshot might have less outstanding WAL than at some other time, but I'm not convinced that this patch is really going to give a given user that much reduced amount of WAL that has to be replayed more than just randomly timing the snapshot, and if it's not clearly better, always, then I don't think we can reasonably document it as such or imply that this is how folks should implement snapshot-based backups.

I see your point.  Using a non-immediate checkpoint might help you keep your recovery time slightly more consistent, but you're right that it's likely not going to be a dramatic improvement.  Your recovery time will be 1 minute versus 1-2 minutes or 2 minutes versus 2-4 minutes, not 3 seconds versus 5 minutes.

> Did you have other concrete examples that we could reference as to when it would be useful to use these options?

I don't have any at the moment.  I figured that if we're going to allow users to manually trigger checkpoints, we might as well allow them to configure it to avoid things like IO spikes.

Nathan
On the UI of this patch, you're proposing to add the option FAST.  I'm not a fan of this option name and propose that (if we have it) we use the name SPREAD instead (defaults to false).

Now we don't actually explain the term "spread" much in the documentation; we just say "the writes are spread".  But it seems more natural to build on that adjective rather than "fast/slow".

I think starting a spread checkpoint has some usefulness, if your checkpoint interval is very large but your completion target is not very close to 1.  In that case, you're expressing that you want a checkpoint to start now and not impact production unduly, so that you know when it finishes and therefore when is a good time to start a backup.  (You will still have some WAL to replay, but it won't be as much as if you just ignored checkpoint considerations completely.)

On the subject of measuring replay times for backups taken while pgbench is pounding the database, I think a realistic test does *not* have pgbench running at top speed; rather you have some non-maximal "-R xyz" option.  You would probably determine a value to use by running without -R, observing what's a typical transaction rate, and using some fraction (say, half) of that in the real run.
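For example, the rate-limited run suggested above might look something like this (a sketch; the rate shown is hypothetical and would be derived from an unthrottled run):

    # Unthrottled run to find the typical transaction rate, say ~5000 tps.
    pgbench postgres -T 300 -c 64 -j 64 -N

    # Realistic run for the replay-time measurement, throttled to about half
    # of the observed rate.
    pgbench postgres -T 3600 -c 64 -j 64 -N -R 2500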
Greetings,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> I think starting a spread checkpoint has some usefulness, if your checkpoint interval is very large but your completion target is not very close to 1.  In that case, you're expressing that you want a checkpoint to start now and not impact production unduly, so that you know when it finishes and therefore when is a good time to start a backup.  (You will still have some WAL to replay, but it won't be as much as if you just ignored checkpoint considerations completely.)

You could view an immediate checkpoint as more-or-less being a 'spread' checkpoint with a checkpoint completion target approaching 0.  In the end, it's all about how much time you're going to spend trying to get the data written out, because the WAL that's generated during that time is what's going to have to get replayed.

If the goal is to not end up with an increase in IO from this, then you want to spread things out as much as you can over as much time as you're able to, but that then means that you're going to have that much WAL to replay.  If you're alright with performing IO to get the amount of WAL to replay to be minimal, then you just run 'CHECKPOINT;' before your backup and you're good to go (which is why that's in the documentation as a way to reduce your WAL replay time: it reduces it as much as possible given your IO capabilities).

If you don't mind the increased amount of IO and WAL, you could just reduce checkpoint_timeout, and then crash recovery and snapshot-based backup recovery will also be reduced, no matter when you actually take the snapshot.

> On the subject of measuring replay times for backups taken while pgbench is pounding the database, I think a realistic test does *not* have pgbench running at top speed; rather you have some non-maximal "-R xyz" option.  You would probably determine a value to use by running without -R, observing what's a typical transaction rate, and using some fraction (say, half) of that in the real run.

That'd halve the amount of WAL being generated per unit time, but I don't think it really changes much when it comes to this particular analysis..?  If you generate 16MB of WAL per minute, and the checkpoint timeout is 5 minutes, with a checkpoint target of 0.9, then at more-or-less any point in time you've got ~5 minutes worth of WAL outstanding, or around 80MB.  If your completion target is 0.5 then, really, you might as well make it 0.9 and have your timeout be 2.5m, so that you've got a steady state of around 40MB of WAL outstanding.

What I'm getting around to is that the only place this kind of thing makes sense is where you're front-loading all your IO during the checkpoint because your checkpoint completion target is less than 0.9, and then, sure, there's a difference between snapshotting right when the checkpoint completes vs. later, because if you wait around to snapshot, we aren't actually doing IO during that time and are just letting the WAL build up.  But that's an argument to remove checkpoint completion target as an option that doesn't really make much sense in the first place, imv, and recommend folks tune checkpoint timeout for the amount of outstanding WAL they want to have when they are doing recovery (either from a crash or from a snapshot).

Thanks,

Stephen
Attachment
On 12/4/20, 1:47 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On the UI of this patch, you're proposing to add the option FAST.  I'm not a fan of this option name and propose that (if we have it) we use the name SPREAD instead (defaults to false).
>
> Now we don't actually explain the term "spread" much in the documentation; we just say "the writes are spread".  But it seems more natural to build on that adjective rather than "fast/slow".

Here is a version of the patch that uses SPREAD instead of FAST.

Nathan
Attachment
On 2020-Dec-04, Bossart, Nathan wrote:
> On 12/4/20, 1:47 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
>> On the UI of this patch, you're proposing to add the option FAST.  I'm not a fan of this option name and propose that (if we have it) we use the name SPREAD instead (defaults to false).
>>
>> Now we don't actually explain the term "spread" much in the documentation; we just say "the writes are spread".  But it seems more natural to build on that adjective rather than "fast/slow".
>
> Here is a version of the patch that uses SPREAD instead of FAST.

WFM.

Instead of adding checkpt_option_list, how about utility_option_list?  It seems intended for reuse.
On 12/4/20, 3:33 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> Instead of adding checkpt_option_list, how about utility_option_list?  It seems intended for reuse.

Ah, good call.  That simplifies the grammar changes quite a bit.

Nathan
Attachment
On Sat, Dec 05, 2020 at 12:11:13AM +0000, Bossart, Nathan wrote:
> On 12/4/20, 3:33 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
>> Instead of adding checkpt_option_list, how about utility_option_list?  It seems intended for reuse.

+1.  It is intended for reuse.

> Ah, good call.  That simplifies the grammar changes quite a bit.

    +CHECKPOINT;
    +CHECKPOINT (SPREAD);
    +CHECKPOINT (SPREAD FALSE);
    +CHECKPOINT (SPREAD ON);
    +CHECKPOINT (SPREAD 0);
    +CHECKPOINT (SPREAD 2);
    +ERROR:  spread requires a Boolean value
    +CHECKPOINT (NONEXISTENT);
    +ERROR:  unrecognized CHECKPOINT option "nonexistent"
    +LINE 1: CHECKPOINT (NONEXISTENT);

Testing for negative cases like those two last ones is fine by me, but I don't much like the idea of running 5 checkpoints as part of the main regression test suite (think installcheck with a large shared buffer pool, for example).

    --- a/src/include/postmaster/bgwriter.h
    +++ b/src/include/postmaster/bgwriter.h
    @@ -15,6 +15,8 @@
     #ifndef _BGWRITER_H
     #define _BGWRITER_H

    +#include "nodes/parsenodes.h"
    +#include "parser/parse_node.h"

I don't think you need to include parsenodes.h here.

    +void
    +ExecCheckPointStmt(ParseState *pstate, CheckPointStmt *stmt)
    +{

Nit: perhaps this could just be ExecCheckPoint()?  See the existing ExecVacuum().

    +   flags = CHECKPOINT_WAIT |
    +       (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE) |
    +       (spread ? 0 : CHECKPOINT_IMMEDIATE);

The handling done for CHECKPOINT_FORCE and CHECKPOINT_WAIT deserves a comment.
--
Michael
Attachment
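To illustrate the kind of comment being asked for, a rough sketch of that flag computation with explanatory comments (the wording is only an example, not the patch's actual text):

    #include "access/xlog.h"           /* RecoveryInProgress(), CHECKPOINT_* flags */
    #include "postmaster/bgwriter.h"   /* RequestCheckpoint() */

    int         flags;

    /*
     * Always wait for the checkpoint to complete before returning.  Outside
     * of recovery, force a checkpoint even if no WAL has been written since
     * the last one; during recovery, leaving CHECKPOINT_FORCE unset allows a
     * redundant restartpoint to be skipped.  Unless a spread checkpoint was
     * requested, do not throttle the writes.
     */
    flags = CHECKPOINT_WAIT |
        (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE) |
        (spread ? 0 : CHECKPOINT_IMMEDIATE);

    RequestCheckpoint(flags);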
Thanks for reviewing.

On 12/4/20, 5:44 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> On Sat, Dec 05, 2020 at 12:11:13AM +0000, Bossart, Nathan wrote:
>> On 12/4/20, 3:33 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
>>> Instead of adding checkpt_option_list, how about utility_option_list?  It seems intended for reuse.
>
> +1.  It is intended for reuse.
>
>> Ah, good call.  That simplifies the grammar changes quite a bit.
>
>     +CHECKPOINT;
>     +CHECKPOINT (SPREAD);
>     +CHECKPOINT (SPREAD FALSE);
>     +CHECKPOINT (SPREAD ON);
>     +CHECKPOINT (SPREAD 0);
>     +CHECKPOINT (SPREAD 2);
>     +ERROR:  spread requires a Boolean value
>     +CHECKPOINT (NONEXISTENT);
>     +ERROR:  unrecognized CHECKPOINT option "nonexistent"
>     +LINE 1: CHECKPOINT (NONEXISTENT);
>
> Testing for negative cases like those two last ones is fine by me, but I don't much like the idea of running 5 checkpoints as part of the main regression test suite (think installcheck with a large shared buffer pool, for example).
>
>     --- a/src/include/postmaster/bgwriter.h
>     +++ b/src/include/postmaster/bgwriter.h
>     @@ -15,6 +15,8 @@
>      #ifndef _BGWRITER_H
>      #define _BGWRITER_H
>
>     +#include "nodes/parsenodes.h"
>     +#include "parser/parse_node.h"
>
> I don't think you need to include parsenodes.h here.
>
>     +void
>     +ExecCheckPointStmt(ParseState *pstate, CheckPointStmt *stmt)
>     +{
>
> Nit: perhaps this could just be ExecCheckPoint()?  See the existing ExecVacuum().
>
>     +   flags = CHECKPOINT_WAIT |
>     +       (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE) |
>     +       (spread ? 0 : CHECKPOINT_IMMEDIATE);
>
> The handling done for CHECKPOINT_FORCE and CHECKPOINT_WAIT deserves a comment.

This all seems reasonable to me.  I've attached a new version of the patch.

Nathan
Attachment
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> This all seems reasonable to me.  I've attached a new version of the patch.

    diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
    index 2afee6d7b5..2b1e56fbd7 100644
    --- a/doc/src/sgml/ref/checkpoint.sgml
    +++ b/doc/src/sgml/ref/checkpoint.sgml

    +  <para>
    +   Note that the server may consolidate concurrently requested checkpoints or
    +   restartpoints.  Such consolidated requests will contain a combined set of
    +   options.  For example, if one session requested a spread checkpoint and
    +   another session requested a fast checkpoint, the server may combine these
    +   requests and perform one fast checkpoint.
    +  </para>

[ ... ]

    + <refsect1>
    +  <title>Parameters</title>
    +
    +  <variablelist>
    +   <varlistentry>
    +    <term><literal>SPREAD</literal></term>
    +    <listitem>
    +     <para>
    +      Specifies that checkpoint activity should be throttled based on the
    +      setting for the <xref linkend="guc-checkpoint-completion-target"/>
    +      parameter.  If the option is turned off, <command>CHECKPOINT</command>
    +      creates a checkpoint or restartpoint as fast as possible.  By default,
    +      <literal>SPREAD</literal> is turned off, and the checkpoint activity is
    +      not throttled.
    +     </para>
    +    </listitem>
    +   </varlistentry>

So, just to be clear: CHECKPOINTs are more-or-less always happening in PG, and running this command might do something or might end up doing nothing depending on whether a checkpoint is already in progress and this request just gets consolidated into an existing one, and it won't actually reduce the amount of WAL replay except in the case where checkpoint completion target is set to make a checkpoint happen in less time than checkpoint timeout, which ultimately isn't a great way to run the system anyway.

Assuming we actually want to do this, which I still generally don't agree with since it isn't really clear if it'll actually end up doing something or not, wouldn't it make more sense to have a command that just sits and waits for the currently running (or next) checkpoint to complete..?  For the use-case that was explained, at least, we don't actually need to cause another checkpoint to happen, we just want to know when a checkpoint has completed, right?

Is there some other reason for this that isn't being explained..?

Thanks,

Stephen
Attachment
On 12/5/20, 6:41 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> Assuming we actually want to do this, which I still generally don't agree with since it isn't really clear if it'll actually end up doing something or not, wouldn't it make more sense to have a command that just sits and waits for the currently running (or next) checkpoint to complete..?  For the use-case that was explained, at least, we don't actually need to cause another checkpoint to happen, we just want to know when a checkpoint has completed, right?

If it's enough to just wait for the current checkpoint to complete or to wait for the next one to complete, I suppose you could just poll pg_control_checkpoint().  I think the only downside is that you could end up sitting idle for a while, especially if checkpoint_timeout is high and checkpoint_completion_target is low.  But, as you point out, that may not be a typically recommended way to configure the system.

Nathan
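For reference, that polling approach might look roughly like this (a sketch; how often to poll and when to give up are left to the caller):

    -- Note the completion time and LSN of the most recent checkpoint.
    SELECT checkpoint_time, checkpoint_lsn FROM pg_control_checkpoint();

    -- Re-run the same query periodically; once checkpoint_time (or
    -- checkpoint_lsn) has advanced, another checkpoint has completed and
    -- it should be a reasonable moment to take the snapshot.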
Greetings,

* Bossart, Nathan (bossartn@amazon.com) wrote:
> On 12/5/20, 6:41 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>> Assuming we actually want to do this, which I still generally don't agree with since it isn't really clear if it'll actually end up doing something or not, wouldn't it make more sense to have a command that just sits and waits for the currently running (or next) checkpoint to complete..?  For the use-case that was explained, at least, we don't actually need to cause another checkpoint to happen, we just want to know when a checkpoint has completed, right?
>
> If it's enough to just wait for the current checkpoint to complete or to wait for the next one to complete, I suppose you could just poll pg_control_checkpoint().  I think the only downside is that you could end up sitting idle for a while, especially if checkpoint_timeout is high and checkpoint_completion_target is low.  But, as you point out, that may not be a typically recommended way to configure the system.

Maybe I missed something, but aren't you going to be waiting a while with this patch given that it's asking for a spread checkpoint too..?

I agree that you could just monitor for the next checkpoint using pg_control_checkpoint(), which is why I'm wondering again what the point is behind this patch...  I'm trying to understand why we'd be encouraging people to increase the number of checkpoints that are happening when they're still going to be waiting around for that spread (in other words, non-immediate) checkpoint to happen (just as if they'd waited until the next regular checkpoint), and they're still going to have a fair bit of WAL to replay because it'll be however much WAL has been written since we started the spread/non-immediate checkpoint.

Thanks,

Stephen
Attachment
On 2020-Dec-05, Stephen Frost wrote:
> So, just to be clear: CHECKPOINTs are more-or-less always happening in PG, and running this command might do something or might end up doing nothing depending on whether a checkpoint is already in progress and this request just gets consolidated into an existing one, and it won't actually reduce the amount of WAL replay except in the case where checkpoint completion target is set to make a checkpoint happen in less time than checkpoint timeout, which ultimately isn't a great way to run the system anyway.

You keep making this statement, and I don't necessarily disagree, but if that is the case, please explain why we don't have checkpoint_completion_target set to 0.9 by default?  Should we change that?

> Assuming we actually want to do this, which I still generally don't agree with since it isn't really clear if it'll actually end up doing something or not, wouldn't it make more sense to have a command that just sits and waits for the currently running (or next) checkpoint to complete..?  For the use-case that was explained, at least, we don't actually need to cause another checkpoint to happen, we just want to know when a checkpoint has completed, right?

Yes, I agree that the use case for this is unclear.
Greetings,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> On 2020-Dec-05, Stephen Frost wrote:
>> So, just to be clear: CHECKPOINTs are more-or-less always happening in PG, and running this command might do something or might end up doing nothing depending on whether a checkpoint is already in progress and this request just gets consolidated into an existing one, and it won't actually reduce the amount of WAL replay except in the case where checkpoint completion target is set to make a checkpoint happen in less time than checkpoint timeout, which ultimately isn't a great way to run the system anyway.
>
> You keep making this statement, and I don't necessarily disagree, but if that is the case, please explain why we don't have checkpoint_completion_target set to 0.9 by default?  Should we change that?

Yes, I do think we should change that.  In fact, I'd argue that we can probably get rid of checkpoint_completion_target entirely as an option.  The main argument against that is that it could be annoying for people upgrading, but changing the default to 0.9 would definitely be an improvement.

Thanks,

Stephen
Attachment
On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:
> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
>> You keep making this statement, and I don't necessarily disagree, but if that is the case, please explain why we don't have checkpoint_completion_target set to 0.9 by default?  Should we change that?
>
> Yes, I do think we should change that.

Agreed.  FWIW, no idea about others, but it is one of those parameters I keep telling people to update after a default installation.

> In fact, I'd argue that we can probably get rid of checkpoint_completion_target entirely as an option.  The main argument against that is that it could be annoying for people upgrading, but changing the default to 0.9 would definitely be an improvement.

Not sure there is enough ground to do that, though.
--
Michael
Attachment
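For context, the post-install tuning being described amounts to something like the following in postgresql.conf (the values are only an example; 0.5 was the shipped default under discussion):

    # Spread each checkpoint's writes over most of the checkpoint interval
    # instead of front-loading them.
    checkpoint_timeout = 5min                 # default
    checkpoint_completion_target = 0.9        # shipped default at the time: 0.5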
On 12/5/20, 9:11 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
> * Bossart, Nathan (bossartn@amazon.com) wrote:
>> On 12/5/20, 6:41 AM, "Stephen Frost" <sfrost@snowman.net> wrote:
>>> Assuming we actually want to do this, which I still generally don't agree with since it isn't really clear if it'll actually end up doing something or not, wouldn't it make more sense to have a command that just sits and waits for the currently running (or next) checkpoint to complete..?  For the use-case that was explained, at least, we don't actually need to cause another checkpoint to happen, we just want to know when a checkpoint has completed, right?
>>
>> If it's enough to just wait for the current checkpoint to complete or to wait for the next one to complete, I suppose you could just poll pg_control_checkpoint().  I think the only downside is that you could end up sitting idle for a while, especially if checkpoint_timeout is high and checkpoint_completion_target is low.  But, as you point out, that may not be a typically recommended way to configure the system.
>
> Maybe I missed something, but aren't you going to be waiting a while with this patch given that it's asking for a spread checkpoint too..?
>
> I agree that you could just monitor for the next checkpoint using pg_control_checkpoint(), which is why I'm wondering again what the point is behind this patch...  I'm trying to understand why we'd be encouraging people to increase the number of checkpoints that are happening when they're still going to be waiting around for that spread (in other words, non-immediate) checkpoint to happen (just as if they'd waited until the next regular checkpoint), and they're still going to have a fair bit of WAL to replay because it'll be however much WAL has been written since we started the spread/non-immediate checkpoint.

I plan to mark this patch as withdrawn after the next commitfest unless anyone objects.

Nathan
On Mon, Dec 7, 2020 at 11:22:01AM +0900, Michael Paquier wrote:
> On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:
>> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
>>> You keep making this statement, and I don't necessarily disagree, but if that is the case, please explain why we don't have checkpoint_completion_target set to 0.9 by default?  Should we change that?
>>
>> Yes, I do think we should change that.
>
> Agreed.  FWIW, no idea about others, but it is one of those parameters I keep telling people to update after a default installation.

+1 for making it 0.9.  FYI, Robert Haas has argued that our estimation of completion isn't great, which is why it defaults to 0.5.

--
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EnterpriseDB                      https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee