Thread: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Jeff Janes
Date:
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested Simple patch, applies and makes cleanly, does what it says and says what it does. If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. Marking it ready for committer. The new status of this patch is: Ready for Committer
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Nov 2, 2015 at 2:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > Simple patch, applies and makes cleanly, does what it says and says what it does. > > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. > > Marking it ready for committer. > > The new status of this patch is: Ready for Committer Thanks! That was deadly fast. Just wondering: shouldn't we keep the discussion around this patch on -bugs instead? Not saying you are wrong, Jeff, I am just not sure what would be the best practice regarding patches related to bugs. I would think that it is at least necessary to keep the person who reported the bug in CC to let him know the progress though. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Jeff Janes
Date:
On Sun, Nov 1, 2015 at 11:09 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Nov 2, 2015 at 2:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Simple patch, applies and makes cleanly, does what it says and says what it does. >> >> If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. >> >> Marking it ready for committer. >> >> The new status of this patch is: Ready for Committer > > Thanks! That was deadly fast. > > Just wondering: shouldn't we keep the discussion around this patch on > -bugs instead? Not saying you are wrong, Jeff, I am just not sure what > would be the best practice regarding patches related to bugs. I would > think that it is at least necessary to keep the person who reported > the bug in CC to let him know the progress though. I wasn't sure about -bugs vs -hackers either, but in this case I used the review form built into the commit-fest app, and the app is what sent the email. As far as I know I can't change its destination or its CC list, even if I had thought ahead to do so. I think the bug reporter should certainly be CCed when the bug is closed, I don't know about intermediate steps in the "sausage making" process. Something to think about for a bug-tracker we might implement in the future. I think most bugs are summarily handled by committers, so don't go through the commitfest process at all. Cheers, Jeff
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. I'm sure other people here understand this better than me, but I wonder if it wouldn't make more sense to somehow log this data only if something material has changed in the data being logged. This seems to be trying to log something only if something else has been written to WAL, which I'm not sure is the right test. Also, this check here: + if (last_snapshot_lsn != insert_lsn && + checkpoint_lsn != insert_lsn && + checkpoint_lsn != previous_lsn) ...seems like it will fire if there have been 0 or 1 WAL records since the last checkpoint, regardless of what they are. I'm not sure that's the right test, and it'll break again the minute we have a third thing we want to log only if the system is non-idle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2015-11-03 10:23:35 -0500, Robert Haas wrote: > On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. > > I'm sure other people here understand this better than me, but I > wonder if it wouldn't make more sense to somehow log this data only if > something material has changed in the data being logged. Phew. That doesn't seem easy to measure. I'm doubtful that it's worth comparing the snapshot and such, especially in the back branches. We could maybe add something that we only log a snapshot if XXX megabytes have been logged or something. But I don't know which number to pick here - and if there's other write activity the price of a snapshot record really isn't high. Andres
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-03 10:23:35 -0500, Robert Haas wrote: >> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. >> >> I'm sure other people here understand this better than me, but I >> wonder if it wouldn't make more sense to somehow log this data only if >> something material has changed in the data being logged. > > Phew. That doesn't seem easy to measure. I'm doubtful that it's worth > comparing the snapshot and such, especially in the back > branches. Well, I guess that's why I thought it would be more simple to check if we are at the beginning of a segment at first sight. This has no chance to break if anything else like that is being added in the future as it doesn't depend on the record types, though new similar records added on a timely manner would need a similar check. Perhaps this could be coupled by a check on the last XLOG_SWITCH_XLOG record instead of checkpoint activity though. > We could maybe add something that we only log a snapshot if XXX > megabytes have been logged or something. But I don't know which number > to pick here - and if there's other write activity the price of a > snapshot record really isn't high. On a completely idle system, I don't think we should log any standby records. This is what ~9.3 does. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote: >On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de> >wrote: >> On 2015-11-03 10:23:35 -0500, Robert Haas wrote: >>> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> >wrote: >>> > If a transaction holding locks aborts on an otherwise idle server, >perhaps it will take a very long time for a log-shipping standby to >realize this. But I have hard time believing that anyone who cares >about that would be using log-shipping (rather than streaming) anyway. >>> >>> I'm sure other people here understand this better than me, but I >>> wonder if it wouldn't make more sense to somehow log this data only >if >>> something material has changed in the data being logged. >> >> Phew. That doesn't seem easy to measure. I'm doubtful that it's worth >> comparing the snapshot and such, especially in the back >> branches. > >Well, I guess that's why I thought it would be more simple to check if >we are at the beginning of a segment at first sight. This has no >chance to break if anything else like that is being added in the >future as it doesn't depend on the record types, though new similar >records added on a timely manner would need a similar check. Perhaps >this could be coupled by a check on the last XLOG_SWITCH_XLOG record >instead of checkpoint activity though. > >> We could maybe add something that we only log a snapshot if XXX >> megabytes have been logged or something. But I don't know which >number >> to pick here - and if there's other write activity the price of a >> snapshot record really isn't high. > >On a completely idle system, I don't think we should log any standby >records. This is what ~9.3 does. Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't seemto be the case. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote: > On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote: >>On a completely idle system, I don't think we should log any standby >>records. This is what ~9.3 does. > > Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't seemto be the case. Er, yes, sorry. I should have used clearer words: I meant idle system with something running nothing including internal checkpoints. An instance indeed generates a XLOG_RUNNING_XACTS record before a checkpoint record on an idle system. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-03 10:23:35 -0500, Robert Haas wrote: >> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a log-shippingstandby to realize this. But I have hard time believing that anyone who cares about that would be using log-shipping(rather than streaming) anyway. >> >> I'm sure other people here understand this better than me, but I >> wonder if it wouldn't make more sense to somehow log this data only if >> something material has changed in the data being logged. > > Phew. That doesn't seem easy to measure. I'm doubtful that it's worth > comparing the snapshot and such, especially in the back > branches. > > We could maybe add something that we only log a snapshot if XXX > megabytes have been logged or something. But I don't know which number > to pick here - and if there's other write activity the price of a > snapshot record really isn't high. My first guess on the matter is that we would like to have an extra condition that depends on max_wal_size with at least a minimum number of segments generated since the last standby snapshot, perhaps max_wal_size / 16, but this coefficient is clearly a rule of thumb. With the default configuration of 1GB, that would be waiting for 4 segments to be generated before logging in a standby snapshot. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2015-11-04 16:01:28 +0900, Michael Paquier wrote: > On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote: > > On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote: > >>On a completely idle system, I don't think we should log any standby > >>records. This is what ~9.3 does. > > > > Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't seemto be the case. > > Er, yes, sorry. I should have used clearer words: I meant idle system > with something running nothing including internal checkpoints. Uh, but you'll always have checkpoints happen on wal_level = hot_standby, even in 9.3? Maybe I'm not parsing your sentence right. As soon as a single checkpoint ever happened the early-return logic in CreateCheckPoint() will fail to take the LogStandbySnapshot() in CreateCheckPoint() into account. The test is: if (curInsert == ControlFile->checkPoint +MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint))&&ControlFile->checkPoint == ControlFile->checkPointCopy.redo) which obviously doesn't work if there's been a WAL record logged after the redo pointer has been determined etc. The reason that a single checkpoint is needed to "jumpstart" the pointless checkpoints is that otherwise we'll never have issued a LogStandbySnapshot() and thus the above code block works if we started from a proper shutdown checkpoint. Independent of the idle issue, it seems to me that the location of the LogStandbySnapshot() is actually rather suboptimal - it really should really be before the CheckPointGuts(), not afterwards. As closer it's to the redo pointer of the checkpoint a hot standby node starts up from, the sooner that node can reach consistency. There's no difference for the first time a node starts from a basebackup (since we gotta replay that checkpoint anyway before we're consistent), but if we start from a restartpoint... Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Nov 4, 2015 at 7:33 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-04 16:01:28 +0900, Michael Paquier wrote: >> On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote: >> > On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote: >> >>On a completely idle system, I don't think we should log any standby >> >>records. This is what ~9.3 does. >> > >> > Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn'tseem to be the case. >> >> Er, yes, sorry. I should have used clearer words: I meant idle system >> with something running nothing including internal checkpoints. > > Uh, but you'll always have checkpoints happen on wal_level = > hot_standby, even in 9.3? Maybe I'm not parsing your sentence right. Reading again my previous sentence I cannot get the meaning of it myself :) Well, I just meant that in ~9.3 LogStandbySnapshot() is called at each checkpoint, checkpoints occurring after checkpoint_timeout even if the system is idle. > As soon as a single checkpoint ever happened the early-return logic in > CreateCheckPoint() will fail to take the LogStandbySnapshot() in > CreateCheckPoint() into account. The test is: > if (curInsert == ControlFile->checkPoint + > MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) && > ControlFile->checkPoint == ControlFile->checkPointCopy.redo) > which obviously doesn't work if there's been a WAL record logged after > the redo pointer has been determined etc. Yes. If segment switches are enforced at a pace faster than checkpoint_timeout, this check considers that a checkpoint needs to happen because a SWITCH_XLOG record is in-between. I am a bit surprised that this should happen actually. The segment switch triggers a checkpoint record, and vice-versa, even for idle systems. Shouldn't we make this check a bit smarter then? > The reason that a single checkpoint is needed to "jumpstart" the > pointless checkpoints is that otherwise we'll never have issued a > LogStandbySnapshot() and thus the above code block works if we started > from a proper shutdown checkpoint. > > Independent of the idle issue, it seems to me that the location of the > LogStandbySnapshot() is actually rather suboptimal - it really should > really be before the CheckPointGuts(), not afterwards. As closer it's to > the redo pointer of the checkpoint a hot standby node starts up from, > the sooner that node can reach consistency. There's no difference for > the first time a node starts from a basebackup (since we gotta replay > that checkpoint anyway before we're consistent), but if we start from a > restartpoint... Agreed. LogStandbySnapshot() is called after CheckPointGuts() since its introduction in efc16ea5. This may save time. This would surely be a master-only optimization though. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Thu, Nov 5, 2015 at 3:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Nov 4, 2015 at 7:33 PM, Andres Freund wrote: >> As soon as a single checkpoint ever happened the early-return logic in >> CreateCheckPoint() will fail to take the LogStandbySnapshot() in >> CreateCheckPoint() into account. The test is: >> if (curInsert == ControlFile->checkPoint + >> MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) && >> ControlFile->checkPoint == ControlFile->checkPointCopy.redo) >> which obviously doesn't work if there's been a WAL record logged after >> the redo pointer has been determined etc. > > Yes. If segment switches are enforced at a pace faster than > checkpoint_timeout, this check considers that a checkpoint needs to > happen because a SWITCH_XLOG record is in-between. I am a bit > surprised that this should happen actually. The segment switch > triggers a checkpoint record, and vice-versa, even for idle systems. > Shouldn't we make this check a bit smarter then? Ah, the documentation clearly explains that setting archive_timeout will enforce a segment switch if any activity occurred, including a checkpoint: http://www.postgresql.org/docs/devel/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-ARCHIVING I missed that, sorry. I have as well thought a bit about adding a space-related constraint on the standby snapshot generated by the bgwriter, so as to not rely entirely on the interval of 15s. I finished with the attached that uses a check based on CheckPointSegments / 8 to be sure that at least this number of segments has been generated since the last checkpoint before logging a new snapshot. I guess that's less brittle than the last patch. Thoughts? -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > I have as well thought a bit about adding a space-related constraint > on the standby snapshot generated by the bgwriter, so as to not rely > entirely on the interval of 15s. I finished with the attached that > uses a check based on CheckPointSegments / 8 to be sure that at least > this number of segments has been generated since the last checkpoint > before logging a new snapshot. I guess that's less brittle than the > last patch. Thoughts? I can't see why that would be a good idea. My understanding is that the logical decoding code needs to get those messages pretty regularly, and I don't see why that need would be reduced on systems where CheckPointSegments is large. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2015-11-06 11:42:56 -0500, Robert Haas wrote: > On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > I have as well thought a bit about adding a space-related constraint > > on the standby snapshot generated by the bgwriter, so as to not rely > > entirely on the interval of 15s. I finished with the attached that > > uses a check based on CheckPointSegments / 8 to be sure that at least > > this number of segments has been generated since the last checkpoint > > before logging a new snapshot. I guess that's less brittle than the > > last patch. Thoughts? > > I can't see why that would be a good idea. My understanding is that > the logical decoding code needs to get those messages pretty > regularly, and I don't see why that need would be reduced on systems > where CheckPointSegments is large. Precisely. What I'm thinking of right now is a marker somewhere in shared memory, that tells whether anything worthwhile has happened since the last determination of the redo pointer. Where standby snapshots don't count. That seems like it'd be to maintain going forward than doing precise size calculations like CreateCheckPoint() already does, and would additionally need to handle its own standby snapshot, not to speak of the background ones. Seems like it'd be doable in ReserveXLogInsertLocation(). Whether it's actually worthwhile I'm not all that sure tho. Andres
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-06 11:42:56 -0500, Robert Haas wrote: >> On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > I have as well thought a bit about adding a space-related constraint >> > on the standby snapshot generated by the bgwriter, so as to not rely >> > entirely on the interval of 15s. I finished with the attached that >> > uses a check based on CheckPointSegments / 8 to be sure that at least >> > this number of segments has been generated since the last checkpoint >> > before logging a new snapshot. I guess that's less brittle than the >> > last patch. Thoughts? >> >> I can't see why that would be a good idea. My understanding is that >> the logical decoding code needs to get those messages pretty >> regularly, and I don't see why that need would be reduced on systems >> where CheckPointSegments is large. > > Precisely. > > What I'm thinking of right now is a marker somewhere in shared memory, > that tells whether anything worthwhile has happened since the last > determination of the redo pointer. Where standby snapshots don't > count. That seems like it'd be to maintain going forward than doing > precise size calculations like CreateCheckPoint() already does, and > would additionally need to handle its own standby snapshot, not to speak > of the background ones. Good idea. > Seems like it'd be doable in ReserveXLogInsertLocation(). > > Whether it's actually worthwhile I'm not all that sure tho. Why not? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On November 6, 2015 6:21:50 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote: >On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de> >wrote: >> Seems like it'd be doable in ReserveXLogInsertLocation(). >> >> Whether it's actually worthwhile I'm not all that sure tho. > >Why not? Adds another instruction in one of the hottest spinlock protected sections of PG. Probably won't be significant, but... --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Fri, Nov 6, 2015 at 12:26 PM, Andres Freund <andres@anarazel.de> wrote: > On November 6, 2015 6:21:50 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote: >>On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de> >>wrote: >>> Seems like it'd be doable in ReserveXLogInsertLocation(). >>> >>> Whether it's actually worthwhile I'm not all that sure tho. >> >>Why not? > > Adds another instruction in one of the hottest spinlock protected sections of PG. Probably won't be significant, but... Oh. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Nov 7, 2015 at 1:52 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-06 11:42:56 -0500, Robert Haas wrote: >> On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > I have as well thought a bit about adding a space-related constraint >> > on the standby snapshot generated by the bgwriter, so as to not rely >> > entirely on the interval of 15s. I finished with the attached that >> > uses a check based on CheckPointSegments / 8 to be sure that at least >> > this number of segments has been generated since the last checkpoint >> > before logging a new snapshot. I guess that's less brittle than the >> > last patch. Thoughts? >> >> I can't see why that would be a good idea. My understanding is that >> the logical decoding code needs to get those messages pretty >> regularly, and I don't see why that need would be reduced on systems >> where CheckPointSegments is large. > > Precisely. > > > What I'm thinking of right now is a marker somewhere in shared memory, > that tells whether anything worthwhile has happened since the last > determination of the redo pointer. Where standby snapshots don't > count. That seems like it'd be to maintain going forward than doing > precise size calculations like CreateCheckPoint() already does, and > would additionally need to handle its own standby snapshot, not to speak > of the background ones. I thought about something like that at some point by saving a minimum activity pointer in XLogCtl, updated each time a segment was forcibly switched or after inserting a checkpoint record. Then the bgwriter looked at if the current insert position matched this minimum activity pointer, skipping LogStandbySnapshot if both positions match. Does this match your line of thoughts? > Seems like it'd be doable in ReserveXLogInsertLocation(). > Whether it's actually worthwhile I'm not all that sure tho. I am not sure doing extra work there is a good idea. There are already 5 instructions within the spinlock section, and this code path is already high in many profiles for busy systems. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote: > I thought about something like that at some point by saving a minimum > activity pointer in XLogCtl, updated each time a segment was forcibly > switched or after inserting a checkpoint record. Then the bgwriter > looked at if the current insert position matched this minimum activity > pointer, skipping LogStandbySnapshot if both positions match. Does > this match your line of thoughts? Looking at the code, it occurred to me that the LSN position saved for a XLOG_SWITCH record is the last position of current segment, so we would still need to check if the current insert LSN matches the beginning of a new segment and if the last segment was forcibly switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example. Thoughts? -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote: >> I thought about something like that at some point by saving a minimum >> activity pointer in XLogCtl, updated each time a segment was forcibly >> switched or after inserting a checkpoint record. Then the bgwriter >> looked at if the current insert position matched this minimum activity >> pointer, skipping LogStandbySnapshot if both positions match. Does >> this match your line of thoughts? > > Looking at the code, it occurred to me that the LSN position saved for > a XLOG_SWITCH record is the last position of current segment, so we > would still need to check if the current insert LSN matches the > beginning of a new segment and if the last segment was forcibly > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example. > Thoughts? I haven't given up on this patch yet, and putting again my head on this problem I have finished with the patch attached, which checks if the current insert LSN position is at the beginning of a segment that has just been switched to decide if a standby snapshot should be logged or not. This allows bringing back an idle system to the pre-9.3 state where a segment would be archived in the case of a low archive_timeout only when a checkpoint has been issued on the system. In order to achieve this idea I have added a field on XLogCtl that saves the last time a segment has been forcibly switched after XLOG_SWITCH. Honestly I am failing to see why we should track the progress since last checkpoint as mentioned upthread, and the current behavior is certainly a regression. Speaking of which, this patch was registered in this CF, I am moving it to the next as a bug fix. Regards, -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Speaking of which, this patch was registered in this CF, I am moving > it to the next as a bug fix. I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it is possible that the return LSN pointer is on the new segment that has been forcibly archived if RequestXLogSwitch is called multiple times and that subsequent calls are not necessary. Patch updated. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> I thought about something like that at some point by saving a minimum
> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> looked at if the current insert position matched this minimum activity
> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> this match your line of thoughts?
> >
> > Looking at the code, it occurred to me that the LSN position saved for
> > a XLOG_SWITCH record is the last position of current segment, so we
> > would still need to check if the current insert LSN matches the
> > beginning of a new segment and if the last segment was forcibly
> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> > Thoughts?
>
> I haven't given up on this patch yet, and putting again my head on
> this problem I have finished with the patch attached, which checks if
> the current insert LSN position is at the beginning of a segment that
> has just been switched to decide if a standby snapshot should be
> logged or not. This allows bringing back an idle system to the pre-9.3
> state where a segment would be archived in the case of a low
> archive_timeout only when a checkpoint has been issued on the system.
> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> I thought about something like that at some point by saving a minimum
> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> looked at if the current insert position matched this minimum activity
> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> this match your line of thoughts?
> >
> > Looking at the code, it occurred to me that the LSN position saved for
> > a XLOG_SWITCH record is the last position of current segment, so we
> > would still need to check if the current insert LSN matches the
> > beginning of a new segment and if the last segment was forcibly
> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> > Thoughts?
>
> I haven't given up on this patch yet, and putting again my head on
> this problem I have finished with the patch attached, which checks if
> the current insert LSN position is at the beginning of a segment that
> has just been switched to decide if a standby snapshot should be
> logged or not. This allows bringing back an idle system to the pre-9.3
> state where a segment would be archived in the case of a low
> archive_timeout only when a checkpoint has been issued on the system.
>
Won't this be a problem if the checkpoint occurs after a long time and in
the mean time there is some activity in the server?
Another idea to solve this issue could be to see if there is any progress
in the server by checking buffers dirtied/written (we can refer that
information using pgBufferUsage) since last time we log this record in
bgwriter.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote: >> >> I thought about something like that at some point by saving a minimum >> >> activity pointer in XLogCtl, updated each time a segment was forcibly >> >> switched or after inserting a checkpoint record. Then the bgwriter >> >> looked at if the current insert position matched this minimum activity >> >> pointer, skipping LogStandbySnapshot if both positions match. Does >> >> this match your line of thoughts? >> > >> > Looking at the code, it occurred to me that the LSN position saved for >> > a XLOG_SWITCH record is the last position of current segment, so we >> > would still need to check if the current insert LSN matches the >> > beginning of a new segment and if the last segment was forcibly >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example. >> > Thoughts? >> >> I haven't given up on this patch yet, and putting again my head on >> this problem I have finished with the patch attached, which checks if >> the current insert LSN position is at the beginning of a segment that >> has just been switched to decide if a standby snapshot should be >> logged or not. This allows bringing back an idle system to the pre-9.3 >> state where a segment would be archived in the case of a low >> archive_timeout only when a checkpoint has been issued on the system. >> > > Won't this be a problem if the checkpoint occurs after a long time and in > the mean time there is some activity in the server? Why? If there is some activity on the server, the snapshot will be immediately taken at the next iteration without caring about the checkpoint. > Another idea to solve this issue could be to see if there is any progress > in the server by checking buffers dirtied/written (we can refer that > information using pgBufferUsage) since last time we log this record in > bgwriter. Yes, that may be an idea worth considering, but I really think that we had better measure that at WAL level.. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> >> <michael.paquier@gmail.com> wrote:
> >> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> >> I thought about something like that at some point by saving a minimum
> >> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> >> looked at if the current insert position matched this minimum activity
> >> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> >> this match your line of thoughts?
> >> >
> >> > Looking at the code, it occurred to me that the LSN position saved for
> >> > a XLOG_SWITCH record is the last position of current segment, so we
> >> > would still need to check if the current insert LSN matches the
> >> > beginning of a new segment and if the last segment was forcibly
> >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> >> > Thoughts?
> >>
> >> I haven't given up on this patch yet, and putting again my head on
> >> this problem I have finished with the patch attached, which checks if
> >> the current insert LSN position is at the beginning of a segment that
> >> has just been switched to decide if a standby snapshot should be
> >> logged or not. This allows bringing back an idle system to the pre-9.3
> >> state where a segment would be archived in the case of a low
> >> archive_timeout only when a checkpoint has been issued on the system.
> >>
> >
> > Won't this be a problem if the checkpoint occurs after a long time and in
> > the mean time there is some activity in the server?
>
> Why? If there is some activity on the server, the snapshot will be
> immediately taken at the next iteration without caring about the
> checkpoint.
>
>
> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> >> <michael.paquier@gmail.com> wrote:
> >> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> >> I thought about something like that at some point by saving a minimum
> >> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> >> looked at if the current insert position matched this minimum activity
> >> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> >> this match your line of thoughts?
> >> >
> >> > Looking at the code, it occurred to me that the LSN position saved for
> >> > a XLOG_SWITCH record is the last position of current segment, so we
> >> > would still need to check if the current insert LSN matches the
> >> > beginning of a new segment and if the last segment was forcibly
> >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> >> > Thoughts?
> >>
> >> I haven't given up on this patch yet, and putting again my head on
> >> this problem I have finished with the patch attached, which checks if
> >> the current insert LSN position is at the beginning of a segment that
> >> has just been switched to decide if a standby snapshot should be
> >> logged or not. This allows bringing back an idle system to the pre-9.3
> >> state where a segment would be archived in the case of a low
> >> archive_timeout only when a checkpoint has been issued on the system.
> >>
> >
> > Won't this be a problem if the checkpoint occurs after a long time and in
> > the mean time there is some activity in the server?
>
> Why? If there is some activity on the server, the snapshot will be
> immediately taken at the next iteration without caring about the
> checkpoint.
>
+ (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
Do you mean to intend that it is protected by above check in the
patch?
Isn't it possible that so much WAL is inserted between bgwriter cycles,
that when it checks the location of WAL, it founds it to be at the beginning
of a new segment?
> > Another idea to solve this issue could be to see if there is any progress
> > in the server by checking buffers dirtied/written (we can refer that
> > information using pgBufferUsage) since last time we log this record in
> > bgwriter.
>
> Yes, that may be an idea worth considering, but I really think that we
> had better measure that at WAL level..
>
> > in the server by checking buffers dirtied/written (we can refer that
> > information using pgBufferUsage) since last time we log this record in
> > bgwriter.
>
> Yes, that may be an idea worth considering, but I really think that we
> had better measure that at WAL level..
>
I thought this is quite close to the previous patch you proposed where
Andres wanted some measurement in terms of progress since last
checkpoint. I understand as per current code your patch can work, but
what if some more similar WAL records needs to be added?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier >> > <michael.paquier@gmail.com> wrote: >> >> > Looking at the code, it occurred to me that the LSN position saved >> >> > for >> >> > a XLOG_SWITCH record is the last position of current segment, so we >> >> > would still need to check if the current insert LSN matches the >> >> > beginning of a new segment and if the last segment was forcibly >> >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for >> >> > example. >> >> > Thoughts? >> >> >> >> I haven't given up on this patch yet, and putting again my head on >> >> this problem I have finished with the patch attached, which checks if >> >> the current insert LSN position is at the beginning of a segment that >> >> has just been switched to decide if a standby snapshot should be >> >> logged or not. This allows bringing back an idle system to the pre-9.3 >> >> state where a segment would be archived in the case of a low >> >> archive_timeout only when a checkpoint has been issued on the system. >> >> >> > >> > Won't this be a problem if the checkpoint occurs after a long time and >> > in >> > the mean time there is some activity in the server? >> >> Why? If there is some activity on the server, the snapshot will be >> immediately taken at the next iteration without caring about the >> checkpoint. >> > > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD)) > > Do you mean to intend that it is protected by above check in the > patch? Yep, in combination with is_switch_current to check if the last completed segment was forcibly switched. > Isn't it possible that so much WAL is inserted between bgwriter cycles, > that when it checks the location of WAL, it founds it to be at the beginning > of a new segment? Er, no. Even if the insert LSN is at the beginning of a new segment, this would take a standby snapshot if the last segment was not forcibly switched. >> > Another idea to solve this issue could be to see if there is any >> > progress >> > in the server by checking buffers dirtied/written (we can refer that >> > information using pgBufferUsage) since last time we log this record in >> > bgwriter. >> >> Yes, that may be an idea worth considering, but I really think that we >> had better measure that at WAL level.. > > I thought this is quite close to the previous patch you proposed where > Andres wanted some measurement in terms of progress since last > checkpoint. I understand as per current code your patch can work, but > what if some more similar WAL records needs to be added? A record like this one for the bgwriter? We could still rely on the same check tracking the last-forced-segment, no? It seems to me that tracking progress here is not really only a matter of the number of shared buffers dirtied or written, we would need as well to track XLOG_STANDBY_LOCK and provide for them fresh snapshots. Imagine for example a read-only load on the master with some exclusive locks taken on relations from time to time (perhaps unlikely so but who knows?). -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Won't this be a problem if the checkpoint occurs after a long time and
> >> > in
> >> > the mean time there is some activity in the server?
> >>
> >> Why? If there is some activity on the server, the snapshot will be
> >> immediately taken at the next iteration without caring about the
> >> checkpoint.
> >>
> >
> > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
> >
> > Do you mean to intend that it is protected by above check in the
> > patch?
>
> Yep, in combination with is_switch_current to check if the last
> completed segment was forcibly switched.
>
> > Isn't it possible that so much WAL is inserted between bgwriter cycles,
> > that when it checks the location of WAL, it founds it to be at the beginning
> > of a new segment?
>
> Er, no. Even if the insert LSN is at the beginning of a new segment,
> this would take a standby snapshot if the last segment was not
> forcibly switched.
>
So here if I understand correctly the check related to the last segment
>
> On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Won't this be a problem if the checkpoint occurs after a long time and
> >> > in
> >> > the mean time there is some activity in the server?
> >>
> >> Why? If there is some activity on the server, the snapshot will be
> >> immediately taken at the next iteration without caring about the
> >> checkpoint.
> >>
> >
> > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
> >
> > Do you mean to intend that it is protected by above check in the
> > patch?
>
> Yep, in combination with is_switch_current to check if the last
> completed segment was forcibly switched.
>
> > Isn't it possible that so much WAL is inserted between bgwriter cycles,
> > that when it checks the location of WAL, it founds it to be at the beginning
> > of a new segment?
>
> Er, no. Even if the insert LSN is at the beginning of a new segment,
> this would take a standby snapshot if the last segment was not
> forcibly switched.
>
So here if I understand correctly the check related to the last segment
forcibly switched is based on the fact that if it is forcibly switched, then
we don't need to log this record? What is the reason of such an
assumption? It is not very clear by reading the comments you have
added in patch, may be you can expand comments slightly to explain
the reason of same.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier >> > <michael.paquier@gmail.com> >> > wrote: >> >> >> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> >> >> wrote: >> >> > Won't this be a problem if the checkpoint occurs after a long time >> >> > and >> >> > in >> >> > the mean time there is some activity in the server? >> >> >> >> Why? If there is some activity on the server, the snapshot will be >> >> immediately taken at the next iteration without caring about the >> >> checkpoint. >> >> >> > >> > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD)) >> > >> > Do you mean to intend that it is protected by above check in the >> > patch? >> >> Yep, in combination with is_switch_current to check if the last >> completed segment was forcibly switched. >> >> > Isn't it possible that so much WAL is inserted between bgwriter cycles, >> > that when it checks the location of WAL, it founds it to be at the >> > beginning >> > of a new segment? >> >> Er, no. Even if the insert LSN is at the beginning of a new segment, >> this would take a standby snapshot if the last segment was not >> forcibly switched. >> > > So here if I understand correctly the check related to the last segment > forcibly switched is based on the fact that if it is forcibly switched, then > we don't need to log this record? What is the reason of such an > assumption? Yes, the thing is that, as mentioned at the beginning of the thread, a low value of archive_timeout causes a segment to be forcibly switched at the end of the timeout defined by this parameter. In combination with the standby snapshot taken in bgwriter since 9.4, this is causing segments to be always switched even if a system is completely idle. Before that, in 9.3 and older versions, we would have a segment forcibly switched on an idle system only once per checkpoint. The documentation is already clear about that actually. The whole point of this patch is to fix this regression, aka switch to a new segment only once per checkpoint. > It is not very clear by reading the comments you have > added in patch, may be you can expand comments slightly to explain > the reason of same. OK. Here are the comments that this patch is adding: - * only log if enough time has passed and some xlog record has - * been inserted. + * Only log if enough time has passed and some xlog record has + * been inserted on a new segment. On an idle system where + * segments can be archived in a fast pace with for example a + * low archive_command setting, avoid as well logging a new + * standby snapshot if the current insert position is still + * at the beginning of the segment that has just been switched. + * + * It is possible that GetXLogLastSwitchPtr points to the last + * position of previous segment or to the first position of the + * new segment after the switch, hence take both cases into + * account when deciding if a standby snapshot should be taken. + * (see comments on top of RequestXLogSwitch for more details). */ It makes mention of the problem that it is trying to fix, perhaps you mean that this should mention the fact that on an idle system standby snapshots should happen at worst once per checkpoint? -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >>
> >
> > So here if I understand correctly the check related to the last segment
> > forcibly switched is based on the fact that if it is forcibly switched, then
> > we don't need to log this record? What is the reason of such an
> > assumption?
>
> Yes, the thing is that, as mentioned at the beginning of the thread, a
> low value of archive_timeout causes a segment to be forcibly switched
> at the end of the timeout defined by this parameter. In combination
> with the standby snapshot taken in bgwriter since 9.4, this is causing
> segments to be always switched even if a system is completely idle.
> Before that, in 9.3 and older versions, we would have a segment
> forcibly switched on an idle system only once per checkpoint.
>
Okay, so this will fix the case where the system is idle, but what I
>
> The
> documentation is already clear about that actually. The whole point of
> this patch is to fix this regression, aka switch to a new segment only
> once per checkpoint.
>
> > It is not very clear by reading the comments you have
> > added in patch, may be you can expand comments slightly to explain
> > the reason of same.
>
> OK. Here are the comments that this patch is adding:
> - * only log if enough time has passed and some xlog record has
> - * been inserted.
> + * Only log if enough time has passed and some xlog record has
> + * been inserted on a new segment. On an idle system where
> + * segments can be archived in a fast pace with for example a
> + * low archive_command setting, avoid as well logging a new
> + * standby snapshot if the current insert position is still
> + * at the beginning of the segment that has just been switched.
> + *
> + * It is possible that GetXLogLastSwitchPtr points to the last
> + * position of previous segment or to the first position of the
> + * new segment after the switch, hence take both cases into
> + * account when deciding if a standby snapshot should be taken.
> + * (see comments on top of RequestXLogSwitch for more details).
> */
> It makes mention of the problem that it is trying to fix, perhaps you
> mean that this should mention the fact that on an idle system standby
> snapshots should happen at worst once per checkpoint?
>
>
> On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >>
> >
> > So here if I understand correctly the check related to the last segment
> > forcibly switched is based on the fact that if it is forcibly switched, then
> > we don't need to log this record? What is the reason of such an
> > assumption?
>
> Yes, the thing is that, as mentioned at the beginning of the thread, a
> low value of archive_timeout causes a segment to be forcibly switched
> at the end of the timeout defined by this parameter. In combination
> with the standby snapshot taken in bgwriter since 9.4, this is causing
> segments to be always switched even if a system is completely idle.
> Before that, in 9.3 and older versions, we would have a segment
> forcibly switched on an idle system only once per checkpoint.
>
Okay, so this will fix the case where the system is idle, but what I
am slightly worried is that it should not miss to log the standby snapshot
due to this check when there is actually some activity in the system.
Why is not possible to have a case such that the segment is forcibly
switched due to reason other than checkpoint (user has requested the
same) and the current insert LSN is at beginning of a new segment
due to write activity? If that is possible, which to me theoretically seems
possible, then I think bgwriter will miss to log the standby snapshot.
>
> The
> documentation is already clear about that actually. The whole point of
> this patch is to fix this regression, aka switch to a new segment only
> once per checkpoint.
>
> > It is not very clear by reading the comments you have
> > added in patch, may be you can expand comments slightly to explain
> > the reason of same.
>
> OK. Here are the comments that this patch is adding:
> - * only log if enough time has passed and some xlog record has
> - * been inserted.
> + * Only log if enough time has passed and some xlog record has
> + * been inserted on a new segment. On an idle system where
> + * segments can be archived in a fast pace with for example a
> + * low archive_command setting, avoid as well logging a new
> + * standby snapshot if the current insert position is still
> + * at the beginning of the segment that has just been switched.
> + *
> + * It is possible that GetXLogLastSwitchPtr points to the last
> + * position of previous segment or to the first position of the
> + * new segment after the switch, hence take both cases into
> + * account when deciding if a standby snapshot should be taken.
> + * (see comments on top of RequestXLogSwitch for more details).
> */
> It makes mention of the problem that it is trying to fix, perhaps you
> mean that this should mention the fact that on an idle system standby
> snapshots should happen at worst once per checkpoint?
>
I mean to say that in below part of comment, explanation about the
the check related to insert position is quite clear whereas why it is
okay to avoid logging standby snapshot when the segment is not
forcibly switched is not apparent.
* avoid as well logging a new
* standby snapshot if the current insert position is still
* at the beginning of the segment that has just been switched.
* standby snapshot if the current insert position is still
* at the beginning of the segment that has just been switched.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2015-12-21 16:26:25 +0900, Michael Paquier wrote: > On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > Speaking of which, this patch was registered in this CF, I am moving > > it to the next as a bug fix. > > I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it > is possible that the return LSN pointer is on the new segment that has > been forcibly archived if RequestXLogSwitch is called multiple times > and that subsequent calls are not necessary. Patch updated. I find this patch rather unsatisfactory. Yes, it kinda solves the problem of archive timeout, but it leaves the bigger and longer standing problems of unneccessary checkpoints with wal_level=hs in place. It's also somewhat fragile in my opinion. I think we rather want a per backend (or perhaps per wal insertion lock) flag that says 'last relevant record inserted at', and a way to not set that during insertion. Then during a checkpoint or the relevant bgwriter code, we look wether anything relevant happened in any backend since the last time we performed a checkpoint/logged a running xacts snapshot. Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Mon, Jan 18, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-12-21 16:26:25 +0900, Michael Paquier wrote:
> > On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> > > Speaking of which, this patch was registered in this CF, I am moving
> > > it to the next as a bug fix.
> >
> > I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it
> > is possible that the return LSN pointer is on the new segment that has
> > been forcibly archived if RequestXLogSwitch is called multiple times
> > and that subsequent calls are not necessary. Patch updated.
>
> I find this patch rather unsatisfactory. Yes, it kinda solves the
> problem of archive timeout, but it leaves the bigger and longer standing
> problems of unneccessary checkpoints with wal_level=hs in place. It's
> also somewhat fragile in my opinion.
>
> I think we rather want a per backend (or perhaps per wal insertion lock)
> flag that says 'last relevant record inserted at', and a way to not set
> that during insertion. Then during a checkpoint or the relevant bgwriter
> code, we look wether anything relevant happened in any backend since the
> last time we performed a checkpoint/logged a running xacts snapshot.
>
>
> On 2015-12-21 16:26:25 +0900, Michael Paquier wrote:
> > On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> > > Speaking of which, this patch was registered in this CF, I am moving
> > > it to the next as a bug fix.
> >
> > I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it
> > is possible that the return LSN pointer is on the new segment that has
> > been forcibly archived if RequestXLogSwitch is called multiple times
> > and that subsequent calls are not necessary. Patch updated.
>
> I find this patch rather unsatisfactory. Yes, it kinda solves the
> problem of archive timeout, but it leaves the bigger and longer standing
> problems of unneccessary checkpoints with wal_level=hs in place. It's
> also somewhat fragile in my opinion.
>
> I think we rather want a per backend (or perhaps per wal insertion lock)
> flag that says 'last relevant record inserted at', and a way to not set
> that during insertion. Then during a checkpoint or the relevant bgwriter
> code, we look wether anything relevant happened in any backend since the
> last time we performed a checkpoint/logged a running xacts snapshot.
>
Sounds to be a more robust way of dealing with this problem. Michael,
if you don't disagree with above proposal, then we can mark this patch
as Waiting on Author?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> >> On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier >> > <michael.paquier@gmail.com> >> > wrote: >> >> >> >> >> >> >> > >> > So here if I understand correctly the check related to the last segment >> > forcibly switched is based on the fact that if it is forcibly switched, >> > then >> > we don't need to log this record? What is the reason of such an >> > assumption? >> >> Yes, the thing is that, as mentioned at the beginning of the thread, a >> low value of archive_timeout causes a segment to be forcibly switched >> at the end of the timeout defined by this parameter. In combination >> with the standby snapshot taken in bgwriter since 9.4, this is causing >> segments to be always switched even if a system is completely idle. >> Before that, in 9.3 and older versions, we would have a segment >> forcibly switched on an idle system only once per checkpoint. >> > > Okay, so this will fix the case where the system is idle, but what I > am slightly worried is that it should not miss to log the standby snapshot > due to this check when there is actually some activity in the system. > Why is not possible to have a case such that the segment is forcibly > switched due to reason other than checkpoint (user has requested the > same) and the current insert LSN is at beginning of a new segment > due to write activity? If that is possible, which to me theoretically seems > possible, then I think bgwriter will miss to log the standby snapshot. Yes, with segments forcibly switched by users this check would make the bgwriter not log in a snapshot if the last action done by server was XLOG_SWITCH. Based on the patch I sent, the only alternative would be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine enough for back-branches.. >> The >> documentation is already clear about that actually. The whole point of >> this patch is to fix this regression, aka switch to a new segment only >> once per checkpoint. >> >> > It is not very clear by reading the comments you have >> > added in patch, may be you can expand comments slightly to explain >> > the reason of same. >> >> OK. Here are the comments that this patch is adding: >> - * only log if enough time has passed and some xlog record >> has >> - * been inserted. >> + * Only log if enough time has passed and some xlog record >> has >> + * been inserted on a new segment. On an idle system where >> + * segments can be archived in a fast pace with for example a >> + * low archive_command setting, avoid as well logging a new >> + * standby snapshot if the current insert position is still >> + * at the beginning of the segment that has just been >> switched. >> + * >> + * It is possible that GetXLogLastSwitchPtr points to the >> last >> + * position of previous segment or to the first position of >> the >> + * new segment after the switch, hence take both cases into >> + * account when deciding if a standby snapshot should be >> taken. >> + * (see comments on top of RequestXLogSwitch for more >> details). >> */ >> It makes mention of the problem that it is trying to fix, perhaps you >> mean that this should mention the fact that on an idle system standby >> snapshots should happen at worst once per checkpoint? >> > > I mean to say that in below part of comment, explanation about the > the check related to insert position is quite clear whereas why it is > okay to avoid logging standby snapshot when the segment is not > forcibly switched is not apparent. > > * avoid as well logging a new > * standby snapshot if the current insert position is still > * at the beginning of the segment that has just been switched. Perhaps: "avoid as well logging a new standby snapshot if the current insert position is at the beginning of a segment that has been *forcibly* switched at checkpoint or by archive_timeout"? -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Jan 19, 2016 at 1:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Jan 18, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote: >> I find this patch rather unsatisfactory. Yes, it kinda solves the >> problem of archive timeout, but it leaves the bigger and longer standing >> problems of unneccessary checkpoints with wal_level=hs in place. It's >> also somewhat fragile in my opinion. Check. >> I think we rather want a per backend (or perhaps per wal insertion lock) >> flag that says 'last relevant record inserted at', and a way to not set >> that during insertion. Then during a checkpoint or the relevant bgwriter >> code, we look wether anything relevant happened in any backend since the >> last time we performed a checkpoint/logged a running xacts snapshot. And in this case, the last relevant record would be caused by a forced segment switch or a checkpoint record, right? Doing that per WAL insertion lock seems more scalable to me. I haven't looked at the code yet though to see how that would work out. > Sounds to be a more robust way of dealing with this problem. Michael, > if you don't disagree with above proposal, then we can mark this patch > as Waiting on Author? Yeah let's do so. I'll think more about this thing. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Tue, Jan 19, 2016 at 12:41 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> Yes, the thing is that, as mentioned at the beginning of the thread, a
> >> low value of archive_timeout causes a segment to be forcibly switched
> >> at the end of the timeout defined by this parameter. In combination
> >> with the standby snapshot taken in bgwriter since 9.4, this is causing
> >> segments to be always switched even if a system is completely idle.
> >> Before that, in 9.3 and older versions, we would have a segment
> >> forcibly switched on an idle system only once per checkpoint.
> >>
> >
> > Okay, so this will fix the case where the system is idle, but what I
> > am slightly worried is that it should not miss to log the standby snapshot
> > due to this check when there is actually some activity in the system.
> > Why is not possible to have a case such that the segment is forcibly
> > switched due to reason other than checkpoint (user has requested the
> > same) and the current insert LSN is at beginning of a new segment
> > due to write activity? If that is possible, which to me theoretically seems
> > possible, then I think bgwriter will miss to log the standby snapshot.
>
> Yes, with segments forcibly switched by users this check would make
> the bgwriter not log in a snapshot if the last action done by server
> was XLOG_SWITCH. Based on the patch I sent, the only alternative would
> be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine
> enough for back-branches..
>
>
> On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> Yes, the thing is that, as mentioned at the beginning of the thread, a
> >> low value of archive_timeout causes a segment to be forcibly switched
> >> at the end of the timeout defined by this parameter. In combination
> >> with the standby snapshot taken in bgwriter since 9.4, this is causing
> >> segments to be always switched even if a system is completely idle.
> >> Before that, in 9.3 and older versions, we would have a segment
> >> forcibly switched on an idle system only once per checkpoint.
> >>
> >
> > Okay, so this will fix the case where the system is idle, but what I
> > am slightly worried is that it should not miss to log the standby snapshot
> > due to this check when there is actually some activity in the system.
> > Why is not possible to have a case such that the segment is forcibly
> > switched due to reason other than checkpoint (user has requested the
> > same) and the current insert LSN is at beginning of a new segment
> > due to write activity? If that is possible, which to me theoretically seems
> > possible, then I think bgwriter will miss to log the standby snapshot.
>
> Yes, with segments forcibly switched by users this check would make
> the bgwriter not log in a snapshot if the last action done by server
> was XLOG_SWITCH. Based on the patch I sent, the only alternative would
> be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine
> enough for back-branches..
>
Yeah, that can work, but I think the code won't look too clean. I think
lets try out something on lines of what is suggested by Andres and
see how it turns out.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Jan 19, 2016 at 5:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 19, 2016 at 12:41 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> >> On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier >> > <michael.paquier@gmail.com> wrote: >> >> Yes, the thing is that, as mentioned at the beginning of the thread, a >> >> low value of archive_timeout causes a segment to be forcibly switched >> >> at the end of the timeout defined by this parameter. In combination >> >> with the standby snapshot taken in bgwriter since 9.4, this is causing >> >> segments to be always switched even if a system is completely idle. >> >> Before that, in 9.3 and older versions, we would have a segment >> >> forcibly switched on an idle system only once per checkpoint. >> >> >> > >> > Okay, so this will fix the case where the system is idle, but what I >> > am slightly worried is that it should not miss to log the standby >> > snapshot >> > due to this check when there is actually some activity in the system. >> > Why is not possible to have a case such that the segment is forcibly >> > switched due to reason other than checkpoint (user has requested the >> > same) and the current insert LSN is at beginning of a new segment >> > due to write activity? If that is possible, which to me theoretically >> > seems >> > possible, then I think bgwriter will miss to log the standby snapshot. >> >> Yes, with segments forcibly switched by users this check would make >> the bgwriter not log in a snapshot if the last action done by server >> was XLOG_SWITCH. Based on the patch I sent, the only alternative would >> be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine >> enough for back-branches.. >> > > Yeah, that can work, but I think the code won't look too clean. I think > lets try out something on lines of what is suggested by Andres and > see how it turns out. OK, so as a first step and after thinking about the whole for a while, I have finished with the patch attached. This patch is aimed at avoiding unnecessary checkpoints on idle systems when wal_level >= hot_standby by centralizing the check to look at if there has some WAL activity since the last checkpoint. This way, the system does not generate anymore standby-related records if no activity has occurred. I think that this is a good step forward, still it does not address yet the problem of segments forcibly switched, particularly when archive_timeout has a low value. In order to address the other problem with switched segments, I think that we could extend the routine XLOGHasActivity that the patch attached introduces with some more fine-grained checks. After looking at the code I think as well that we had better save into shared memory the last LSN position that was forcibly switched and make use of that in the bgwriter. So, thoughts about the attached? -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-01-28 16:40:13 +0900, Michael Paquier wrote: > OK, so as a first step and after thinking about the whole for a while, > I have finished with the patch attached. This patch is aimed at > avoiding unnecessary checkpoints on idle systems when wal_level >= > hot_standby by centralizing the check to look at if there has some WAL > activity since the last checkpoint. That's not what I suggested. > /* > + * XLOGHasActivity -- Check if XLOG had some significant activity or > + * if it is idle lately. This is primarily used to check if there has > + * been some WAL activity since the last checkpoint that occurred on > + * system to control the generaton of XLOG record related to standbys. > + */ > +bool > +XLOGHasActivity(void) > +{ > + XLogCtlInsert *Insert = &XLogCtl->Insert; > + XLogRecPtr redo_lsn = ControlFile->checkPointCopy.redo; > + uint64 prev_bytepos; > + > + /* Check if any activity has happened since last checkpoint */ > + SpinLockAcquire(&Insert->insertpos_lck); > + prev_bytepos = Insert->PrevBytePos; > + SpinLockRelease(&Insert->insertpos_lck); > + > + return XLogBytePosToRecPtr(prev_bytepos) == redo_lsn; > +} > How should this actually should work reliably, given we *want* to have included a standby snapshot after the last checkpoint? In CreateCheckPoint() we have/* * Here we update the shared RedoRecPtr for future XLogInsert calls; this * must be done whileholding all the insertion locks. *RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; computing the next redo rec ptr and thenif (!shutdown && XLogStandbyInfoActive()) LogStandbySnapshot(); before the finalXLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));recptr = XLogInsert(RM_XLOG_ID, shutdown ? XLOG_CHECKPOINT_SHUTDOWN : XLOG_CHECKPOINT_ONLINE); so the above condition doesn't really something we want to rely on. Am I missing what you're trying to do? Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Fri, Jan 29, 2016 at 5:13 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-01-28 16:40:13 +0900, Michael Paquier wrote: >> OK, so as a first step and after thinking about the whole for a while, >> I have finished with the patch attached. This patch is aimed at >> avoiding unnecessary checkpoints on idle systems when wal_level >= >> hot_standby by centralizing the check to look at if there has some WAL >> activity since the last checkpoint. > > That's not what I suggested. > >> /* >> + * XLOGHasActivity -- Check if XLOG had some significant activity or >> + * if it is idle lately. This is primarily used to check if there has >> + * been some WAL activity since the last checkpoint that occurred on >> + * system to control the generaton of XLOG record related to standbys. >> + */ >> +bool >> +XLOGHasActivity(void) >> +{ >> + XLogCtlInsert *Insert = &XLogCtl->Insert; >> + XLogRecPtr redo_lsn = ControlFile->checkPointCopy.redo; >> + uint64 prev_bytepos; >> + >> + /* Check if any activity has happened since last checkpoint */ >> + SpinLockAcquire(&Insert->insertpos_lck); >> + prev_bytepos = Insert->PrevBytePos; >> + SpinLockRelease(&Insert->insertpos_lck); >> + >> + return XLogBytePosToRecPtr(prev_bytepos) == redo_lsn; >> +} >> > > How should this actually should work reliably, given we *want* to have > included a standby snapshot after the last checkpoint? > > In CreateCheckPoint() we have > /* > * Here we update the shared RedoRecPtr for future XLogInsert calls; this > * must be done while holding all the insertion locks. > * > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > computing the next redo rec ptr and then > if (!shutdown && XLogStandbyInfoActive()) > LogStandbySnapshot(); > before the final > XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint)); > recptr = XLogInsert(RM_XLOG_ID, > shutdown ? XLOG_CHECKPOINT_SHUTDOWN : > XLOG_CHECKPOINT_ONLINE); > > so the above condition doesn't really something we want to rely on. Am I > missing what you're trying to do? Well, to put it short, I am just trying to find a way to make the backend skip unnecessary checkpoints on an idle system, which results in the following WAL pattern if system is completely idle: CHECKPOINT_ONLINE RUNNING_XACTS RUNNING_XACTS [etc..] The thing is that I am lost with the meaning of this condition to decide if a checkpoint should be skipped or not: if (prevPtr == ControlFile->checkPointCopy.redo && prevPtr/ XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) { WALInsertLockRelease(); LWLockRelease(CheckpointLock); return; } As at least one standby snapshot is logged before the checkpoint record, the redo position is never going to match the previous insert LSN, so checkpoints will never be skipped if wal_level >= hot_standby. Skipping such unnecessary checkpoints is what you would like to address, no? Because that's what I would like to do here first. And once we got that right, we could think about addressing the case where WAL segments are forcibly archived for idle systems. Perhaps I simply do not grab the meaning of what you defined as "relevant LSN" upthread. As I understand it, it would be a LSN position marker in shared memory that we could use for some decision-making regarding if a checkpoint or a standby snapshot record should be generated. I have for example played with an array in XLogCtl->Insert made of NUM_XLOGINSERT_LOCKS elements, each one protected by one insert lock that tracked the last insert LSN of a checkpoint record (I did that in XLogInsertRecord because XLogRecData makes that easy), then I used that to do this decision making in CreateCheckpoint() and in the bgwriter to decide if a standby snapshot should be logged or not. But then I noticed that it would be actually easier and cleaner to do this decision making using directly the checkpoint redo LSN and compare that with the previous insert LSN, then use that for the bgwriter code path, resulting in the WIP I just sent previously. Is what I am trying to do here biased with what you have in mind? Do we have a different meaning of the problem in mind? What are your ideas regarding this relevant LSN? -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote: > Well, to put it short, I am just trying to find a way to make the > backend skip unnecessary checkpoints on an idle system, which results > in the following WAL pattern if system is completely idle: > CHECKPOINT_ONLINE > RUNNING_XACTS > RUNNING_XACTS > [etc..] > > The thing is that I am lost with the meaning of this condition to > decide if a checkpoint should be skipped or not: > if (prevPtr == ControlFile->checkPointCopy.redo && > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) > { > WALInsertLockRelease(); > LWLockRelease(CheckpointLock); > return; > } > As at least one standby snapshot is logged before the checkpoint > record, the redo position is never going to match the previous insert > LSN, so checkpoints will never be skipped if wal_level >= hot_standby. > Skipping such unnecessary checkpoints is what you would like to > address, no? Because that's what I would like to do here first. And > once we got that right, we could think about addressing the case where > WAL segments are forcibly archived for idle systems. I have put a bit more of brain power into that, and finished with the patch attached. A new field called chkpProgressLSN is attached to XLogCtl, which is to the current insert position of the checkpoint when wal_level <= archive, or the LSN position of the standby snapshot taken after a checkpoint. The bgwriter code is modified as well so as it uses this progress LSN and compares it with the current insert LSN to determine if a standby snapshot should be logged or not. On an instance of Postgres completely idle, no useless checkpoints, and no useless standby snapshots are generated anymore. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Jan 30, 2016 at 11:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote: >> Well, to put it short, I am just trying to find a way to make the >> backend skip unnecessary checkpoints on an idle system, which results >> in the following WAL pattern if system is completely idle: >> CHECKPOINT_ONLINE >> RUNNING_XACTS >> RUNNING_XACTS >> [etc..] >> >> The thing is that I am lost with the meaning of this condition to >> decide if a checkpoint should be skipped or not: >> if (prevPtr == ControlFile->checkPointCopy.redo && >> prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) >> { >> WALInsertLockRelease(); >> LWLockRelease(CheckpointLock); >> return; >> } >> As at least one standby snapshot is logged before the checkpoint >> record, the redo position is never going to match the previous insert >> LSN, so checkpoints will never be skipped if wal_level >= hot_standby. >> Skipping such unnecessary checkpoints is what you would like to >> address, no? Because that's what I would like to do here first. And >> once we got that right, we could think about addressing the case where >> WAL segments are forcibly archived for idle systems. > > I have put a bit more of brain power into that, and finished with the > patch attached. A new field called chkpProgressLSN is attached to > XLogCtl, which is to the current insert position of the checkpoint > when wal_level <= archive, or the LSN position of the standby snapshot > taken after a checkpoint. The bgwriter code is modified as well so as > it uses this progress LSN and compares it with the current insert LSN > to determine if a standby snapshot should be logged or not. On an > instance of Postgres completely idle, no useless checkpoints, and no > useless standby snapshots are generated anymore. Attached is an additional patch that I have used for my tests (should have sent that yesterday). This just shows up a couple of logs in the bgwriter and around checkpoint to see in more details their activity with not that much noise. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
> > Well, to put it short, I am just trying to find a way to make the
> > backend skip unnecessary checkpoints on an idle system, which results
> > in the following WAL pattern if system is completely idle:
> > CHECKPOINT_ONLINE
> > RUNNING_XACTS
> > RUNNING_XACTS
> > [etc..]
> >
> > The thing is that I am lost with the meaning of this condition to
> > decide if a checkpoint should be skipped or not:
> > if (prevPtr == ControlFile->checkPointCopy.redo &&
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> > WALInsertLockRelease();
> > LWLockRelease(CheckpointLock);
> > return;
> > }
> > As at least one standby snapshot is logged before the checkpoint
> > record, the redo position is never going to match the previous insert
> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
> > Skipping such unnecessary checkpoints is what you would like to
> > address, no? Because that's what I would like to do here first. And
> > once we got that right, we could think about addressing the case where
> > WAL segments are forcibly archived for idle systems.
>
> I have put a bit more of brain power into that, and finished with the
> patch attached. A new field called chkpProgressLSN is attached to
> XLogCtl, which is to the current insert position of the checkpoint
> when wal_level <= archive, or the LSN position of the standby snapshot
> taken after a checkpoint. The bgwriter code is modified as well so as
> it uses this progress LSN and compares it with the current insert LSN
> to determine if a standby snapshot should be logged or not. On an
> instance of Postgres completely idle, no useless checkpoints, and no
> useless standby snapshots are generated anymore.
>
>
> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
> > Well, to put it short, I am just trying to find a way to make the
> > backend skip unnecessary checkpoints on an idle system, which results
> > in the following WAL pattern if system is completely idle:
> > CHECKPOINT_ONLINE
> > RUNNING_XACTS
> > RUNNING_XACTS
> > [etc..]
> >
> > The thing is that I am lost with the meaning of this condition to
> > decide if a checkpoint should be skipped or not:
> > if (prevPtr == ControlFile->checkPointCopy.redo &&
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> > WALInsertLockRelease();
> > LWLockRelease(CheckpointLock);
> > return;
> > }
> > As at least one standby snapshot is logged before the checkpoint
> > record, the redo position is never going to match the previous insert
> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
> > Skipping such unnecessary checkpoints is what you would like to
> > address, no? Because that's what I would like to do here first. And
> > once we got that right, we could think about addressing the case where
> > WAL segments are forcibly archived for idle systems.
>
> I have put a bit more of brain power into that, and finished with the
> patch attached. A new field called chkpProgressLSN is attached to
> XLogCtl, which is to the current insert position of the checkpoint
> when wal_level <= archive, or the LSN position of the standby snapshot
> taken after a checkpoint. The bgwriter code is modified as well so as
> it uses this progress LSN and compares it with the current insert LSN
> to determine if a standby snapshot should be logged or not. On an
> instance of Postgres completely idle, no useless checkpoints, and no
> useless standby snapshots are generated anymore.
>
@@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
if ((flags & (CHECKPOINT_IS_SHUTDOWN |
CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FORCE)) == 0)
{
- if
(prevPtr == ControlFile->checkPointCopy.redo &&
+ if (GetProgressRecPtr() == prevPtr &&
prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
{
I think such a check won't consider all the WAL-activity happened
during checkpoint (after checkpoint start, till it finishes) which was
not the intention of this code without your patch.
I think both this and previous patch (hs-checkpoints-v1 ) approach
won't fix the issue in all kind of scenario's.
Let me try to explain what I think should fix this issue based on
discussion above, suggestions by Andres and some of my own
thoughts:
Have a new variable in WALInsertLock that would indicate the last
insertion position (last_insert_pos) we write to after holding that lock.
Ensure that we don't update last_insert_pos while logging standby
snapshot (simple way is to pass a flag in XLogInsert or distinguish
it based on type of record (XLOG_RUNNING_XACTS) or if you can
think of a more smarter way). Now both during checkpoint and
in bgwriter, to find out whether there is any activity since last
time, we need to check all the WALInsertLocks for latest insert position
(by referring last_insert_pos) and compare it with the latest position
we have noted during last such check and decide whether to proceed
or not based on comparison result. If you think it is not adequate to
store last_insert_pos in WALInsertLock, then we can think of having
it in PGPROC.
Yet another idea that occurs to me this morning is that why not
have a variable in shared memory in XLogCtlInsert or XLogCtl
similar to CurrBytePos/PrevBytePos which will be updated on
each XLOGInsert apart from the XLOGInsert for standby snapshots
and use that in a patch somewhat close to what you have in
hs-checkpoints-v1.
One related point is don't we need to bother about activity on
unlogged relations for checkpoints to occur, considering the
case when the only activity happened after last checkpoint
is on unlogged relations?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-02 10:12:25 +0530, Amit Kapila wrote: > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags) > if ((flags & (CHECKPOINT_IS_SHUTDOWN | > CHECKPOINT_END_OF_RECOVERY | > CHECKPOINT_FORCE)) == 0) > { > - if > (prevPtr == ControlFile->checkPointCopy.redo && > + if (GetProgressRecPtr() == prevPtr && > > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) > { > > I think such a check won't consider all the WAL-activity happened > during checkpoint (after checkpoint start, till it finishes) which was > not the intention of this code without your patch. Precisely. > I think both this and previous patch (hs-checkpoints-v1 ) approach > won't fix the issue in all kind of scenario's. Agreed. > Let me try to explain what I think should fix this issue based on > discussion above, suggestions by Andres and some of my own > thoughts: > Have a new variable in WALInsertLock that would indicate the last > insertion position (last_insert_pos) we write to after holding that lock. > Ensure that we don't update last_insert_pos while logging standby > snapshot (simple way is to pass a flag in XLogInsert or distinguish > it based on type of record (XLOG_RUNNING_XACTS) or if you can > think of a more smarter way). Now both during checkpoint and > in bgwriter, to find out whether there is any activity since last > time, we need to check all the WALInsertLocks for latest insert position > (by referring last_insert_pos) and compare it with the latest position > we have noted during last such check and decide whether to proceed > or not based on comparison result. If you think it is not adequate to > store last_insert_pos in WALInsertLock, then we can think of having > it in PGPROC. Yes, that's pretty much what I was thinking of. > Yet another idea that occurs to me this morning is that why not > have a variable in shared memory in XLogCtlInsert or XLogCtl > similar to CurrBytePos/PrevBytePos which will be updated on > each XLOGInsert apart from the XLOGInsert for standby snapshots > and use that in a patch somewhat close to what you have in > hs-checkpoints-v1. That'll require holding locks longer... Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Feb 2, 2016 at 1:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote: >> > Well, to put it short, I am just trying to find a way to make the >> > backend skip unnecessary checkpoints on an idle system, which results >> > in the following WAL pattern if system is completely idle: >> > CHECKPOINT_ONLINE >> > RUNNING_XACTS >> > RUNNING_XACTS >> > [etc..] >> > >> > The thing is that I am lost with the meaning of this condition to >> > decide if a checkpoint should be skipped or not: >> > if (prevPtr == ControlFile->checkPointCopy.redo && >> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) >> > { >> > WALInsertLockRelease(); >> > LWLockRelease(CheckpointLock); >> > return; >> > } >> > As at least one standby snapshot is logged before the checkpoint >> > record, the redo position is never going to match the previous insert >> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby. >> > Skipping such unnecessary checkpoints is what you would like to >> > address, no? Because that's what I would like to do here first. And >> > once we got that right, we could think about addressing the case where >> > WAL segments are forcibly archived for idle systems. >> >> I have put a bit more of brain power into that, and finished with the >> patch attached. A new field called chkpProgressLSN is attached to >> XLogCtl, which is to the current insert position of the checkpoint >> when wal_level <= archive, or the LSN position of the standby snapshot >> taken after a checkpoint. The bgwriter code is modified as well so as >> it uses this progress LSN and compares it with the current insert LSN >> to determine if a standby snapshot should be logged or not. On an >> instance of Postgres completely idle, no useless checkpoints, and no >> useless standby snapshots are generated anymore. >> > > > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags) > if ((flags & (CHECKPOINT_IS_SHUTDOWN | > CHECKPOINT_END_OF_RECOVERY | > CHECKPOINT_FORCE)) == 0) > { > - if > (prevPtr == ControlFile->checkPointCopy.redo && > + if (GetProgressRecPtr() == prevPtr && > > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) > { > > I think such a check won't consider all the WAL-activity happened > during checkpoint (after checkpoint start, till it finishes) which was > not the intention of this code without your patch. Yes, that's true, in v2 this progress LSN can be updated in two ways: - At the position where checkpoint has begun - At the position where standby snapshot was taken after a checkpoint has run. Honestly, even if we log the snapshot for example before CheckpointGuts(), that's not going to make it... The main issue here is that there will be a snapshot after this checkpoint, so the existing condition is a rather broken concept already when wal_level >= hot_standby because the redo pointer is never going to match the previous LSN pointer. Another idea would be to ensure that the snapshot is logged just after the redo pointer, this would be reliable. > I think both this and previous patch (hs-checkpoints-v1) approach > won't fix the issue in all kind of scenario's. > > Let me try to explain what I think should fix this issue based on > discussion above, suggestions by Andres and some of my own > thoughts: > > Have a new variable in WALInsertLock that would indicate the last > insertion position (last_insert_pos) we write to after holding that lock. > Ensure that we don't update last_insert_pos while logging standby > snapshot (simple way is to pass a flag in XLogInsert or distinguish > it based on type of record (XLOG_RUNNING_XACTS) or if you can > think of a more smarter way). Simplest way is in XLogInsertRecord, the record data is available there, there is for example some logic related to segment switches. > Now both during checkpoint and > in bgwriter, to find out whether there is any activity since last > time, we need to check all the WALInsertLocks for latest insert position > (by referring last_insert_pos) and compare it with the latest position > we have noted during last such check and decide whether to proceed > or not based on comparison result. If you think it is not adequate to > store last_insert_pos in WALInsertLock, then we can think of having > it in PGPROC. So the progress check is used when deciding if a checkpoint should be skipped or not, and when deciding if a standby snapshot should be taken by the bgwriter? When bgwriter logs a snapshot, it will update as well the last LSN found (which would be a single variable in for example XLogCtlData, updated from the data taken from the WAL insert slots). But there is a problem here: if there is no activity since the last bgwriter snapshot, the next checkpoint would be skipped as well. It seems to me that this is not acceptable, a checkpoint generation would be decided based on if there has been some data activity since the last snapshot taken, or to put it in other words, no checkpoints would happen if there has been no activity within 15s after the last standby snapshot has been logged by the bgwriter. I have hacked something according to those lines, please see attached (logs are just here for testing purposes). I may be missing something obvious regarding the tracking of the last progress position. Code speaks better than words here I think, so feel free to look at it. > Yet another idea that occurs to me this morning is that why not > have a variable in shared memory in XLogCtlInsert or XLogCtl > similar to CurrBytePos/PrevBytePos which will be updated on > each XLOGInsert apart from the XLOGInsert for standby snapshots > and use that in a patch somewhat close to what you have in > hs-checkpoints-v1. I have considered that as well, but I do not think it is a good idea to add more contention in ReserveXLogInsertLocation() while taking insertpos_lck lock. That's already really hot, and we surely don't want to make it hotter just to do checks on idle systems. If that's what you had in mind. Btw, this is basically what I did in the attached, no? Except that the current insert position is tracked differently. >From v2 I sent upthread: - if (prevPtr == ControlFile->checkPointCopy.redo && + if (GetProgressRecPtr() == prevPtr && prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) This is wrong actually. prevPtr should be replaced by the redo LSN. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Wed, Feb 3, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-02-02 10:12:25 +0530, Amit Kapila wrote:
> > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
> > if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> > CHECKPOINT_END_OF_RECOVERY |
> > CHECKPOINT_FORCE)) == 0)
> > {
> > - if
> > (prevPtr == ControlFile->checkPointCopy.redo &&
> > + if (GetProgressRecPtr() == prevPtr &&
> >
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> >
> > I think such a check won't consider all the WAL-activity happened
> > during checkpoint (after checkpoint start, till it finishes) which was
> > not the intention of this code without your patch.
>
> Precisely.
>
> > I think both this and previous patch (hs-checkpoints-v1 ) approach
> > won't fix the issue in all kind of scenario's.
>
> Agreed.
>
> > Let me try to explain what I think should fix this issue based on
> > discussion above, suggestions by Andres and some of my own
> > thoughts:
>
> > Have a new variable in WALInsertLock that would indicate the last
> > insertion position (last_insert_pos) we write to after holding that lock.
> > Ensure that we don't update last_insert_pos while logging standby
> > snapshot (simple way is to pass a flag in XLogInsert or distinguish
> > it based on type of record (XLOG_RUNNING_XACTS) or if you can
> > think of a more smarter way). Now both during checkpoint and
> > in bgwriter, to find out whether there is any activity since last
> > time, we need to check all the WALInsertLocks for latest insert position
> > (by referring last_insert_pos) and compare it with the latest position
> > we have noted during last such check and decide whether to proceed
> > or not based on comparison result. If you think it is not adequate to
> > store last_insert_pos in WALInsertLock, then we can think of having
> > it in PGPROC.
>
> Yes, that's pretty much what I was thinking of.
>
>
> On 2016-02-02 10:12:25 +0530, Amit Kapila wrote:
> > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
> > if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> > CHECKPOINT_END_OF_RECOVERY |
> > CHECKPOINT_FORCE)) == 0)
> > {
> > - if
> > (prevPtr == ControlFile->checkPointCopy.redo &&
> > + if (GetProgressRecPtr() == prevPtr &&
> >
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> >
> > I think such a check won't consider all the WAL-activity happened
> > during checkpoint (after checkpoint start, till it finishes) which was
> > not the intention of this code without your patch.
>
> Precisely.
>
> > I think both this and previous patch (hs-checkpoints-v1 ) approach
> > won't fix the issue in all kind of scenario's.
>
> Agreed.
>
> > Let me try to explain what I think should fix this issue based on
> > discussion above, suggestions by Andres and some of my own
> > thoughts:
>
> > Have a new variable in WALInsertLock that would indicate the last
> > insertion position (last_insert_pos) we write to after holding that lock.
> > Ensure that we don't update last_insert_pos while logging standby
> > snapshot (simple way is to pass a flag in XLogInsert or distinguish
> > it based on type of record (XLOG_RUNNING_XACTS) or if you can
> > think of a more smarter way). Now both during checkpoint and
> > in bgwriter, to find out whether there is any activity since last
> > time, we need to check all the WALInsertLocks for latest insert position
> > (by referring last_insert_pos) and compare it with the latest position
> > we have noted during last such check and decide whether to proceed
> > or not based on comparison result. If you think it is not adequate to
> > store last_insert_pos in WALInsertLock, then we can think of having
> > it in PGPROC.
>
> Yes, that's pretty much what I was thinking of.
>
I think generally it is good idea, but one thing worth a thought is that
by doing so, we need to acquire all WAL Insertion locks every
LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
every slot, do you think it is matter of concern in any way for write
workloads or it won't effect as we need to do this periodically?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-04 18:21:41 +0530, Amit Kapila wrote: > I think generally it is good idea, but one thing worth a thought is that > by doing so, we need to acquire all WAL Insertion locks every > LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for > every slot, do you think it is matter of concern in any way for write > workloads or it won't effect as we need to do this periodically? Michael and I just had an in-person discussion, and one of the topics was that. The plan was basically to adapt the patch to: 1) Store the progress lsn inside the wal insert lock 2) Change the HasActivity API to return an the last LSN at which there was activity, instead of a boolean. 2) Individually acquire each insert locks's lwlock to get it's progress LSN, but not the exclusive insert lock. We needthe lwllock to avoid a torn 8byte read on some platforms. I think that mostly should address your concerns? Andres
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Thu, Feb 4, 2016 at 6:40 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> > I think generally it is good idea, but one thing worth a thought is that
> > by doing so, we need to acquire all WAL Insertion locks every
> > LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> > every slot, do you think it is matter of concern in any way for write
> > workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
> was activity, instead of a boolean.
> 2) Individually acquire each insert locks's lwlock to get it's progress
> LSN, but not the exclusive insert lock. We need the lwllock to avoid
> a torn 8byte read on some platforms.
>
> I think that mostly should address your concerns?
>
>
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> > I think generally it is good idea, but one thing worth a thought is that
> > by doing so, we need to acquire all WAL Insertion locks every
> > LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> > every slot, do you think it is matter of concern in any way for write
> > workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
> was activity, instead of a boolean.
> 2) Individually acquire each insert locks's lwlock to get it's progress
> LSN, but not the exclusive insert lock. We need the lwllock to avoid
> a torn 8byte read on some platforms.
>
> I think that mostly should address your concerns?
>
Yes, this sounds better and in-anycase we can do some benchmarks
to verify the same once patch is in shape.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-04 18:21:41 +0530, Amit Kapila wrote: >> I think generally it is good idea, but one thing worth a thought is that >> by doing so, we need to acquire all WAL Insertion locks every >> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for >> every slot, do you think it is matter of concern in any way for write >> workloads or it won't effect as we need to do this periodically? > > Michael and I just had an in-person discussion, and one of the topics > was that. The plan was basically to adapt the patch to: > 1) Store the progress lsn inside the wal insert lock > 2) Change the HasActivity API to return an the last LSN at which there > was activity, instead of a boolean. > 3) Individually acquire each insert locks's lwlock to get it's progress > LSN, but not the exclusive insert lock. We need the lwllock to avoid > a torn 8byte read on some platforms. 4) Switch the condition to decide if a checkpoint should be skipped using the last activity position compared with ProcLastRecPtr in CreateCheckpoint to see if any activity has occurred since the checkpoint record was inserted, and do not care anymore if the previous record and current record are on different segments. This would basically work. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 6:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote: >>> I think generally it is good idea, but one thing worth a thought is that >>> by doing so, we need to acquire all WAL Insertion locks every >>> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for >>> every slot, do you think it is matter of concern in any way for write >>> workloads or it won't effect as we need to do this periodically? >> >> Michael and I just had an in-person discussion, and one of the topics >> was that. The plan was basically to adapt the patch to: >> 1) Store the progress lsn inside the wal insert lock >> 2) Change the HasActivity API to return an the last LSN at which there >> was activity, instead of a boolean. >> 3) Individually acquire each insert locks's lwlock to get it's progress >> LSN, but not the exclusive insert lock. We need the lwllock to avoid >> a torn 8byte read on some platforms. > > 4) Switch the condition to decide if a checkpoint should be skipped > using the last activity position compared with ProcLastRecPtr in > CreateCheckpoint to see if any activity has occurred since the > checkpoint record was inserted, and do not care anymore if the > previous record and current record are on different segments. This > would basically work. OK, attached is a patch that I believe addresses those issues. The patch still has a couple of LOG entries that we had better remove in the version that gets pushed, but they have proved to be useful for me when testing the patch with a low checkpoint_timeout value to see if checkpoints are properly skipped on an idle system. I found myself adding another routine called GetLastCheckpointRecPtr() for the bgwriter because ControlFile is not declared externally even if it is in shared memory. I think that's better this way. The original bug report referred to a low archive_timeout causing standby snapshots to be logged when segments are switched. So we may want at the end to not update the progress LSN for segment switch records as well, but I have let that out of the patch for the time being to address the primary concern of unnecessary checkpoints for wal_level >= hs. We could address it with a later patch (planning to do so), let's keep a step-by-step approach. The patch attached will apply on master, on 9.5 there is one minor conflict. For older versions we will need another reworked patch. I am fine to produce those once we are fine with the shape of what gets into master and 9.5. 9.4 uses WAL insert locks so things are rather similar. For ~9.3, I think that we are going to need a single variable in XLogCtl or similar to track the progress, but I have not looked into that in details yet. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-06 22:03:15 +0900, Michael Paquier wrote: > The patch attached will apply on master, on 9.5 there is one minor > conflict. For older versions we will need another reworked patch. FWIW, I don't think we should backpatch this. It'd look noticeably different in the back branches, and this isn't really a critical issue. I think it makes sense to see this as an optimization. > + /* > + * Update the progress LSN positions. At least one WAL insertion lock > + * is already taken appropriately before doing that, and it is just more > + * simple to do that here where WAL record data and type is at hand. > + * The progress is set at the start position of the record tracked that > + * is being added, making easier checkpoint progress tracking as the > + * control file already saves the start LSN position of the last > + * checkpoint run. > + */ > + if (!isStandbySnapshot) > + { I don't like isStandbySnapshot much, it seems better to do this more generally, via a flag passed down by the inserter. > + if (holdingAllLocks) > + { > + int i; > + > + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++) > + WALInsertLocks[i].l.progressAt = StartPos; Why update all? > /* > + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed > + * at the last significant WAL activity, or in other words any activity > + * not referring to standby logging as of now. Finding the last activity > + * position is done by scanning each WAL insertion lock by taking directly > + * the light-weight lock associated to it. > + */ > +XLogRecPtr > +GetProgressRecPtr(void) > +{ > + XLogRecPtr res = InvalidXLogRecPtr; > + int i; > + > + /* > + * Look at the latest LSN position referring to the activity done by > + * WAL insertion. > + */ > + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++) > + { > + XLogRecPtr progress_lsn; > + > + LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE); > + progress_lsn = WALInsertLocks[i].l.progressAt; > + LWLockRelease(&WALInsertLocks[i].l.lock); Add a comment explaining that we a) need a lock because of the potential for "torn reads" on some platforms. b) need an exclusive one, because the insert lock logic currently only expects exclusive locks. > /* > + * Fetch the progress position before taking any WAL insert lock. This > + * is normally an operation that does not take long, but leaving this > + * lookup out of the section taken an exclusive lock saves a couple > + * of instructions. > + */ > + progress_lsn = GetProgressRecPtr(); too long for my taste. How about: /* get progress, before acuiring insert locks to shorten locked section */ Looks much better now. Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-06 22:03:15 +0900, Michael Paquier wrote: >> + /* >> + * Update the progress LSN positions. At least one WAL insertion lock >> + * is already taken appropriately before doing that, and it is just more >> + * simple to do that here where WAL record data and type is at hand. >> + * The progress is set at the start position of the record tracked that >> + * is being added, making easier checkpoint progress tracking as the >> + * control file already saves the start LSN position of the last >> + * checkpoint run. >> + */ >> + if (!isStandbySnapshot) >> + { > > I don't like isStandbySnapshot much, it seems better to do this more > generally, via a flag passed down by the inserter. Instead of updating every single call of XLogInsert() in the system, what do you think about introducing a new routine XLogInsertExtended() that would have this optional flag? This would wrap the existing XLogInsert() and pass the flag to XLogInsertRecord(). -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-06 22:03:15 +0900, Michael Paquier wrote: >> The patch attached will apply on master, on 9.5 there is one minor >> conflict. For older versions we will need another reworked patch. > > FWIW, I don't think we should backpatch this. It'd look noticeably > different in the back branches, and this isn't really a critical > issue. I think it makes sense to see this as an optimization. The original bug report of this thread is based on 9.4, which is the first version where the standby snapshot in the bgwriter has been introduced. Knowing that most of the systems in the world are actually let idle. If those are running Postgres, I'd like to think that it is a waste. Now it is true that this is not a critical issue. >> + /* >> + * Update the progress LSN positions. At least one WAL insertion lock >> + * is already taken appropriately before doing that, and it is just more >> + * simple to do that here where WAL record data and type is at hand. >> + * The progress is set at the start position of the record tracked that >> + * is being added, making easier checkpoint progress tracking as the >> + * control file already saves the start LSN position of the last >> + * checkpoint run. >> + */ >> + if (!isStandbySnapshot) >> + { > > I don't like isStandbySnapshot much, it seems better to do this more > generally, via a flag passed down by the inserter. Hm, OK. I think that the best way to deal with that is to introduce XLogInsertExtended which wraps the existing XLogInsert(). By default the flag is true. This will reduce the footprint of this patch on the code base and will ease future backpatches on those code paths at least down to 9.5 where WAL refactoring has been introduced. >> + if (holdingAllLocks) >> + { >> + int i; >> + >> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++) >> + WALInsertLocks[i].l.progressAt = StartPos; > > Why update all? For consistency. I agree that updating one, say the first one is enough. I have added a comment explaining that as well. >> /* >> + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed >> + * at the last significant WAL activity, or in other words any activity >> + * not referring to standby logging as of now. Finding the last activity >> + * position is done by scanning each WAL insertion lock by taking directly >> + * the light-weight lock associated to it. >> + */ >> +XLogRecPtr >> +GetProgressRecPtr(void) >> +{ >> + XLogRecPtr res = InvalidXLogRecPtr; >> + int i; >> + >> + /* >> + * Look at the latest LSN position referring to the activity done by >> + * WAL insertion. >> + */ >> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++) >> + { >> + XLogRecPtr progress_lsn; >> + >> + LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE); >> + progress_lsn = WALInsertLocks[i].l.progressAt; >> + LWLockRelease(&WALInsertLocks[i].l.lock); > > Add a comment explaining that we a) need a lock because of the potential > for "torn reads" on some platforms. b) need an exclusive one, because > the insert lock logic currently only expects exclusive locks. Check. >> /* >> + * Fetch the progress position before taking any WAL insert lock. This >> + * is normally an operation that does not take long, but leaving this >> + * lookup out of the section taken an exclusive lock saves a couple >> + * of instructions. >> + */ >> + progress_lsn = GetProgressRecPtr(); > > too long for my taste. How about: > /* get progress, before acuiring insert locks to shorten locked section */ Check. > Looks much better now. Thanks. Let's wrap that! An updated patch is attached. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-08 15:58:49 +0900, Michael Paquier wrote: > On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-02-06 22:03:15 +0900, Michael Paquier wrote: > >> The patch attached will apply on master, on 9.5 there is one minor > >> conflict. For older versions we will need another reworked patch. > > > > FWIW, I don't think we should backpatch this. It'd look noticeably > > different in the back branches, and this isn't really a critical > > issue. I think it makes sense to see this as an optimization. > > The original bug report of this thread is based on 9.4, which is the > first version where the standby snapshot in the bgwriter has been > introduced. Knowing that most of the systems in the world are actually > let idle. If those are running Postgres, I'd like to think that it is > a waste. Now it is true that this is not a critical issue. Unconvinced. For one, the equivalent behaviour for checkpoints has existed since at least 9.0. Secondly, the problem only really appears if you use archiving, configure a nondefault archive timeout, and that timeout is significantly smaller than checkpoint timeout. And then the cost is partially filled segment every now and then. That just doesn't stand into relation into adding some nontrivial code into backbranches. > >> + if (holdingAllLocks) > >> + { > >> + int i; > >> + > >> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++) > >> + WALInsertLocks[i].l.progressAt = StartPos; > > > > Why update all? > > For consistency. I agree that updating one, say the first one is > enough. I have added a comment explaining that as well. We don't set valid values in WALInsertLockAcquireExclusive for all locks either. I don't see consistency to what this is going to buy us. > /* > + * XLogInsert > + * > + * A shorthand for XLogInsertExtended, to update the progress of WAL > + * activity by default. > + */ > +XLogRecPtr > +XLogInsert(RmgrId rmid, uint8 info) > +{ > + return XLogInsertExtended(rmid, info, true); > +} > + > +/* > + * XLogInsertExtended I'm not really a fan. I'd rather change existing callers to add a 'flags' bitmask argument. Right now there can't really be XLogInserts() in extension code, so that's pretty ok to change. Greetings, Andres Freund
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Feb 8, 2016 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-08 15:58:49 +0900, Michael Paquier wrote: >> On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote: >> /* >> + * XLogInsert >> + * >> + * A shorthand for XLogInsertExtended, to update the progress of WAL >> + * activity by default. >> + */ >> +XLogRecPtr >> +XLogInsert(RmgrId rmid, uint8 info) >> +{ >> + return XLogInsertExtended(rmid, info, true); >> +} >> + >> +/* >> + * XLogInsertExtended > > I'm not really a fan. I'd rather change existing callers to add a > 'flags' bitmask argument. Right now there can't really be XLogInserts() > in extension code, so that's pretty ok to change. So, you mean to extend directly XLogInsert() with an int flag that has only one possible value for now, and update *all* the calls of XLogInsert(). Well, I am detecting 218 calls of XLogInsert() in the code, so that's not a small change. Honestly, I'd really rather have the Extended() routine with default taking 0 for the integer flag, 0 meaning in the case of the progress that it gets updated. We have similar routines for RangeVar, some in bufmgr.c and some other places. So, which way do you want to take? I don't expect to win this argument if that's a 1vs1 against a committer, just I would like to be sure that I am not doing useless things. Time matters. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>
> >> /*
> >> + * Fetch the progress position before taking any WAL insert lock. This
> >> + * is normally an operation that does not take long, but leaving this
> >> + * lookup out of the section taken an exclusive lock saves a couple
> >> + * of instructions.
> >> + */
> >> + progress_lsn = GetProgressRecPtr();
> >
> > too long for my taste. How about:
> > /* get progress, before acuiring insert locks to shorten locked section */
>
> Check.
>
>
>
> >> /*
> >> + * Fetch the progress position before taking any WAL insert lock. This
> >> + * is normally an operation that does not take long, but leaving this
> >> + * lookup out of the section taken an exclusive lock saves a couple
> >> + * of instructions.
> >> + */
> >> + progress_lsn = GetProgressRecPtr();
> >
> > too long for my taste. How about:
> > /* get progress, before acuiring insert locks to shorten locked section */
>
> Check.
>
What is the need of holding locks one-by-one during checkpoint when
we anyway are going to take lock on all the insertion slots.
+ * to not rely on taking an exclusive lock an all the WAL insertion locks,
/an all/on all
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Feb 8, 2016 at 11:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> >> >> /* >> >> + * Fetch the progress position before taking any WAL insert lock. >> >> This >> >> + * is normally an operation that does not take long, but leaving >> >> this >> >> + * lookup out of the section taken an exclusive lock saves a >> >> couple >> >> + * of instructions. >> >> + */ >> >> + progress_lsn = GetProgressRecPtr(); >> > >> > too long for my taste. How about: >> > /* get progress, before acuiring insert locks to shorten locked section >> > */ >> >> Check. >> > > What is the need of holding locks one-by-one during checkpoint when > we anyway are going to take lock on all the insertion slots. A couple of records can slip in while scanning the progress LSN through all the locks. > + * to not rely on taking an exclusive lock an all the WAL insertion locks, > > /an all/on all Nice catch. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Tue, Feb 9, 2016 at 5:37 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Feb 8, 2016 at 11:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >> >> /*
> >> >> + * Fetch the progress position before taking any WAL insert lock.
> >> >> This
> >> >> + * is normally an operation that does not take long, but leaving
> >> >> this
> >> >> + * lookup out of the section taken an exclusive lock saves a
> >> >> couple
> >> >> + * of instructions.
> >> >> + */
> >> >> + progress_lsn = GetProgressRecPtr();
> >> >
> >> > too long for my taste. How about:
> >> > /* get progress, before acuiring insert locks to shorten locked section
> >> > */
> >>
> >> Check.
> >>
> >
> > What is the need of holding locks one-by-one during checkpoint when
> > we anyway are going to take lock on all the insertion slots.
>
> A couple of records can slip in while scanning the progress LSN
> through all the locks.
>
>
> On Mon, Feb 8, 2016 at 11:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >> >> /*
> >> >> + * Fetch the progress position before taking any WAL insert lock.
> >> >> This
> >> >> + * is normally an operation that does not take long, but leaving
> >> >> this
> >> >> + * lookup out of the section taken an exclusive lock saves a
> >> >> couple
> >> >> + * of instructions.
> >> >> + */
> >> >> + progress_lsn = GetProgressRecPtr();
> >> >
> >> > too long for my taste. How about:
> >> > /* get progress, before acuiring insert locks to shorten locked section
> >> > */
> >>
> >> Check.
> >>
> >
> > What is the need of holding locks one-by-one during checkpoint when
> > we anyway are going to take lock on all the insertion slots.
>
> A couple of records can slip in while scanning the progress LSN
> through all the locks.
>
Do you see any benefit in allowing checkpoints for such cases considering
bgwriter will anyway take care of logging standby snapshot for such
cases?
I don't think there is any reasonable benefit by improving the situation of
idle-system check for checkpoint other than just including
standbysnapshot WAL record. OTOH as checkpoint is not so seldom
operation, so allowing such a change should be okay, but I don't see
the need for same.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Feb 9, 2016 at 2:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Do you see any benefit in allowing checkpoints for such cases considering > bgwriter will anyway take care of logging standby snapshot for such > cases? Well, the idea is to improve the system responsiveness. Imagine that the call to GetProgressRecPtr() is done within the exclusive lock portion, but that while scanning the WAL insert lock entries another backend is waiting to take a lock on them, and this backend is trying to insert the first record that updates the progress LSN since the last checkpoint. In this case the checkpoint would be skipped. However, imagine that such a record is able to get through it while scanning the progress values in the WAL insert locks, in which case a checkpoint would be generated. In any case, I think that we should try to get exclusive lock areas as small as possible if we have ways to do so. > I don't think there is any reasonable benefit to improve the situation of > idle-system check for checkpoint other than just including > standby snapshot WAL record. OTOH as checkpoint is not so seldom > operation, so allowing such a change should be okay, but I don't see > the need for same. I am not sure I understand your point here... -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Feb 9, 2016 at 2:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do you see any benefit in allowing checkpoints for such cases considering
> > bgwriter will anyway take care of logging standby snapshot for such
> > cases?
>
> Well, the idea is to improve the system responsiveness. Imagine that
> the call to GetProgressRecPtr() is done within the exclusive lock
> portion, but that while scanning the WAL insert lock entries another
> backend is waiting to take a lock on them, and this backend is trying
> to insert the first record that updates the progress LSN since the
> last checkpoint. In this case the checkpoint would be skipped.
> However, imagine that such a record is able to get through it while
> scanning the progress values in the WAL insert locks, in which case a
> checkpoint would be generated.
>
> On Tue, Feb 9, 2016 at 2:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do you see any benefit in allowing checkpoints for such cases considering
> > bgwriter will anyway take care of logging standby snapshot for such
> > cases?
>
> Well, the idea is to improve the system responsiveness. Imagine that
> the call to GetProgressRecPtr() is done within the exclusive lock
> portion, but that while scanning the WAL insert lock entries another
> backend is waiting to take a lock on them, and this backend is trying
> to insert the first record that updates the progress LSN since the
> last checkpoint. In this case the checkpoint would be skipped.
> However, imagine that such a record is able to get through it while
> scanning the progress values in the WAL insert locks, in which case a
> checkpoint would be generated.
Such case was not covered without your patch and I don't see the
need of same especially at the cost of taking locks.
> In any case, I think that we should try
> to get exclusive lock areas as small as possible if we have ways to do
> so.
>
> to get exclusive lock areas as small as possible if we have ways to do
> so.
>
Sure, but not when you are already going to take lock for longer
duration.
- last_snapshot_lsn != GetXLogInsertRecPtr())
+
GetLastCheckpointRecPtr() < GetProgressRecPtr())
How the above check in patch suppose to work?
I think it could so happen once bgwriter logs snapshot, both checkpoint
and progressAt won't get updated any further and I think this check will
allow to log snapshots in such a case as well.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote: > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote: >> Well, the idea is to improve the system responsiveness. Imagine that >> the call to GetProgressRecPtr() is done within the exclusive lock >> portion, but that while scanning the WAL insert lock entries another >> backend is waiting to take a lock on them, and this backend is trying >> to insert the first record that updates the progress LSN since the >> last checkpoint. In this case the checkpoint would be skipped. >> However, imagine that such a record is able to get through it while >> scanning the progress values in the WAL insert locks, in which case a >> checkpoint would be generated. > > Such case was not covered without your patch and I don't see the > need of same especially at the cost of taking locks. In this code path that's quite cheap, and it clearly improves the decision-making when wal_level >= hs which is now rather broken (say not optimized much). >> In any case, I think that we should try >> to get exclusive lock areas as small as possible if we have ways to do >> so. >> > > Sure, but not when you are already going to take lock for longer > duration. Why would an exclusive lock be taken for a longer duration in the checkpoint portion? Do you mean that the global time to take exclusive locks gets increased? For each individual WAL insert lock, yes that's the case, but that's still far cheaper to have this evaluation logic here than in ReserveXLogInsertLocation(). > - last_snapshot_lsn != GetXLogInsertRecPtr()) > + > GetLastCheckpointRecPtr() < GetProgressRecPtr()) > > How the above check in patch suppose to work? > I think it could so happen once bgwriter logs snapshot, both checkpoint > and progressAt won't get updated any further and I think this check will > allow to log snapshots in such a case as well. The only purpose of this check is to do the following thing: if no WAL activity has happened since last checkpoint, there is no need to log a standby snapshot from the perspective of the bgwriter. In case the system is idle, we want to skip logging that and avoid unnecessary checkpoints because those records would have been generated. If the system is not idle, or to put it in other words there has been at least one record since the last checkpoint, we would log a standby snapshot, enforcing as well a checkpoint to happen the next time CreateCheckpoint() is gone through, and a standby snapshot is logged as well after the checkpoint contents are flushed. I am not sure I understand what you are getting at... -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
> > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
> >> Well, the idea is to improve the system responsiveness. Imagine that
> >> the call to GetProgressRecPtr() is done within the exclusive lock
> >> portion, but that while scanning the WAL insert lock entries another
> >> backend is waiting to take a lock on them, and this backend is trying
> >> to insert the first record that updates the progress LSN since the
> >> last checkpoint. In this case the checkpoint would be skipped.
> >> However, imagine that such a record is able to get through it while
> >> scanning the progress values in the WAL insert locks, in which case a
> >> checkpoint would be generated.
> >
> > Such case was not covered without your patch and I don't see the
> > need of same especially at the cost of taking locks.
>
> In this code path that's quite cheap, and it clearly improves the
> decision-making when wal_level >= hs which is now rather broken (say
> not optimized much).
>
> >> In any case, I think that we should try
> >> to get exclusive lock areas as small as possible if we have ways to do
> >> so.
> >>
> >
> > Sure, but not when you are already going to take lock for longer
> > duration.
>
> Why would an exclusive lock be taken for a longer duration in the
> checkpoint portion?
Consider below code:
>
> > - last_snapshot_lsn != GetXLogInsertRecPtr())
> > +
> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
> >
> > How the above check in patch suppose to work?
> > I think it could so happen once bgwriter logs snapshot, both checkpoint
> > and progressAt won't get updated any further and I think this check will
> > allow to log snapshots in such a case as well.
>
> The only purpose of this check is to do the following thing: if no WAL
> activity has happened since last checkpoint, there is no need to log a
> standby snapshot from the perspective of the bgwriter. In case the
> system is idle, we want to skip logging that and avoid unnecessary
> checkpoints because those records would have been generated. If the
> system is not idle, or to put it in other words there has been at
> least one record since the last checkpoint, we would log a standby
> snapshot, enforcing as well a checkpoint to happen the next time
> CreateCheckpoint() is gone through, and a standby snapshot is logged
> as well after the checkpoint contents are flushed. I am not sure I
> understand what you are getting at...
>
>
> On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
> > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
> >> Well, the idea is to improve the system responsiveness. Imagine that
> >> the call to GetProgressRecPtr() is done within the exclusive lock
> >> portion, but that while scanning the WAL insert lock entries another
> >> backend is waiting to take a lock on them, and this backend is trying
> >> to insert the first record that updates the progress LSN since the
> >> last checkpoint. In this case the checkpoint would be skipped.
> >> However, imagine that such a record is able to get through it while
> >> scanning the progress values in the WAL insert locks, in which case a
> >> checkpoint would be generated.
> >
> > Such case was not covered without your patch and I don't see the
> > need of same especially at the cost of taking locks.
>
> In this code path that's quite cheap, and it clearly improves the
> decision-making when wal_level >= hs which is now rather broken (say
> not optimized much).
>
> >> In any case, I think that we should try
> >> to get exclusive lock areas as small as possible if we have ways to do
> >> so.
> >>
> >
> > Sure, but not when you are already going to take lock for longer
> > duration.
>
> Why would an exclusive lock be taken for a longer duration in the
> checkpoint portion?
Consider below code:
/*
+ * Get progress before acquiring insert locks to shorten the locked
+ * section waiting ahead.
+ */
+ progress_lsn = GetProgressRecPtr();
+
+ /*
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
WALInsertLockAcquireExclusive();
In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
retrieve the latest value of progressAt and then again it will take
all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
do some work. Now I think it is okay to retrieve the latest of progressAt
after WALInsertLockAcquireExclusive() as you don't need to take the
locks again.
>
> > - last_snapshot_lsn != GetXLogInsertRecPtr())
> > +
> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
> >
> > How the above check in patch suppose to work?
> > I think it could so happen once bgwriter logs snapshot, both checkpoint
> > and progressAt won't get updated any further and I think this check will
> > allow to log snapshots in such a case as well.
>
> The only purpose of this check is to do the following thing: if no WAL
> activity has happened since last checkpoint, there is no need to log a
> standby snapshot from the perspective of the bgwriter. In case the
> system is idle, we want to skip logging that and avoid unnecessary
> checkpoints because those records would have been generated. If the
> system is not idle, or to put it in other words there has been at
> least one record since the last checkpoint, we would log a standby
> snapshot, enforcing as well a checkpoint to happen the next time
> CreateCheckpoint() is gone through, and a standby snapshot is logged
> as well after the checkpoint contents are flushed. I am not sure I
> understand what you are getting at...
>
Let me try to say again, suppose ControlFile->checkPoint is at
100 and latest value of progressAt returned by GetProgressRecPtr
is 105, so the first time the above check happens, it will allow
to log standby snapshot which is okay, now assume there is no
activity, neither there is any checkpoint and again after
LOG_SNAPSHOT_INTERVAL_MS interval, when the above checkgets executed, it will pass and log the standby snapshot which is
*not* okay, because we don't want to log standby snapshot when
there is no activity. Am I missing something?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
>> > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
>> >> Well, the idea is to improve the system responsiveness. Imagine that
>> >> the call to GetProgressRecPtr() is done within the exclusive lock
>> >> portion, but that while scanning the WAL insert lock entries another
>> >> backend is waiting to take a lock on them, and this backend is trying
>> >> to insert the first record that updates the progress LSN since the
>> >> last checkpoint. In this case the checkpoint would be skipped.
>> >> However, imagine that such a record is able to get through it while
>> >> scanning the progress values in the WAL insert locks, in which case a
>> >> checkpoint would be generated.
>> >
>> > Such case was not covered without your patch and I don't see the
>> > need of same especially at the cost of taking locks.
>>
>> In this code path that's quite cheap, and it clearly improves the
>> decision-making when wal_level >= hs which is now rather broken (say
>> not optimized much).
>>
>> >> In any case, I think that we should try
>> >> to get exclusive lock areas as small as possible if we have ways to do
>> >> so.
>> >>
>> >
>> > Sure, but not when you are already going to take lock for longer
>> > duration.
>>
>> Why would an exclusive lock be taken for a longer duration in the
>> checkpoint portion?
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.
Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.
>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?
Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)
--
Michael
Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Wed, Feb 10, 2016 at 12:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.
On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.
Taking and releasing locks again and again (which is done in patch)
matters much more than adding few instructions under it and I think
if you would have written the code in-a-way as in patch in some of
the critical path, it would have been regressed badly, but because
checkpoint doesn't happen often, reproducing regression is difficult.
Anyway, there is no-point in too much argument, I think you have
understood what I wanted to say and if you think the way code is
currently written is better, then lets leave as it is for somebody else
to give suggestion on it.
Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)
Why do you think including checkpoint related check is
required, how about something like below:
(now >= timeout &&
last_progress_ptr < current_progress_ptr)
last_progress_ptr < current_progress_ptr)
That's your point?
Yes.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 4:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 10, 2016 at 12:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.
On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.Taking and releasing locks again and again (which is done in patch)matters much more than adding few instructions under it and I thinkif you would have written the code in-a-way as in patch in some ofthe critical path, it would have been regressed badly, but becausecheckpoint doesn't happen often, reproducing regression is difficult.Anyway, there is no-point in too much argument, I think you haveunderstood what I wanted to say and if you think the way code iscurrently written is better, then lets leave as it is for somebody elseto give suggestion on it.
Okay.
Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)Why do you think including checkpoint related check isrequired, how about something like below:(now >= timeout &&
last_progress_ptr < current_progress_ptr)
We need as well to be sure that no standby snapshots are logged just after a checkpoint in this code path when last_progress_ptr is referring to an LSN position *before* the last checkpoint LSN. There is already one snapshot in CreateCheckpoint() that is logged, there is no need of an extra one, explaining why the check on progress since last checkpoint is necessary to me.
--
Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Wed, Feb 10, 2016 at 1:01 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Feb 10, 2016 at 4:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)Why do you think including checkpoint related check isrequired, how about something like below:(now >= timeout &&
last_progress_ptr < current_progress_ptr)We need as well to be sure that no standby snapshots are logged just after a checkpoint in this code path when last_progress_ptr is referring to an LSN position *before* the last checkpoint LSN. There is already one snapshot in CreateCheckpoint() that is logged, there is no need of an extra one, explaining why the check on progress since last checkpoint is necessary to me.
Okay, but isn't it better that we remove the snapshot taken
at checkpoint time in the main branch or till where this code is
getting back-patched. Do you see any need of same after
having the logging of snapshot in bgwriter?
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, but isn't it better that we remove the snapshot taken > at checkpoint time in the main branch or till where this code is > getting back-patched. Do you see any need of same after > having the logging of snapshot in bgwriter? But this one is necessary as well to allow hot standby faster to initialize, no? Particularly in the case where a bgwriter snapshot would have been taken just before the checkpoint, there may be up to 15s until the next one. And AFAIK, this code would not get a back-patch, as stated by Andres upthread :( I would think that it is better to actually get a backpatch, but well... -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, but isn't it better that we remove the snapshot taken
> > at checkpoint time in the main branch or till where this code is
> > getting back-patched. Do you see any need of same after
> > having the logging of snapshot in bgwriter?
>
> But this one is necessary as well to allow hot standby faster to
> initialize, no? Particularly in the case where a bgwriter snapshot
> would have been taken just before the checkpoint, there may be up to
> 15s until the next one.
>
>
> On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, but isn't it better that we remove the snapshot taken
> > at checkpoint time in the main branch or till where this code is
> > getting back-patched. Do you see any need of same after
> > having the logging of snapshot in bgwriter?
>
> But this one is necessary as well to allow hot standby faster to
> initialize, no? Particularly in the case where a bgwriter snapshot
> would have been taken just before the checkpoint, there may be up to
> 15s until the next one.
>
It could be helpful if checkpoint is done at shorter intervals, otherwise
we anyway log it at 15s interval and if need faster initialisation
of hot-standby, then it is better to reduce the log snapshot interval
in bgwriter. I think if go by my reasoning then it should have been
removed when bgwriter is changed to log snapshot, so I think you
can take this suggestion if you also feel the same, anyway this
is not a correctness issue.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-11 09:25:30 +0530, Amit Kapila wrote: > On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: > > > > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > Okay, but isn't it better that we remove the snapshot taken > > > at checkpoint time in the main branch or till where this code is > > > getting back-patched. Do you see any need of same after > > > having the logging of snapshot in bgwriter? > > > > But this one is necessary as well to allow hot standby faster to > > initialize, no? Particularly in the case where a bgwriter snapshot > > would have been taken just before the checkpoint, there may be up to > > 15s until the next one. > > > > It could be helpful if checkpoint is done at shorter intervals, otherwise > we anyway log it at 15s interval and if need faster initialisation > of hot-standby, then it is better to reduce the log snapshot interval > in bgwriter. No. By emitting the first snapshot directly after the determination of the redo pointer, we can get into STANDBY_SNAPSHOT_READY more quickly. Especially if some of the snapshots are overflowed. During startup 15s can be a *long* time; but on the other hand there's not much benefit at logging snapshots more frequently when the system is up. I don't think we should tinker with the frequency/logging points, while fixing the issue here.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Thu, Feb 11, 2016 at 5:20 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-11 09:25:30 +0530, Amit Kapila wrote: >> On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com> >> wrote: >> > >> > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > > Okay, but isn't it better that we remove the snapshot taken >> > > at checkpoint time in the main branch or till where this code is >> > > getting back-patched. Do you see any need of same after >> > > having the logging of snapshot in bgwriter? >> > >> > But this one is necessary as well to allow hot standby faster to >> > initialize, no? Particularly in the case where a bgwriter snapshot >> > would have been taken just before the checkpoint, there may be up to >> > 15s until the next one. >> > >> >> It could be helpful if checkpoint is done at shorter intervals, otherwise >> we anyway log it at 15s interval and if need faster initialisation >> of hot-standby, then it is better to reduce the log snapshot interval >> in bgwriter. > > No. By emitting the first snapshot directly after the determination of > the redo pointer, we can get into STANDBY_SNAPSHOT_READY more > quickly. Especially if some of the snapshots are overflowed. During > startup 15s can be a *long* time; but on the other hand there's not much > benefit at logging snapshots more frequently when the system is up. > I don't think we should tinker with the frequency/logging points, while > fixing the issue here. Agreement from here on this point. Andres, what's still remaining on my side is to know how to we detect in XLoginsert() how we update the LSN progress. I was willing to add XLogInsertExtended() with a complementary int/bool flag and let XLogInsert() alone as you don't like much doing this decision making using just the record type as I did in one of the patch upthread, however you did not like this idea either. What exactly do you have in mind? -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Thu, Feb 11, 2016 at 5:46 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 11, 2016 at 5:20 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-02-11 09:25:30 +0530, Amit Kapila wrote: >>> On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com> >>> wrote: >>> > >>> > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> >>> wrote: >>> > > Okay, but isn't it better that we remove the snapshot taken >>> > > at checkpoint time in the main branch or till where this code is >>> > > getting back-patched. Do you see any need of same after >>> > > having the logging of snapshot in bgwriter? >>> > >>> > But this one is necessary as well to allow hot standby faster to >>> > initialize, no? Particularly in the case where a bgwriter snapshot >>> > would have been taken just before the checkpoint, there may be up to >>> > 15s until the next one. >>> > >>> >>> It could be helpful if checkpoint is done at shorter intervals, otherwise >>> we anyway log it at 15s interval and if need faster initialisation >>> of hot-standby, then it is better to reduce the log snapshot interval >>> in bgwriter. >> >> No. By emitting the first snapshot directly after the determination of >> the redo pointer, we can get into STANDBY_SNAPSHOT_READY more >> quickly. Especially if some of the snapshots are overflowed. During >> startup 15s can be a *long* time; but on the other hand there's not much >> benefit at logging snapshots more frequently when the system is up. >> I don't think we should tinker with the frequency/logging points, while >> fixing the issue here. > > Agreement from here on this point. Andres, what's still remaining on > my side is to know how to we detect in XLoginsert() how we update the > LSN progress. I was willing to add XLogInsertExtended() with a > complementary int/bool flag and let XLogInsert() alone as you don't > like much doing this decision making using just the record type as I > did in one of the patch upthread, however you did not like this idea > either. What exactly do you have in mind? OK, here is attached a new version that I hope addresses all the points raised until now. The following things are changed: - Extend XLogInsert with a new uint8 argument to have flags. As of now there is one flag: XLOG_INSERT_NO_PROGRESS that can be set to not update the progress. By default, the progress LSN is updated. - Add extra check in bgwriter to not log a snapshot to be sure that no snapshots are logged when there is no activity since last snapshot taken, and since last checkpoint. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote: > I'm not really a fan. I'd rather change existing callers to add a > 'flags' bitmask argument. Right now there can't really be XLogInserts() > in extension code, so that's pretty ok to change. Yeah, but to what benefit? You're just turning a smaller patch into a bigger one and requiring churning a bunch of code that wouldn't otherwise need to be touched. I think Michael has a good point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Andres Freund
Date:
On 2016-02-12 12:37:35 -0500, Robert Haas wrote: > On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote: > > I'm not really a fan. I'd rather change existing callers to add a > > 'flags' bitmask argument. Right now there can't really be XLogInserts() > > in extension code, so that's pretty ok to change. > > Yeah, but to what benefit? You're just turning a smaller patch into a > bigger one and requiring churning a bunch of code that wouldn't > otherwise need to be touched. I think Michael has a good point. It has the advantage of not ending up with an extra interface, that we're otherwise never going to get rid of? If not now, when would we remove it? Sure it touches a few more lines, but that's entirely trivial mechanical changes, so what? Andres
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-12 12:37:35 -0500, Robert Haas wrote: >> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote: >> > I'm not really a fan. I'd rather change existing callers to add a >> > 'flags' bitmask argument. Right now there can't really be XLogInserts() >> > in extension code, so that's pretty ok to change. >> >> Yeah, but to what benefit? You're just turning a smaller patch into a >> bigger one and requiring churning a bunch of code that wouldn't >> otherwise need to be touched. I think Michael has a good point. > > It has the advantage of not ending up with an extra interface, that > we're otherwise never going to get rid of? If not now, when would we > remove it? Sure it touches a few more lines, but that's entirely trivial > mechanical changes, so what? I don't think that it's a bad thing to have a simple interface for simple use cases and a complex one for more complex cases. I don't feel any need to convert every use of ReadBuffer() in the source code to ReadBufferExtended(), for example. Nor do I feel like we should have changed every call to LockAcquire() instead of introducing LockAcquireExtended(). This case seems analogous, and there are a few advantages. One is that it avoids creating meaningless conflicts when back-patching. Another is that when a function gets an extra argument, that often turns out to be something that happens more than once. It's nice not to keep having to whack around every call in the tree every time the call signature changes. Also, finding the small number of callers that need non-default behavior is simplified, because you can just grep for the Extended version of the function and there won't be many hits. I don't feel that there's only one right way to do this, but I think Michael's position is both reasonable and similar to what we have done in previous cases of this sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Feb 13, 2016 at 1:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-02-12 12:37:35 -0500, Robert Haas wrote: >>> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote: >>> > I'm not really a fan. I'd rather change existing callers to add a >>> > 'flags' bitmask argument. Right now there can't really be XLogInserts() >>> > in extension code, so that's pretty ok to change. >>> >>> Yeah, but to what benefit? You're just turning a smaller patch into a >>> bigger one and requiring churning a bunch of code that wouldn't >>> otherwise need to be touched. I think Michael has a good point. >> >> It has the advantage of not ending up with an extra interface, that >> we're otherwise never going to get rid of? If not now, when would we >> remove it? Sure it touches a few more lines, but that's entirely trivial >> mechanical changes, so what? Note: the patch has grown from 15kB to 46kB by switching to the extended interface to the addition of an argument in XLogInsert(). > I don't feel that there's only one right way to do this, but I think > Michael's position is both reasonable and similar to what we have done > in previous cases of this sort. To be honest, my heart still balances for the Extended() interface. This reduces the risk of conflicts with back-patching with 9.5. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Sat, Feb 13, 2016 at 2:26 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sat, Feb 13, 2016 at 1:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote: >>> On 2016-02-12 12:37:35 -0500, Robert Haas wrote: >>>> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote: >>>> > I'm not really a fan. I'd rather change existing callers to add a >>>> > 'flags' bitmask argument. Right now there can't really be XLogInserts() >>>> > in extension code, so that's pretty ok to change. >>>> >>>> Yeah, but to what benefit? You're just turning a smaller patch into a >>>> bigger one and requiring churning a bunch of code that wouldn't >>>> otherwise need to be touched. I think Michael has a good point. >>> >>> It has the advantage of not ending up with an extra interface, that >>> we're otherwise never going to get rid of? If not now, when would we >>> remove it? Sure it touches a few more lines, but that's entirely trivial >>> mechanical changes, so what? > > Note: the patch has grown from 15kB to 46kB by switching to the > extended interface to the addition of an argument in XLogInsert(). > >> I don't feel that there's only one right way to do this, but I think >> Michael's position is both reasonable and similar to what we have done >> in previous cases of this sort. > > To be honest, my heart still balances for the Extended() interface. > This reduces the risk of conflicts with back-patching with 9.5. Andres, others, what else can I do to make this thread move on? I can produce any version of this patch depending on committer's requirements. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Amit Kapila
Date:
On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>
> OK, here is attached a new version that I hope addresses all the
> points raised until now. The following things are changed:
> - Extend XLogInsert with a new uint8 argument to have flags. As of now
> there is one flag: XLOG_INSERT_NO_PROGRESS that can be set to not
> update the progress. By default, the progress LSN is updated.
> - Add extra check in bgwriter to not log a snapshot to be sure that no
> snapshots are logged when there is no activity since last snapshot
> taken, and since last checkpoint.
>
>
>
> OK, here is attached a new version that I hope addresses all the
> points raised until now. The following things are changed:
> - Extend XLogInsert with a new uint8 argument to have flags. As of now
> there is one flag: XLOG_INSERT_NO_PROGRESS that can be set to not
> update the progress. By default, the progress LSN is updated.
> - Add extra check in bgwriter to not log a snapshot to be sure that no
> snapshots are logged when there is no activity since last snapshot
> taken, and since last checkpoint.
>
You doesn't seem to have taken care of below typo in your patch as
pointed out by me earlier.
+ * to not rely on taking an exclusive lock an all the WAL insertion locks,
/an all/on all
Does this in anyway take care of the case when there is an activity
on unlogged tables?
I think avoiding to perform checkpoints in an idle system is introduced
in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 whereas
unlogged relation is introduced by commit
53dbc27c62d8e1b6c5253feba04a5094cb8fe046 which is much later.
Now, I think one might consider it okay to not do anything for
unlogged tables as it is not done previously and this patch is
anyway improving the current situation, but I feel if we agree
that checkpoints will get skipped if the only operations that
are happening in the system are on unlogged relations, then
it is better to add it in comments as an improvement even if we
don't want to address it as part of this patch.
+ elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+ (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
+ (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);
Are you proposing to have the newly intorduced elog messages
as part of commit or are these just for debugging purpose? If you
are proposing for commit, then I think it is worth to justify the need
of same and we should discuss what is the appropriate log level,
otherwise, I think you can have these in an additional patch just for
verification as the main patch is now very close to being
ready for committer.
Also, I think it is worth to once take the performance data for
write tests (using pgbench 30 minute run or some other way) with
minimum checkpoint_timeout (i.e 30s) to see if the additional locking
has any impact on performance. I think taking locks at intervals
of 15s or 30s should not matter much, but it is better to be safe.
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Fri, Feb 19, 2016 at 10:33 PM, Amit Kapila wrote: > On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier wrote: > You doesn't seem to have taken care of below typo in your patch as > pointed out by me earlier. > > + * to not rely on taking an exclusive lock an all the WAL insertion locks, > > /an all/on all Sorry about that. Hopefully fixed now. > Does this in anyway take care of the case when there is an activity > on unlogged tables? > I think avoiding to perform checkpoints in an idle system is introduced > in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 whereas > unlogged relation is introduced by commit > 53dbc27c62d8e1b6c5253feba04a5094cb8fe046 which is much later. > Now, I think one might consider it okay to not do anything for > unlogged tables as it is not done previously and this patch is > anyway improving the current situation, but I feel if we agree > that checkpoints will get skipped if the only operations that > are happening in the system are on unlogged relations, then > it is better to add it in comments as an improvement even if we > don't want to address it as part of this patch. Nope, nothing is done, just to not complicate the patch more than necessary. I agree though that we had better not update the progress LSN when logging something related to INIT_FORKNUM, there is little point to run non-forced checkpoints in this case as on recovery unlogged relations are wiped out. > + elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt > %X/%X", > + (uint32) (progress_lsn >> 32), (uint32) progress_lsn, > + (uint32) (ControlFile->checkPoint >> 32), (uint32) > ControlFile->checkPoint); > > Are you proposing to have the newly introduced elog messages > as part of commit or are these just for debugging purpose? If you > are proposing for commit, then I think it is worth to justify the need > of same and we should discuss what is the appropriate log level, > otherwise, I think you can have these in an additional patch just for > verification as the main patch is now very close to being > ready for committer. I have never intended to have those logs being part of the patch that should be committed, and putting them at LOG level avoided to have a bunch of garbage in the logs during the tests (got lazy to grep on the entries for those tests). I am no:t against adding a comment, here is one I just came up with: --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -448,6 +448,12 @@ typedef struct XLogwrtResult * hence reducing the impact of the activity lookup. This takes also * advantage to avoid 8-byte torn reads on some platforms by using the * fact that each insert lock is located on the same cache line. + * XXX: There is still room for more improvements here, particularly + * WAL operations related to unlogged relations (INIT_FORKNUM) should not + * update the progress LSN as those relations are reset during crash + * recovery so enforcing buffers of such relations to be flushed for + * example in the case of a load only on unlogged relations is a waste + * of disk write. */ > Also, I think it is worth to once take the performance data for > write tests (using pgbench 30 minute run or some other way) with > minimum checkpoint_timeout (i.e 30s) to see if the additional locking > has any impact on performance. I think taking locks at intervals > of 15s or 30s should not matter much, but it is better to be safe. I don't think performance will be impacted, but there are no reasons to not do any measurements either. I'll try to get some graphs tomorrow with runs on my laptop, mainly looking for any effects of this patch on TPS when checkpoints show up. For now attached is an updated patch, with a second patch (b) including the logs for testing. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Robert Haas
Date:
On Fri, Feb 19, 2016 at 6:32 AM, Michael Paquier <michael.paquier@gmail.com> wrote: >> To be honest, my heart still balances for the Extended() interface. >> This reduces the risk of conflicts with back-patching with 9.5. > > Andres, others, what else can I do to make this thread move on? I can > produce any version of this patch depending on committer's > requirements. FWIW, I don't expect to have time to review this in the level of detail that would make me confident to commit it any time soon. I've said my piece on what I think the final patch should look like, and I hope that argument was persuasive, but I don't have anything further to add to what I already said. I hope some other committer has some cycles to look at this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Tue, Feb 23, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 19, 2016 at 6:32 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >>> To be honest, my heart still balances for the Extended() interface. >>> This reduces the risk of conflicts with back-patching with 9.5. >> >> Andres, others, what else can I do to make this thread move on? I can >> produce any version of this patch depending on committer's >> requirements. > > FWIW, I don't expect to have time to review this in the level of > detail that would make me confident to commit it any time soon. I've > said my piece on what I think the final patch should look like, and I > hope that argument was persuasive, but I don't have anything further > to add to what I already said. I hope some other committer has some > cycles to look at this. This has the merit to be clear, thanks for the input. Whatever the approach taken at the end we have two candidates: - Extend XLogInsert() with an extra argument for flags (Andres) - Introduce XLogInsertExtended with this extra argument and let XLogInsert() in peace (Robert and I). Actually, I lied, there was still something I could do for this thread: attached are two patches implementing both approaches as respectively a-1 and a-2. Patch b is the set of logs I used for the tests to show with a low checkpoint_timeout that checkpoints are getting correctly skipped on an idle system. Let's see what happens next. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Wed, Feb 24, 2016 at 2:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > This has the merit to be clear, thanks for the input. Whatever the > approach taken at the end we have two candidates: > - Extend XLogInsert() with an extra argument for flags (Andres) > - Introduce XLogInsertExtended with this extra argument and let > XLogInsert() in peace (Robert and I). > Actually, I lied, there was still something I could do for this > thread: attached are two patches implementing both approaches as > respectively a-1 and a-2. Patch b is the set of logs I used for the > tests to show with a low checkpoint_timeout that checkpoints are > getting correctly skipped on an idle system. > > Let's see what happens next. FWIW, I just made a couple of 5-min pgbench runs on my laptop (scale 100, 24 clients using 4 threads) with checkpoint_timeout = 30s and a small value of shared_buffers to minimize the work of each checkpoint, and I am seeing similar spikes with variations that can be assimilated to noise. See for example the graph attached. Trying to put more contention on the extra locks taken before entering the checkpoint section to scan the progress LSN looks kind of difficult :) But that's not really a surprise. -- Michael
Attachment
Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
David Steele
Date:
On 2/24/16 12:40 AM, Michael Paquier wrote: > This has the merit to be clear, thanks for the input. Whatever the > approach taken at the end we have two candidates: > - Extend XLogInsert() with an extra argument for flags (Andres) > - Introduce XLogInsertExtended with this extra argument and let > XLogInsert() in peace (Robert and I). > Actually, I lied, there was still something I could do for this > thread: attached are two patches implementing both approaches as > respectively a-1 and a-2. Patch b is the set of logs I used for the > tests to show with a low checkpoint_timeout that checkpoints are > getting correctly skipped on an idle system. Unfortunately neither A nor B apply anymore. However, since the patches can still be read through I wonder if Robert or Andres would care to opine on whether A or B is better now that they can see the full implementation? For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare use case where flags are required. This can always be refactored in the future if/when the use of flags spreads. I think it would be good to make a decision on this before asking Michael to rebase. Thanks, -- -David david@pgmasters.net
Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Mar 14, 2016 at 6:46 PM, David Steele <david@pgmasters.net> wrote: > On 2/24/16 12:40 AM, Michael Paquier wrote: > >> This has the merit to be clear, thanks for the input. Whatever the >> approach taken at the end we have two candidates: >> - Extend XLogInsert() with an extra argument for flags (Andres) >> - Introduce XLogInsertExtended with this extra argument and let >> XLogInsert() in peace (Robert and I). >> Actually, I lied, there was still something I could do for this >> thread: attached are two patches implementing both approaches as >> respectively a-1 and a-2. Patch b is the set of logs I used for the >> tests to show with a low checkpoint_timeout that checkpoints are >> getting correctly skipped on an idle system. > > > Unfortunately neither A nor B apply anymore. > > However, since the patches can still be read through I wonder if Robert or > Andres would care to opine on whether A or B is better now that they can see > the full implementation? > > For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare > use case where flags are required. This can always be refactored in the > future if/when the use of flags spreads. > > I think it would be good to make a decision on this before asking Michael to > rebase. That's a bit embarrassing, the last versions should be able to apply cleanly as there have not been changes in this area of the code lately... But... I did a mistake when generating the patches by diff'ing them from an incorrect commit number... This explains why they exploded in size, so attached are the corrected rebased versions. Too many patches I guess.. And both of them are attached by the way. -- Michael
Attachment
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Simon Riggs
Date:
On 14 March 2016 at 17:46, David Steele <david@pgmasters.net> wrote:
--
On 2/24/16 12:40 AM, Michael Paquier wrote:This has the merit to be clear, thanks for the input. Whatever the
approach taken at the end we have two candidates:
- Extend XLogInsert() with an extra argument for flags (Andres)
- Introduce XLogInsertExtended with this extra argument and let
XLogInsert() in peace (Robert and I).
Actually, I lied, there was still something I could do for this
thread: attached are two patches implementing both approaches as
respectively a-1 and a-2. Patch b is the set of logs I used for the
tests to show with a low checkpoint_timeout that checkpoints are
getting correctly skipped on an idle system.
Unfortunately neither A nor B apply anymore.
However, since the patches can still be read through I wonder if Robert or Andres would care to opine on whether A or B is better now that they can see the full implementation?
For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare use case where flags are required. This can always be refactored in the future if/when the use of flags spreads.
XLogInsertExtended() is the one I would commit, if.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Simon Riggs
Date:
On 3 April 2016 at 21:32, Simon Riggs <simon@2ndquadrant.com> wrote:
--
On 14 March 2016 at 17:46, David Steele <david@pgmasters.net> wrote:On 2/24/16 12:40 AM, Michael Paquier wrote:This has the merit to be clear, thanks for the input. Whatever the
approach taken at the end we have two candidates:
- Extend XLogInsert() with an extra argument for flags (Andres)
- Introduce XLogInsertExtended with this extra argument and let
XLogInsert() in peace (Robert and I).
Actually, I lied, there was still something I could do for this
thread: attached are two patches implementing both approaches as
respectively a-1 and a-2. Patch b is the set of logs I used for the
tests to show with a low checkpoint_timeout that checkpoints are
getting correctly skipped on an idle system.
Unfortunately neither A nor B apply anymore.
However, since the patches can still be read through I wonder if Robert or Andres would care to opine on whether A or B is better now that they can see the full implementation?
For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare use case where flags are required. This can always be refactored in the future if/when the use of flags spreads.XLogInsertExtended() is the one I would commit, if.
I'm not very excited about this patch. Too much code for so little benefit and fragile too.
I'm not even sure what definition of "meaningful progress" is useful. If we commit this, a similar bug could be filed for many similar WAL record types.
Since we zero out WAL files specifically to allow them to be compressed in the archive, this patch saves a few bytes in the archive. Seems like we could fake up some WAL files if needed for restore with an external tool, if we really care.
Since we agree it wouldn't be backpatched, its pretty much saying we don't care.
A promising approach would be to just skip logging the RUNNING_XACTS record if there are no xacts running, which is the case we care about here (no progress => no xacts). HS startup can only happen at a checkpoint, so if there is no WAL file there is no checkpoint, so we don't need the RUNNING_XACTS record to be written. That's a short patch.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Apr 4, 2016 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm not very excited about this patch. Too much code for so little benefit > and fragile too. > > I'm not even sure what definition of "meaningful progress" is useful. If we > commit this, a similar bug could be filed for many similar WAL record types. Up to now, "progress" means the addition of WAL records by backends, "meaningful" refers to records that impact the decision of skipping checkpoints or not. > Since we zero out WAL files specifically to allow them to be compressed in > the archive, this patch saves a few bytes in the archive. Seems like we > could fake up some WAL files if needed for restore with an external tool, if > we really care. This assumes that most deployments compress the segments when archiving them for simplicity. I would believe that the reverse is true, most of archive command processes do not compress them. And I am pretty sure that some cloud deployments have as well pricing models depending on the number of accesses done on an instance. > Since we agree it wouldn't be backpatched, its pretty much saying we don't > care. Seeing the flow of this thread, that's not completely true. Andres has been arguing for no backpatch, Amit and I not, because the checkpoint skip logic is incorrect. The logic to decide is a checkpoint should be skipped or not is still broken on back-branches for wal_level = hot_standby. That's related to the former complain reported here. > A promising approach would be to just skip logging the RUNNING_XACTS record > if there are no xacts running, which is the case we care about here (no > progress => no xacts). HS startup can only happen at a checkpoint, so if > there is no WAL file there is no checkpoint, so we don't need the > RUNNING_XACTS record to be written. That's a short patch. I am not sure if that's actually simpler... Do you mean here that we should save the latest transaction ID at the moment a snapshot is taken and skip the next snapshot if TXID has not progressed? Say in LogStandbySnapshot() when a snapshot is taken, we allocate the last XID with ReadNewTransactionId() in XlogCtlData, then compare it the next time a standby snapshot is taken. With such a logic, you need to do something like that: - Save the last transaction XID when RUNNING_XACTS has been logged in shmem - Save in shmem as well if the previous RUNNING_XACTS has overflowed its subxids or not. If it has overflowed, we should log RUNNING_XACTS next time LogStandbySnapshot is logged to prevent the case of pending snapshots at recovery. AFAIK, standby snapshots with no xacts running are still useful at recovery for two things to accelerate a hot standby initialization: - If the previous snapshot overflowed its subxids, the next empty snapshot can help out clear the situation. That's the case where the bgwriter snapshot is helpful actually, because there is no need to wait for the next checkpoint record to be replayed to initialize a hot standby. - An empty snapshot after checkpoint is mandatory, even if it is empty, no? Knowing that, the next checkpoint cannot be skipped because the logic mentioned above in xlog.c to decide if a checkpoint should be skipped or not ignores the fact that a standby snapshot has been logged. It bases its logic on the sole presence of a checkpoint record, so this will never work. I mean this check of course: if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY| CHECKPOINT_FORCE)) == 0) { if (prevPtr ==ControlFile->checkPointCopy.redo && prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE) { WALInsertLockRelease(); LWLockRelease(CheckpointLock); END_CRIT_SECTION(); return; } } The approach introducing the concept of WAL progress addresses the problem of unnecessary checkpoints and of unnecessary standby snapshots. If we take the problem only from the angle of RUNNING_XACTS the current logic used to check if a checkpoint should be skipped or not is not addressed. -- Michael
Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
From
Michael Paquier
Date:
On Mon, Apr 4, 2016 at 3:22 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > The approach introducing the concept of WAL progress addresses the > problem of unnecessary checkpoints and of unnecessary standby > snapshots. If we take the problem only from the angle of RUNNING_XACTS > the current logic used to check if a checkpoint should be skipped or > not is not addressed. This discussion has been stalling for a while regarding the main patches... Attached are rebased versions based on the approach with XLogInsertExtended(). And I have added as well a couple of days ago an entry on the 9.6 open item page, not as a 9.6 open item, but as an older bug to keep at least track of this problem. -- Michael