Thread: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Jeff Janes

Date:

02 November 2015, 05:59:05

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Simple patch, applies and makes cleanly, does what it says and says what it does.

If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway.
 

Marking it ready for committer.

The new status of this patch is: Ready for Committer

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

02 November 2015, 07:10:04

On Mon, Nov 2, 2015 at 2:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Simple patch, applies and makes cleanly, does what it says and says what it does.
>
> If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway. 
>
> Marking it ready for committer.
>
> The new status of this patch is: Ready for Committer

Thanks! That was deadly fast.

Just wondering: shouldn't we keep the discussion around this patch on
-bugs instead? Not saying you are wrong, Jeff, I am just not sure what
would be the best practice regarding patches related to bugs. I would
think that it is at least necessary to keep the person who reported
the bug in CC to let him know the progress though.
--
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Jeff Janes

Date:

02 November 2015, 07:34:40

On Sun, Nov 1, 2015 at 11:09 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Nov 2, 2015 at 2:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Simple patch, applies and makes cleanly, does what it says and says what it does.
>>
>> If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway. 
>>
>> Marking it ready for committer.
>>
>> The new status of this patch is: Ready for Committer
>
> Thanks! That was deadly fast.
>
> Just wondering: shouldn't we keep the discussion around this patch on
> -bugs instead? Not saying you are wrong, Jeff, I am just not sure what
> would be the best practice regarding patches related to bugs. I would
> think that it is at least necessary to keep the person who reported
> the bug in CC to let him know the progress though.

I wasn't sure about -bugs vs -hackers either, but in this case I used
the review form built into the commit-fest app, and the app is what
sent the email.  As far as I know I can't change its destination or
its CC list, even if I had thought ahead to do so.

I think the bug reporter should certainly be CCed when the bug is
closed, I don't know about intermediate steps in the "sausage making"
process.  Something to think about for a bug-tracker we might
implement in the future.  I think most bugs are summarily handled by
committers, so don't go through the commitfest process at all.

Cheers,

Jeff

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

03 November 2015, 15:23:41

On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway. 

I'm sure other people here understand this better than me, but I
wonder if it wouldn't make more sense to somehow log this data only if
something material has changed in the data being logged.  This seems
to be trying to log something only if something else has been written
to WAL, which I'm not sure is the right test.

Also, this check here:

+                               if (last_snapshot_lsn != insert_lsn &&
+                                       checkpoint_lsn != insert_lsn &&
+                                       checkpoint_lsn != previous_lsn)

...seems like it will fire if there have been 0 or 1 WAL records since
the last checkpoint, regardless of what they are.  I'm not sure that's
the right test, and it'll break again the minute we have a third thing
we want to log only if the system is non-idle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

03 November 2015, 15:43:12

On 2015-11-03 10:23:35 -0500, Robert Haas wrote:
> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway.

> 
> I'm sure other people here understand this better than me, but I
> wonder if it wouldn't make more sense to somehow log this data only if
> something material has changed in the data being logged.

Phew. That doesn't seem easy to measure. I'm doubtful that it's worth
comparing the snapshot and such, especially in the back
branches.

We could maybe add something that we only log a snapshot if XXX
megabytes have been logged or something. But I don't know which number
to pick here - and if there's other write activity the price of a
snapshot record really isn't high.

Andres

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

03 November 2015, 23:37:09

On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-03 10:23:35 -0500, Robert Haas wrote:
>> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway. 
>>
>> I'm sure other people here understand this better than me, but I
>> wonder if it wouldn't make more sense to somehow log this data only if
>> something material has changed in the data being logged.
>
> Phew. That doesn't seem easy to measure. I'm doubtful that it's worth
> comparing the snapshot and such, especially in the back
> branches.

Well, I guess that's why I thought it would be more simple to check if
we are at the beginning of a segment at first sight. This has no
chance to break if anything else like that is being added in the
future as it doesn't depend on the record types, though new similar
records added on a timely manner would need a similar check. Perhaps
this could be coupled by a check on the last XLOG_SWITCH_XLOG record
instead of checkpoint activity though.

> We could maybe add something that we only log a snapshot if XXX
> megabytes have been logged or something. But I don't know which number
> to pick here - and if there's other write activity the price of a
> snapshot record really isn't high.

On a completely idle system, I don't think we should log any standby
records. This is what ~9.3 does.
--
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

03 November 2015, 23:39:20

On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de>
>wrote:
>> On 2015-11-03 10:23:35 -0500, Robert Haas wrote:
>>> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com>
>wrote:
>>> > If a transaction holding locks aborts on an otherwise idle server,
>perhaps it will take a very long time for a log-shipping standby to
>realize this.  But I have hard time believing that anyone who cares
>about that would be using log-shipping (rather than streaming) anyway.
>>>
>>> I'm sure other people here understand this better than me, but I
>>> wonder if it wouldn't make more sense to somehow log this data only
>if
>>> something material has changed in the data being logged.
>>
>> Phew. That doesn't seem easy to measure. I'm doubtful that it's worth
>> comparing the snapshot and such, especially in the back
>> branches.
>
>Well, I guess that's why I thought it would be more simple to check if
>we are at the beginning of a segment at first sight. This has no
>chance to break if anything else like that is being added in the
>future as it doesn't depend on the record types, though new similar
>records added on a timely manner would need a similar check. Perhaps
>this could be coupled by a check on the last XLOG_SWITCH_XLOG record
>instead of checkpoint activity though.
>
>> We could maybe add something that we only log a snapshot if XXX
>> megabytes have been logged or something. But I don't know which
>number
>> to pick here - and if there's other write activity the price of a
>> snapshot record really isn't high.
>
>On a completely idle system, I don't think we should log any standby
>records. This is what ~9.3 does.

Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't
seemto be the case.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

04 November 2015, 07:01:38

On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote:
>>On a completely idle system, I don't think we should log any standby
>>records. This is what ~9.3 does.
>
> Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't
seemto be the case.
 

Er, yes, sorry. I should have used clearer words: I meant idle system
with something running nothing including internal checkpoints. An
instance indeed generates a XLOG_RUNNING_XACTS record before a
checkpoint record on an idle system.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

04 November 2015, 07:09:45

On Wed, Nov 4, 2015 at 12:43 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-03 10:23:35 -0500, Robert Haas wrote:
>> On Mon, Nov 2, 2015 at 12:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > If a transaction holding locks aborts on an otherwise idle server, perhaps it will take a very long time for a
log-shippingstandby to realize this.  But I have hard time believing that anyone who cares about that would be using
log-shipping(rather than streaming) anyway. 
>>
>> I'm sure other people here understand this better than me, but I
>> wonder if it wouldn't make more sense to somehow log this data only if
>> something material has changed in the data being logged.
>
> Phew. That doesn't seem easy to measure. I'm doubtful that it's worth
> comparing the snapshot and such, especially in the back
> branches.
>
> We could maybe add something that we only log a snapshot if XXX
> megabytes have been logged or something. But I don't know which number
> to pick here - and if there's other write activity the price of a
> snapshot record really isn't high.

My first guess on the matter is that we would like to have an extra
condition that depends on max_wal_size with at least a minimum number
of segments generated since the last standby snapshot, perhaps
max_wal_size / 16, but this coefficient is clearly a rule of thumb.
With the default configuration of 1GB, that would be waiting for 4
segments to be generated before logging in a standby snapshot.
--
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

04 November 2015, 10:34:08

On 2015-11-04 16:01:28 +0900, Michael Paquier wrote:
> On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
> > On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote:
> >>On a completely idle system, I don't think we should log any standby
> >>records. This is what ~9.3 does.
> >
> > Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that doesn't
seemto be the case.

> 
> Er, yes, sorry. I should have used clearer words: I meant idle system
> with something running nothing including internal checkpoints.

Uh, but you'll always have checkpoints happen on wal_level =
hot_standby, even in 9.3?  Maybe I'm not parsing your sentence right.

As soon as a single checkpoint ever happened the early-return logic in
CreateCheckPoint() will fail to take the LogStandbySnapshot() in
CreateCheckPoint() into account. The test is:   if (curInsert == ControlFile->checkPoint +MAXALIGN(SizeOfXLogRecord +
sizeof(CheckPoint))&&ControlFile->checkPoint == ControlFile->checkPointCopy.redo)

which obviously doesn't work if there's been a WAL record logged after
the redo pointer has been determined etc.

The reason that a single checkpoint is needed to "jumpstart" the
pointless checkpoints is that otherwise we'll never have issued a
LogStandbySnapshot() and thus the above code block works if we started
from a proper shutdown checkpoint.

Independent of the idle issue, it seems to me that the location of the
LogStandbySnapshot() is actually rather suboptimal - it really should
really be before the CheckPointGuts(), not afterwards. As closer it's to
the redo pointer of the checkpoint a hot standby node starts up from,
the sooner that node can reach consistency.  There's no difference for
the first time a node starts from a basebackup (since we gotta replay
that checkpoint anyway before we're consistent), but if we start from a
restartpoint...

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

05 November 2015, 06:01:00

On Wed, Nov 4, 2015 at 7:33 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-04 16:01:28 +0900, Michael Paquier wrote:
>> On Wed, Nov 4, 2015 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On November 4, 2015 12:37:02 AM GMT+01:00, Michael Paquier wrote:
>> >>On a completely idle system, I don't think we should log any standby
>> >>records. This is what ~9.3 does.
>> >
>> > Are you sure? I think it'll around checkpoints, no? I thought Heikki had fixed that, but looking sound that
doesn'tseem to be the case.
 
>>
>> Er, yes, sorry. I should have used clearer words: I meant idle system
>> with something running nothing including internal checkpoints.
>
> Uh, but you'll always have checkpoints happen on wal_level =
> hot_standby, even in 9.3?  Maybe I'm not parsing your sentence right.

Reading again my previous sentence I cannot get the meaning of it
myself :) Well, I just meant that in ~9.3 LogStandbySnapshot() is
called at each checkpoint, checkpoints occurring after
checkpoint_timeout even if the system is idle.

> As soon as a single checkpoint ever happened the early-return logic in
> CreateCheckPoint() will fail to take the LogStandbySnapshot() in
> CreateCheckPoint() into account. The test is:
>     if (curInsert == ControlFile->checkPoint +
>         MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
>         ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
> which obviously doesn't work if there's been a WAL record logged after
> the redo pointer has been determined etc.

Yes. If segment switches are enforced at a pace faster than
checkpoint_timeout, this check considers that a checkpoint needs to
happen because a SWITCH_XLOG record is in-between. I am a bit
surprised that this should happen actually. The segment switch
triggers a checkpoint record, and vice-versa, even for idle systems.
Shouldn't we make this check a bit smarter then?

> The reason that a single checkpoint is needed to "jumpstart" the
> pointless checkpoints is that otherwise we'll never have issued a
> LogStandbySnapshot() and thus the above code block works if we started
> from a proper shutdown checkpoint.
>
> Independent of the idle issue, it seems to me that the location of the
> LogStandbySnapshot() is actually rather suboptimal - it really should
> really be before the CheckPointGuts(), not afterwards. As closer it's to
> the redo pointer of the checkpoint a hot standby node starts up from,
> the sooner that node can reach consistency.  There's no difference for
> the first time a node starts from a basebackup (since we gotta replay
> that checkpoint anyway before we're consistent), but if we start from a
> restartpoint...

Agreed. LogStandbySnapshot() is called after CheckPointGuts() since
its introduction in efc16ea5. This may save time. This would surely be
a master-only optimization though.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

06 November 2015, 07:47:34

On Thu, Nov 5, 2015 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 4, 2015 at 7:33 PM, Andres Freund wrote:
>> As soon as a single checkpoint ever happened the early-return logic in
>> CreateCheckPoint() will fail to take the LogStandbySnapshot() in
>> CreateCheckPoint() into account. The test is:
>>     if (curInsert == ControlFile->checkPoint +
>>         MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
>>         ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
>> which obviously doesn't work if there's been a WAL record logged after
>> the redo pointer has been determined etc.
>
> Yes. If segment switches are enforced at a pace faster than
> checkpoint_timeout, this check considers that a checkpoint needs to
> happen because a SWITCH_XLOG record is in-between. I am a bit
> surprised that this should happen actually. The segment switch
> triggers a checkpoint record, and vice-versa, even for idle systems.
> Shouldn't we make this check a bit smarter then?

Ah, the documentation clearly explains that setting archive_timeout
will enforce a segment switch if any activity occurred, including a
checkpoint:
http://www.postgresql.org/docs/devel/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-ARCHIVING
I missed that, sorry.

I have as well thought a bit about adding a space-related constraint
on the standby snapshot generated by the bgwriter, so as to not rely
entirely on the interval of 15s. I finished with the attached that
uses a check based on CheckPointSegments / 8 to be sure that at least
this number of segments has been generated since the last checkpoint
before logging a new snapshot. I guess that's less brittle than the
last patch. Thoughts?
--
Michael

Attachment

20151106_archive_idle_v3.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

06 November 2015, 16:43:00

On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I have as well thought a bit about adding a space-related constraint
> on the standby snapshot generated by the bgwriter, so as to not rely
> entirely on the interval of 15s. I finished with the attached that
> uses a check based on CheckPointSegments / 8 to be sure that at least
> this number of segments has been generated since the last checkpoint
> before logging a new snapshot. I guess that's less brittle than the
> last patch. Thoughts?

I can't see why that would be a good idea.  My understanding is that
the logical decoding code needs to get those messages pretty
regularly, and I don't see why that need would be reduced on systems
where CheckPointSegments is large.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

06 November 2015, 16:52:43

On 2015-11-06 11:42:56 -0500, Robert Haas wrote:
> On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > I have as well thought a bit about adding a space-related constraint
> > on the standby snapshot generated by the bgwriter, so as to not rely
> > entirely on the interval of 15s. I finished with the attached that
> > uses a check based on CheckPointSegments / 8 to be sure that at least
> > this number of segments has been generated since the last checkpoint
> > before logging a new snapshot. I guess that's less brittle than the
> > last patch. Thoughts?
> 
> I can't see why that would be a good idea.  My understanding is that
> the logical decoding code needs to get those messages pretty
> regularly, and I don't see why that need would be reduced on systems
> where CheckPointSegments is large.

Precisely.

What I'm thinking of right now is a marker somewhere in shared memory,
that tells whether anything worthwhile has happened since the last
determination of the redo pointer. Where standby snapshots don't
count. That seems like it'd be to maintain going forward than doing
precise size calculations like CreateCheckPoint() already does, and
would additionally need to handle its own standby snapshot, not to speak
of the background ones.

Seems like it'd be doable in ReserveXLogInsertLocation().

Whether it's actually worthwhile I'm not all that sure tho.

Andres

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

06 November 2015, 17:21:55

On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-06 11:42:56 -0500, Robert Haas wrote:
>> On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > I have as well thought a bit about adding a space-related constraint
>> > on the standby snapshot generated by the bgwriter, so as to not rely
>> > entirely on the interval of 15s. I finished with the attached that
>> > uses a check based on CheckPointSegments / 8 to be sure that at least
>> > this number of segments has been generated since the last checkpoint
>> > before logging a new snapshot. I guess that's less brittle than the
>> > last patch. Thoughts?
>>
>> I can't see why that would be a good idea.  My understanding is that
>> the logical decoding code needs to get those messages pretty
>> regularly, and I don't see why that need would be reduced on systems
>> where CheckPointSegments is large.
>
> Precisely.
>
> What I'm thinking of right now is a marker somewhere in shared memory,
> that tells whether anything worthwhile has happened since the last
> determination of the redo pointer. Where standby snapshots don't
> count. That seems like it'd be to maintain going forward than doing
> precise size calculations like CreateCheckPoint() already does, and
> would additionally need to handle its own standby snapshot, not to speak
> of the background ones.

Good idea.

> Seems like it'd be doable in ReserveXLogInsertLocation().
>
> Whether it's actually worthwhile I'm not all that sure tho.

Why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

06 November 2015, 17:26:11

On November 6, 2015 6:21:50 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de>
>wrote:
>> Seems like it'd be doable in ReserveXLogInsertLocation().
>>
>> Whether it's actually worthwhile I'm not all that sure tho.
>
>Why not?

Adds another instruction in one of the hottest spinlock protected sections of PG. Probably won't be significant,
but...

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

06 November 2015, 17:27:57

On Fri, Nov 6, 2015 at 12:26 PM, Andres Freund <andres@anarazel.de> wrote:
> On November 6, 2015 6:21:50 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>>On Fri, Nov 6, 2015 at 11:52 AM, Andres Freund <andres@anarazel.de>
>>wrote:
>>> Seems like it'd be doable in ReserveXLogInsertLocation().
>>>
>>> Whether it's actually worthwhile I'm not all that sure tho.
>>
>>Why not?
>
> Adds another instruction in one of the hottest spinlock protected sections of PG. Probably won't be significant,
but...

Oh.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

07 November 2015, 06:54:14

On Sat, Nov 7, 2015 at 1:52 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-11-06 11:42:56 -0500, Robert Haas wrote:
>> On Fri, Nov 6, 2015 at 2:47 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > I have as well thought a bit about adding a space-related constraint
>> > on the standby snapshot generated by the bgwriter, so as to not rely
>> > entirely on the interval of 15s. I finished with the attached that
>> > uses a check based on CheckPointSegments / 8 to be sure that at least
>> > this number of segments has been generated since the last checkpoint
>> > before logging a new snapshot. I guess that's less brittle than the
>> > last patch. Thoughts?
>>
>> I can't see why that would be a good idea.  My understanding is that
>> the logical decoding code needs to get those messages pretty
>> regularly, and I don't see why that need would be reduced on systems
>> where CheckPointSegments is large.
>
> Precisely.
>
>
> What I'm thinking of right now is a marker somewhere in shared memory,
> that tells whether anything worthwhile has happened since the last
> determination of the redo pointer. Where standby snapshots don't
> count. That seems like it'd be to maintain going forward than doing
> precise size calculations like CreateCheckPoint() already does, and
> would additionally need to handle its own standby snapshot, not to speak
> of the background ones.

I thought about something like that at some point by saving a minimum
activity pointer in XLogCtl, updated each time a segment was forcibly
switched or after inserting a checkpoint record. Then the bgwriter
looked at if the current insert position matched this minimum activity
pointer, skipping LogStandbySnapshot if both positions match. Does
this match your line of thoughts?

> Seems like it'd be doable in ReserveXLogInsertLocation().
> Whether it's actually worthwhile I'm not all that sure tho.

I am not sure doing extra work there is a good idea. There are already
5 instructions within the spinlock section, and this code path is
already high in many profiles for busy systems.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

08 November 2015, 12:50:47

On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> I thought about something like that at some point by saving a minimum
> activity pointer in XLogCtl, updated each time a segment was forcibly
> switched or after inserting a checkpoint record. Then the bgwriter
> looked at if the current insert position matched this minimum activity
> pointer, skipping LogStandbySnapshot if both positions match. Does
> this match your line of thoughts?

Looking at the code, it occurred to me that the LSN position saved for
a XLOG_SWITCH record is the last position of current segment, so we
would still need to check if the current insert LSN matches the
beginning of a new segment and if the last segment was forcibly
switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
Thoughts?
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

20 December 2015, 13:14:12

On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
>> I thought about something like that at some point by saving a minimum
>> activity pointer in XLogCtl, updated each time a segment was forcibly
>> switched or after inserting a checkpoint record. Then the bgwriter
>> looked at if the current insert position matched this minimum activity
>> pointer, skipping LogStandbySnapshot if both positions match. Does
>> this match your line of thoughts?
>
> Looking at the code, it occurred to me that the LSN position saved for
> a XLOG_SWITCH record is the last position of current segment, so we
> would still need to check if the current insert LSN matches the
> beginning of a new segment and if the last segment was forcibly
> switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> Thoughts?

I haven't given up on this patch yet, and putting again my head on
this problem I have finished with the patch attached, which checks if
the current insert LSN position is at the beginning of a segment that
has just been switched to decide if a standby snapshot should be
logged or not. This allows bringing back an idle system to the pre-9.3
state where a segment would be archived in the case of a low
archive_timeout only when a checkpoint has been issued on the system.
In order to achieve this idea I have added a field on XLogCtl that
saves the last time a segment has been forcibly switched after
XLOG_SWITCH.

Honestly I am failing to see why we should track the progress since
last checkpoint as mentioned upthread, and the current behavior is
certainly a regression.

Speaking of which, this patch was registered in this CF, I am moving
it to the next as a bug fix.
Regards,
--
Michael

Attachment

archive-idle-v4.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

21 December 2015, 07:26:31

On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Speaking of which, this patch was registered in this CF, I am moving
> it to the next as a bug fix.

I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it
is possible that the return LSN pointer is on the new segment that has
been forcibly archived if RequestXLogSwitch is called multiple times
and that subsequent calls are not necessary. Patch updated.
--
Michael

Attachment

archive-timeout-v5.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

16 January 2016, 10:10:54

On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> I thought about something like that at some point by saving a minimum
> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> looked at if the current insert position matched this minimum activity
> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> this match your line of thoughts?
> >
> > Looking at the code, it occurred to me that the LSN position saved for
> > a XLOG_SWITCH record is the last position of current segment, so we
> > would still need to check if the current insert LSN matches the
> > beginning of a new segment and if the last segment was forcibly
> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> > Thoughts?
>
> I haven't given up on this patch yet, and putting again my head on
> this problem I have finished with the patch attached, which checks if
> the current insert LSN position is at the beginning of a segment that
> has just been switched to decide if a standby snapshot should be
> logged or not. This allows bringing back an idle system to the pre-9.3
> state where a segment would be archived in the case of a low
> archive_timeout only when a checkpoint has been issued on the system.

Won't this be a problem if the checkpoint occurs after a long time and in

the mean time there is some activity in the server?

Another idea to solve this issue could be to see if there is any progress

in the server by checking buffers dirtied/written (we can refer that

information using pgBufferUsage) since last time we log this record in

bgwriter.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

16 January 2016, 11:38:29

On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
>> >> I thought about something like that at some point by saving a minimum
>> >> activity pointer in XLogCtl, updated each time a segment was forcibly
>> >> switched or after inserting a checkpoint record. Then the bgwriter
>> >> looked at if the current insert position matched this minimum activity
>> >> pointer, skipping LogStandbySnapshot if both positions match. Does
>> >> this match your line of thoughts?
>> >
>> > Looking at the code, it occurred to me that the LSN position saved for
>> > a XLOG_SWITCH record is the last position of current segment, so we
>> > would still need to check if the current insert LSN matches the
>> > beginning of a new segment and if the last segment was forcibly
>> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
>> > Thoughts?
>>
>> I haven't given up on this patch yet, and putting again my head on
>> this problem I have finished with the patch attached, which checks if
>> the current insert LSN position is at the beginning of a segment that
>> has just been switched to decide if a standby snapshot should be
>> logged or not. This allows bringing back an idle system to the pre-9.3
>> state where a segment would be archived in the case of a low
>> archive_timeout only when a checkpoint has been issued on the system.
>>
>
> Won't this be a problem if the checkpoint occurs after a long time and in
> the mean time there is some activity in the server?

Why? If there is some activity on the server, the snapshot will be
immediately taken at the next iteration without caring about the
checkpoint.

> Another idea to solve this issue could be to see if there is any progress
> in the server by checking buffers dirtied/written (we can refer that
> information using pgBufferUsage) since last time we log this record in
> bgwriter.

Yes, that may be an idea worth considering, but I really think that we
had better measure that at WAL level..
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

16 January 2016, 12:07:46

On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >> On Sun, Nov 8, 2015 at 9:50 PM, Michael Paquier
> >> <michael.paquier@gmail.com> wrote:
> >> > On Sat, Nov 7, 2015 at 3:54 PM, Michael Paquier wrote:
> >> >> I thought about something like that at some point by saving a minimum
> >> >> activity pointer in XLogCtl, updated each time a segment was forcibly
> >> >> switched or after inserting a checkpoint record. Then the bgwriter
> >> >> looked at if the current insert position matched this minimum activity
> >> >> pointer, skipping LogStandbySnapshot if both positions match. Does
> >> >> this match your line of thoughts?
> >> >
> >> > Looking at the code, it occurred to me that the LSN position saved for
> >> > a XLOG_SWITCH record is the last position of current segment, so we
> >> > would still need to check if the current insert LSN matches the
> >> > beginning of a new segment and if the last segment was forcibly
> >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for example.
> >> > Thoughts?
> >>
> >> I haven't given up on this patch yet, and putting again my head on
> >> this problem I have finished with the patch attached, which checks if
> >> the current insert LSN position is at the beginning of a segment that
> >> has just been switched to decide if a standby snapshot should be
> >> logged or not. This allows bringing back an idle system to the pre-9.3
> >> state where a segment would be archived in the case of a low
> >> archive_timeout only when a checkpoint has been issued on the system.
> >>
> >
> > Won't this be a problem if the checkpoint occurs after a long time and in
> > the mean time there is some activity in the server?
>
> Why? If there is some activity on the server, the snapshot will be
> immediately taken at the next iteration without caring about the
> checkpoint.
>

+ (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))

Do you mean to intend that it is protected by above check in the

patch?

Isn't it possible that so much WAL is inserted between bgwriter cycles,

that when it checks the location of WAL, it founds it to be at the beginning

of a new segment?

> > Another idea to solve this issue could be to see if there is any progress
> > in the server by checking buffers dirtied/written (we can refer that
> > information using pgBufferUsage) since last time we log this record in
> > bgwriter.
>
> Yes, that may be an idea worth considering, but I really think that we
> had better measure that at WAL level..
>

I thought this is quite close to the previous patch you proposed where

Andres wanted some measurement in terms of progress since last

checkpoint. I understand as per current code your patch can work, but

what if some more similar WAL records needs to be added?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

16 January 2016, 13:07:43

On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Sun, Dec 20, 2015 at 6:44 PM, Michael Paquier
>> > <michael.paquier@gmail.com> wrote:
>> >> > Looking at the code, it occurred to me that the LSN position saved
>> >> > for
>> >> > a XLOG_SWITCH record is the last position of current segment, so we
>> >> > would still need to check if the current insert LSN matches the
>> >> > beginning of a new segment and if the last segment was forcibly
>> >> > switched by saving RecPtr of RequestXLogSwitch in XLogCtl for
>> >> > example.
>> >> > Thoughts?
>> >>
>> >> I haven't given up on this patch yet, and putting again my head on
>> >> this problem I have finished with the patch attached, which checks if
>> >> the current insert LSN position is at the beginning of a segment that
>> >> has just been switched to decide if a standby snapshot should be
>> >> logged or not. This allows bringing back an idle system to the pre-9.3
>> >> state where a segment would be archived in the case of a low
>> >> archive_timeout only when a checkpoint has been issued on the system.
>> >>
>> >
>> > Won't this be a problem if the checkpoint occurs after a long time and
>> > in
>> > the mean time there is some activity in the server?
>>
>> Why? If there is some activity on the server, the snapshot will be
>> immediately taken at the next iteration without caring about the
>> checkpoint.
>>
>
> + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
>
> Do you mean to intend that it is protected by above check in the
> patch?

Yep, in combination with is_switch_current to check if the last
completed segment was forcibly switched.

> Isn't it possible that so much WAL is inserted between bgwriter cycles,
> that when it checks the location of WAL, it founds it to be at the beginning
> of a new segment?

Er, no. Even if the insert LSN is at the beginning of a new segment,
this would take a standby snapshot if the last segment was not
forcibly switched.

>> > Another idea to solve this issue could be to see if there is any
>> > progress
>> > in the server by checking buffers dirtied/written (we can refer that
>> > information using pgBufferUsage) since last time we log this record in
>> > bgwriter.
>>
>> Yes, that may be an idea worth considering, but I really think that we
>> had better measure that at WAL level..
>
> I thought this is quite close to the previous patch you proposed where
> Andres wanted some measurement in terms of progress since last
> checkpoint. I understand as per current code your patch can work, but
> what if some more similar WAL records needs to be added?

A record like this one for the bgwriter? We could still rely on the
same check tracking the last-forced-segment, no?

It seems to me that tracking progress here is not really only a matter
of the number of shared buffers dirtied or written, we would need as
well to track XLOG_STANDBY_LOCK and provide for them fresh snapshots.
Imagine for example a read-only load on the master with some exclusive
locks taken on relations from time to time (perhaps unlikely so but
who knows?).
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

17 January 2016, 04:38:05

On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Won't this be a problem if the checkpoint occurs after a long time and
> >> > in
> >> > the mean time there is some activity in the server?
> >>
> >> Why? If there is some activity on the server, the snapshot will be
> >> immediately taken at the next iteration without caring about the
> >> checkpoint.
> >>
> >
> > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
> >
> > Do you mean to intend that it is protected by above check in the
> > patch?
>
> Yep, in combination with is_switch_current to check if the last
> completed segment was forcibly switched.
>
> > Isn't it possible that so much WAL is inserted between bgwriter cycles,
> > that when it checks the location of WAL, it founds it to be at the beginning
> > of a new segment?
>
> Er, no. Even if the insert LSN is at the beginning of a new segment,
> this would take a standby snapshot if the last segment was not
> forcibly switched.
>

So here if I understand correctly the check related to the last segment

forcibly switched is based on the fact that if it is forcibly switched, then

we don't need to log this record? What is the reason of such an

assumption? It is not very clear by reading the comments you have

added in patch, may be you can expand comments slightly to explain

the reason of same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

18 January 2016, 05:24:10

On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Sat, Jan 16, 2016 at 9:07 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Sat, Jan 16, 2016 at 5:08 PM, Michael Paquier
>> > <michael.paquier@gmail.com>
>> > wrote:
>> >>
>> >> On Sat, Jan 16, 2016 at 7:10 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >> > Won't this be a problem if the checkpoint occurs after a long time
>> >> > and
>> >> > in
>> >> > the mean time there is some activity in the server?
>> >>
>> >> Why? If there is some activity on the server, the snapshot will be
>> >> immediately taken at the next iteration without caring about the
>> >> checkpoint.
>> >>
>> >
>> > + (insert_lsn % XLOG_SEG_SIZE) != SizeOfXLogLongPHD))
>> >
>> > Do you mean to intend that it is protected by above check in the
>> > patch?
>>
>> Yep, in combination with is_switch_current to check if the last
>> completed segment was forcibly switched.
>>
>> > Isn't it possible that so much WAL is inserted between bgwriter cycles,
>> > that when it checks the location of WAL, it founds it to be at the
>> > beginning
>> > of a new segment?
>>
>> Er, no. Even if the insert LSN is at the beginning of a new segment,
>> this would take a standby snapshot if the last segment was not
>> forcibly switched.
>>
>
> So here if I understand correctly the check related to the last segment
> forcibly switched is based on the fact that if it is forcibly switched, then
> we don't need to log this record? What is the reason of such an
> assumption?

Yes, the thing is that, as mentioned at the beginning of the thread, a
low value of archive_timeout causes a segment to be forcibly switched
at the end of the timeout defined by this parameter. In combination
with the standby snapshot taken in bgwriter since 9.4, this is causing
segments to be always switched even if a system is completely idle.
Before that, in 9.3 and older versions, we would have a segment
forcibly switched on an idle system only once per checkpoint. The
documentation is already clear about that actually. The whole point of
this patch is to fix this regression, aka switch to a new segment only
once per checkpoint.

> It is not very clear by reading the comments you have
> added in patch, may be you can expand comments slightly to explain
> the reason of same.

OK. Here are the comments that this patch is adding:
-             * only log if enough time has passed and some xlog record has
-             * been inserted.
+             * Only log if enough time has passed and some xlog record has
+             * been inserted on a new segment. On an idle system where
+             * segments can be archived in a fast pace with for example a
+             * low archive_command setting, avoid as well logging a new
+             * standby snapshot if the current insert position is still
+             * at the beginning of the segment that has just been switched.
+             *
+             * It is possible that GetXLogLastSwitchPtr points to the last
+             * position of previous segment or to the first position of the
+             * new segment after the switch, hence take both cases into
+             * account when deciding if a standby snapshot should be taken.
+             * (see comments on top of RequestXLogSwitch for more details).             */
It makes mention of the problem that it is trying to fix, perhaps you
mean that this should mention the fact that on an idle system standby
snapshots should happen at worst once per checkpoint?
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

18 January 2016, 13:19:46

On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >>
> >
> > So here if I understand correctly the check related to the last segment
> > forcibly switched is based on the fact that if it is forcibly switched, then
> > we don't need to log this record? What is the reason of such an
> > assumption?
>
> Yes, the thing is that, as mentioned at the beginning of the thread, a
> low value of archive_timeout causes a segment to be forcibly switched
> at the end of the timeout defined by this parameter. In combination
> with the standby snapshot taken in bgwriter since 9.4, this is causing
> segments to be always switched even if a system is completely idle.
> Before that, in 9.3 and older versions, we would have a segment
> forcibly switched on an idle system only once per checkpoint.
>

Okay, so this will fix the case where the system is idle, but what I

am slightly worried is that it should not miss to log the standby snapshot

due to this check when there is actually some activity in the system.

Why is not possible to have a case such that the segment is forcibly

switched due to reason other than checkpoint (user has requested the

same) and the current insert LSN is at beginning of a new segment

due to write activity? If that is possible, which to me theoretically seems

possible, then I think bgwriter will miss to log the standby snapshot.

>
> The
> documentation is already clear about that actually. The whole point of
> this patch is to fix this regression, aka switch to a new segment only
> once per checkpoint.
>
> > It is not very clear by reading the comments you have
> > added in patch, may be you can expand comments slightly to explain
> > the reason of same.
>
> OK. Here are the comments that this patch is adding:
> - * only log if enough time has passed and some xlog record has
> - * been inserted.
> + * Only log if enough time has passed and some xlog record has
> + * been inserted on a new segment. On an idle system where
> + * segments can be archived in a fast pace with for example a
> + * low archive_command setting, avoid as well logging a new
> + * standby snapshot if the current insert position is still
> + * at the beginning of the segment that has just been switched.
> + *
> + * It is possible that GetXLogLastSwitchPtr points to the last
> + * position of previous segment or to the first position of the
> + * new segment after the switch, hence take both cases into
> + * account when deciding if a standby snapshot should be taken.
> + * (see comments on top of RequestXLogSwitch for more details).
> */
> It makes mention of the problem that it is trying to fix, perhaps you
> mean that this should mention the fact that on an idle system standby
> snapshots should happen at worst once per checkpoint?
>

I mean to say that in below part of comment, explanation about the

the check related to insert position is quite clear whereas why it is

okay to avoid logging standby snapshot when the segment is not

forcibly switched is not apparent.

* avoid as well logging a new
* standby snapshot if the current insert position is still
* at the beginning of the segment that has just been switched.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

18 January 2016, 13:38:36

On 2015-12-21 16:26:25 +0900, Michael Paquier wrote:
> On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Speaking of which, this patch was registered in this CF, I am moving
> > it to the next as a bug fix.
> 
> I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it
> is possible that the return LSN pointer is on the new segment that has
> been forcibly archived if RequestXLogSwitch is called multiple times
> and that subsequent calls are not necessary. Patch updated.

I find this patch rather unsatisfactory. Yes, it kinda solves the
problem of archive timeout, but it leaves the bigger and longer standing
problems of unneccessary checkpoints with wal_level=hs in place. It's
also somewhat fragile in my opinion.

I think we rather want a per backend (or perhaps per wal insertion lock)
flag that says 'last relevant record inserted at', and a way to not set
that during insertion. Then during a checkpoint or the relevant bgwriter
code, we look wether anything relevant happened in any backend since the
last time we performed a checkpoint/logged a running xacts snapshot.

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

19 January 2016, 04:28:38

On Mon, Jan 18, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-12-21 16:26:25 +0900, Michael Paquier wrote:
> > On Sun, Dec 20, 2015 at 10:14 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> > > Speaking of which, this patch was registered in this CF, I am moving
> > > it to the next as a bug fix.
> >
> > I found a stupid bug in my previous patch: when issuing XLOG_SWITCH it
> > is possible that the return LSN pointer is on the new segment that has
> > been forcibly archived if RequestXLogSwitch is called multiple times
> > and that subsequent calls are not necessary. Patch updated.
>
> I find this patch rather unsatisfactory. Yes, it kinda solves the
> problem of archive timeout, but it leaves the bigger and longer standing
> problems of unneccessary checkpoints with wal_level=hs in place. It's
> also somewhat fragile in my opinion.
>
> I think we rather want a per backend (or perhaps per wal insertion lock)
> flag that says 'last relevant record inserted at', and a way to not set
> that during insertion. Then during a checkpoint or the relevant bgwriter
> code, we look wether anything relevant happened in any backend since the
> last time we performed a checkpoint/logged a running xacts snapshot.
>

Sounds to be a more robust way of dealing with this problem. Michael,

if you don't disagree with above proposal, then we can mark this patch

as Waiting on Author?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

19 January 2016, 07:11:55

On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> On Sun, Jan 17, 2016 at 1:37 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Sat, Jan 16, 2016 at 6:37 PM, Michael Paquier
>> > <michael.paquier@gmail.com>
>> > wrote:
>> >>
>> >>
>> >>
>> >
>> > So here if I understand correctly the check related to the last segment
>> > forcibly switched is based on the fact that if it is forcibly switched,
>> > then
>> > we don't need to log this record? What is the reason of such an
>> > assumption?
>>
>> Yes, the thing is that, as mentioned at the beginning of the thread, a
>> low value of archive_timeout causes a segment to be forcibly switched
>> at the end of the timeout defined by this parameter. In combination
>> with the standby snapshot taken in bgwriter since 9.4, this is causing
>> segments to be always switched even if a system is completely idle.
>> Before that, in 9.3 and older versions, we would have a segment
>> forcibly switched on an idle system only once per checkpoint.
>>
>
> Okay, so this will fix the case where the system is idle, but what I
> am slightly worried is that it should not miss to log the standby snapshot
> due to this check when there is actually some activity in the system.
> Why is not possible to have a case such that the segment is forcibly
> switched due to reason other than checkpoint (user has requested the
> same) and the current insert LSN is at beginning of a new segment
> due to write activity? If that is possible, which to me theoretically seems
> possible, then I think bgwriter will miss to log the standby snapshot.

Yes, with segments forcibly switched by users this check would make
the bgwriter not log in a snapshot if the last action done by server
was XLOG_SWITCH. Based on the patch I sent, the only alternative would
be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine
enough for back-branches..

>> The
>> documentation is already clear about that actually. The whole point of
>> this patch is to fix this regression, aka switch to a new segment only
>> once per checkpoint.
>>
>> > It is not very clear by reading the comments you have
>> > added in patch, may be you can expand comments slightly to explain
>> > the reason of same.
>>
>> OK. Here are the comments that this patch is adding:
>> -             * only log if enough time has passed and some xlog record
>> has
>> -             * been inserted.
>> +             * Only log if enough time has passed and some xlog record
>> has
>> +             * been inserted on a new segment. On an idle system where
>> +             * segments can be archived in a fast pace with for example a
>> +             * low archive_command setting, avoid as well logging a new
>> +             * standby snapshot if the current insert position is still
>> +             * at the beginning of the segment that has just been
>> switched.
>> +             *
>> +             * It is possible that GetXLogLastSwitchPtr points to the
>> last
>> +             * position of previous segment or to the first position of
>> the
>> +             * new segment after the switch, hence take both cases into
>> +             * account when deciding if a standby snapshot should be
>> taken.
>> +             * (see comments on top of RequestXLogSwitch for more
>> details).
>>               */
>> It makes mention of the problem that it is trying to fix, perhaps you
>> mean that this should mention the fact that on an idle system standby
>> snapshots should happen at worst once per checkpoint?
>>
>
> I mean to say that in below part of comment, explanation about the
> the check related to insert position is quite clear whereas why it is
> okay to avoid logging standby snapshot when the segment is not
> forcibly switched is not apparent.
>
> * avoid as well logging a new
> * standby snapshot if the current insert position is still
> * at the beginning of the segment that has just been switched.

Perhaps: "avoid as well logging a new standby snapshot if the current
insert position is at the beginning of a segment that has been
*forcibly* switched at checkpoint or by archive_timeout"?
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

19 January 2016, 07:15:03

On Tue, Jan 19, 2016 at 1:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 18, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> I find this patch rather unsatisfactory. Yes, it kinda solves the
>> problem of archive timeout, but it leaves the bigger and longer standing
>> problems of unneccessary checkpoints with wal_level=hs in place. It's
>> also somewhat fragile in my opinion.

Check.

>> I think we rather want a per backend (or perhaps per wal insertion lock)
>> flag that says 'last relevant record inserted at', and a way to not set
>> that during insertion. Then during a checkpoint or the relevant bgwriter
>> code, we look wether anything relevant happened in any backend since the
>> last time we performed a checkpoint/logged a running xacts snapshot.

And in this case, the last relevant record would be caused by a forced
segment switch or a checkpoint record, right? Doing that per WAL
insertion lock seems more scalable to me. I haven't looked at the code
yet though to see how that would work out.

> Sounds to be a more robust way of dealing with this problem.  Michael,
> if you don't disagree with above proposal, then we can mark this patch
> as Waiting on Author?

Yeah let's do so. I'll think more about this thing.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

19 January 2016, 08:10:29

On Tue, Jan 19, 2016 at 12:41 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> Yes, the thing is that, as mentioned at the beginning of the thread, a
> >> low value of archive_timeout causes a segment to be forcibly switched
> >> at the end of the timeout defined by this parameter. In combination
> >> with the standby snapshot taken in bgwriter since 9.4, this is causing
> >> segments to be always switched even if a system is completely idle.
> >> Before that, in 9.3 and older versions, we would have a segment
> >> forcibly switched on an idle system only once per checkpoint.
> >>
> >
> > Okay, so this will fix the case where the system is idle, but what I
> > am slightly worried is that it should not miss to log the standby snapshot
> > due to this check when there is actually some activity in the system.
> > Why is not possible to have a case such that the segment is forcibly
> > switched due to reason other than checkpoint (user has requested the
> > same) and the current insert LSN is at beginning of a new segment
> > due to write activity? If that is possible, which to me theoretically seems
> > possible, then I think bgwriter will miss to log the standby snapshot.
>
> Yes, with segments forcibly switched by users this check would make
> the bgwriter not log in a snapshot if the last action done by server
> was XLOG_SWITCH. Based on the patch I sent, the only alternative would
> be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine
> enough for back-branches..
>

Yeah, that can work, but I think the code won't look too clean. I think

lets try out something on lines of what is suggested by Andres and

see how it turns out.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

28 January 2016, 07:40:20

On Tue, Jan 19, 2016 at 5:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 19, 2016 at 12:41 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> On Mon, Jan 18, 2016 at 10:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Mon, Jan 18, 2016 at 10:54 AM, Michael Paquier
>> > <michael.paquier@gmail.com> wrote:
>> >> Yes, the thing is that, as mentioned at the beginning of the thread, a
>> >> low value of archive_timeout causes a segment to be forcibly switched
>> >> at the end of the timeout defined by this parameter. In combination
>> >> with the standby snapshot taken in bgwriter since 9.4, this is causing
>> >> segments to be always switched even if a system is completely idle.
>> >> Before that, in 9.3 and older versions, we would have a segment
>> >> forcibly switched on an idle system only once per checkpoint.
>> >>
>> >
>> > Okay, so this will fix the case where the system is idle, but what I
>> > am slightly worried is that it should not miss to log the standby
>> > snapshot
>> > due to this check when there is actually some activity in the system.
>> > Why is not possible to have a case such that the segment is forcibly
>> > switched due to reason other than checkpoint (user has requested the
>> > same) and the current insert LSN is at beginning of a new segment
>> > due to write activity? If that is possible, which to me theoretically
>> > seems
>> > possible, then I think bgwriter will miss to log the standby snapshot.
>>
>> Yes, with segments forcibly switched by users this check would make
>> the bgwriter not log in a snapshot if the last action done by server
>> was XLOG_SWITCH. Based on the patch I sent, the only alternative would
>> be to not forcedSegSwitchLSN in those code paths. Perhaps that's fine
>> enough for back-branches..
>>
>
> Yeah, that can work, but I think the code won't look too clean.  I think
> lets try out something on lines of what is suggested by Andres and
> see how it turns out.

OK, so as a first step and after thinking about the whole for a while,
I have finished with the patch attached. This patch is aimed at
avoiding unnecessary checkpoints on idle systems when wal_level >=
hot_standby by centralizing the check to look at if there has some WAL
activity since the last checkpoint. This way, the system does not
generate anymore standby-related records if no activity has occurred.
I think that this is a good step forward, still it does not address
yet the problem of segments forcibly switched, particularly when
archive_timeout has a low value.

In order to address the other problem with switched segments, I think
that we could extend the routine XLOGHasActivity that the patch
attached introduces with some more fine-grained checks. After looking
at the code I think as well that we had better save into shared memory
the last LSN position that was forcibly switched and make use of that
in the bgwriter.

So, thoughts about the attached?
--
Michael

Attachment

hs-checkpoints-v1.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

29 January 2016, 08:13:35

On 2016-01-28 16:40:13 +0900, Michael Paquier wrote:
> OK, so as a first step and after thinking about the whole for a while,
> I have finished with the patch attached. This patch is aimed at
> avoiding unnecessary checkpoints on idle systems when wal_level >=
> hot_standby by centralizing the check to look at if there has some WAL
> activity since the last checkpoint.

That's not what I suggested.

>  /*
> + * XLOGHasActivity -- Check if XLOG had some significant activity or
> + * if it is idle lately. This is primarily used to check if there has
> + * been some WAL activity since the last checkpoint that occurred on
> + * system to control the generaton of XLOG record related to standbys.
> + */
> +bool
> +XLOGHasActivity(void)
> +{
> +    XLogCtlInsert *Insert = &XLogCtl->Insert;
> +    XLogRecPtr    redo_lsn = ControlFile->checkPointCopy.redo;
> +    uint64        prev_bytepos;
> +
> +    /* Check if any activity has happened since last checkpoint */
> +    SpinLockAcquire(&Insert->insertpos_lck);
> +    prev_bytepos = Insert->PrevBytePos;
> +    SpinLockRelease(&Insert->insertpos_lck);
> +
> +    return XLogBytePosToRecPtr(prev_bytepos) == redo_lsn;
> +}
>

How should this actually should work reliably, given we *want* to have
included a standby snapshot after the last checkpoint?

In CreateCheckPoint() we have/* * Here we update the shared RedoRecPtr for future XLogInsert calls; this * must be done
whileholding all the insertion locks. *RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
 

computing the next redo rec ptr and thenif (!shutdown && XLogStandbyInfoActive())    LogStandbySnapshot();
before the finalXLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));recptr = XLogInsert(RM_XLOG_ID,
       shutdown ? XLOG_CHECKPOINT_SHUTDOWN :                    XLOG_CHECKPOINT_ONLINE);
 

so the above condition doesn't really something we want to rely on. Am I
missing what you're trying to do?

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

29 January 2016, 12:13:15

On Fri, Jan 29, 2016 at 5:13 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-28 16:40:13 +0900, Michael Paquier wrote:
>> OK, so as a first step and after thinking about the whole for a while,
>> I have finished with the patch attached. This patch is aimed at
>> avoiding unnecessary checkpoints on idle systems when wal_level >=
>> hot_standby by centralizing the check to look at if there has some WAL
>> activity since the last checkpoint.
>
> That's not what I suggested.
>
>>  /*
>> + * XLOGHasActivity -- Check if XLOG had some significant activity or
>> + * if it is idle lately. This is primarily used to check if there has
>> + * been some WAL activity since the last checkpoint that occurred on
>> + * system to control the generaton of XLOG record related to standbys.
>> + */
>> +bool
>> +XLOGHasActivity(void)
>> +{
>> +     XLogCtlInsert *Insert = &XLogCtl->Insert;
>> +     XLogRecPtr      redo_lsn = ControlFile->checkPointCopy.redo;
>> +     uint64          prev_bytepos;
>> +
>> +     /* Check if any activity has happened since last checkpoint */
>> +     SpinLockAcquire(&Insert->insertpos_lck);
>> +     prev_bytepos = Insert->PrevBytePos;
>> +     SpinLockRelease(&Insert->insertpos_lck);
>> +
>> +     return XLogBytePosToRecPtr(prev_bytepos) == redo_lsn;
>> +}
>>
>
> How should this actually should work reliably, given we *want* to have
> included a standby snapshot after the last checkpoint?
>
> In CreateCheckPoint() we have
>         /*
>          * Here we update the shared RedoRecPtr for future XLogInsert calls; this
>          * must be done while holding all the insertion locks.
>          *
>         RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
>
> computing the next redo rec ptr and then
>         if (!shutdown && XLogStandbyInfoActive())
>                 LogStandbySnapshot();
> before the final
>         XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
>         recptr = XLogInsert(RM_XLOG_ID,
>                                                 shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
>                                                 XLOG_CHECKPOINT_ONLINE);
>
> so the above condition doesn't really something we want to rely on. Am I
> missing what you're trying to do?

Well, to put it short, I am just trying to find a way to make the
backend skip unnecessary checkpoints on an idle system, which results
in the following WAL pattern if system is completely idle:
CHECKPOINT_ONLINE
RUNNING_XACTS
RUNNING_XACTS
[etc..]

The thing is that I am lost with the meaning of this condition to
decide if a checkpoint should be skipped or not:       if (prevPtr == ControlFile->checkPointCopy.redo &&
prevPtr/ XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)       {           WALInsertLockRelease();
LWLockRelease(CheckpointLock);          return;       }

As at least one standby snapshot is logged before the checkpoint
record, the redo position is never going to match the previous insert
LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
Skipping such unnecessary checkpoints is what you would like to
address, no? Because that's what I would like to do here first. And
once we got that right, we could think about addressing the case where
WAL segments are forcibly archived for idle systems.

Perhaps I simply do not grab the meaning of what you defined as
"relevant LSN" upthread. As I understand it, it would be a LSN
position marker in shared memory that we could use for some
decision-making regarding if a checkpoint or a standby snapshot record
should be generated. I have for example played with an array in
XLogCtl->Insert made of NUM_XLOGINSERT_LOCKS elements, each one
protected by one insert lock that tracked the last insert LSN of a
checkpoint record (I did that in XLogInsertRecord because XLogRecData
makes that easy), then I used that to do this decision making in
CreateCheckpoint() and in the bgwriter to decide if a standby snapshot
should be logged or not.
But then I noticed that it would be actually easier and cleaner to do
this decision making using directly the checkpoint redo LSN and
compare that with the previous insert LSN, then use that for the
bgwriter code path, resulting in the WIP I just sent previously.

Is what I am trying to do here biased with what you have in mind? Do
we have a different meaning of the problem in mind? What are your
ideas regarding this relevant LSN?
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

30 January 2016, 14:08:41

On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
> Well, to put it short, I am just trying to find a way to make the
> backend skip unnecessary checkpoints on an idle system, which results
> in the following WAL pattern if system is completely idle:
> CHECKPOINT_ONLINE
> RUNNING_XACTS
> RUNNING_XACTS
> [etc..]
>
> The thing is that I am lost with the meaning of this condition to
> decide if a checkpoint should be skipped or not:
>         if (prevPtr == ControlFile->checkPointCopy.redo &&
>             prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>         {
>             WALInsertLockRelease();
>             LWLockRelease(CheckpointLock);
>             return;
>         }
> As at least one standby snapshot is logged before the checkpoint
> record, the redo position is never going to match the previous insert
> LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
> Skipping such unnecessary checkpoints is what you would like to
> address, no? Because that's what I would like to do here first. And
> once we got that right, we could think about addressing the case where
> WAL segments are forcibly archived for idle systems.

I have put a bit more of brain power into that, and finished with the
patch attached. A new field called chkpProgressLSN is attached to
XLogCtl, which is to the current insert position of the checkpoint
when wal_level <= archive, or the LSN position of the standby snapshot
taken after a checkpoint. The bgwriter code is modified as well so as
it uses this progress LSN and compares it with the current insert LSN
to determine if a standby snapshot should be logged or not. On an
instance of Postgres completely idle, no useless checkpoints, and no
useless standby snapshots are generated anymore.
--
Michael

Attachment

hs-checkpoints-v2.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

31 January 2016, 12:29:53

On Sat, Jan 30, 2016 at 11:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
>> Well, to put it short, I am just trying to find a way to make the
>> backend skip unnecessary checkpoints on an idle system, which results
>> in the following WAL pattern if system is completely idle:
>> CHECKPOINT_ONLINE
>> RUNNING_XACTS
>> RUNNING_XACTS
>> [etc..]
>>
>> The thing is that I am lost with the meaning of this condition to
>> decide if a checkpoint should be skipped or not:
>>         if (prevPtr == ControlFile->checkPointCopy.redo &&
>>             prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>>         {
>>             WALInsertLockRelease();
>>             LWLockRelease(CheckpointLock);
>>             return;
>>         }
>> As at least one standby snapshot is logged before the checkpoint
>> record, the redo position is never going to match the previous insert
>> LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
>> Skipping such unnecessary checkpoints is what you would like to
>> address, no? Because that's what I would like to do here first. And
>> once we got that right, we could think about addressing the case where
>> WAL segments are forcibly archived for idle systems.
>
> I have put a bit more of brain power into that, and finished with the
> patch attached. A new field called chkpProgressLSN is attached to
> XLogCtl, which is to the current insert position of the checkpoint
> when wal_level <= archive, or the LSN position of the standby snapshot
> taken after a checkpoint. The bgwriter code is modified as well so as
> it uses this progress LSN and compares it with the current insert LSN
> to determine if a standby snapshot should be logged or not. On an
> instance of Postgres completely idle, no useless checkpoints, and no
> useless standby snapshots are generated anymore.

Attached is an additional patch that I have used for my tests (should
have sent that yesterday). This just shows up a couple of logs in the
bgwriter and around checkpoint to see in more details their activity
with not that much noise.
--
Michael

Attachment

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

02 February 2016, 04:42:30

On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
> > Well, to put it short, I am just trying to find a way to make the
> > backend skip unnecessary checkpoints on an idle system, which results
> > in the following WAL pattern if system is completely idle:
> > CHECKPOINT_ONLINE
> > RUNNING_XACTS
> > RUNNING_XACTS
> > [etc..]
> >
> > The thing is that I am lost with the meaning of this condition to
> > decide if a checkpoint should be skipped or not:
> > if (prevPtr == ControlFile->checkPointCopy.redo &&
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> > WALInsertLockRelease();
> > LWLockRelease(CheckpointLock);
> > return;
> > }
> > As at least one standby snapshot is logged before the checkpoint
> > record, the redo position is never going to match the previous insert
> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
> > Skipping such unnecessary checkpoints is what you would like to
> > address, no? Because that's what I would like to do here first. And
> > once we got that right, we could think about addressing the case where
> > WAL segments are forcibly archived for idle systems.
>
> I have put a bit more of brain power into that, and finished with the
> patch attached. A new field called chkpProgressLSN is attached to
> XLogCtl, which is to the current insert position of the checkpoint
> when wal_level <= archive, or the LSN position of the standby snapshot
> taken after a checkpoint. The bgwriter code is modified as well so as
> it uses this progress LSN and compares it with the current insert LSN
> to determine if a standby snapshot should be logged or not. On an
> instance of Postgres completely idle, no useless checkpoints, and no
> useless standby snapshots are generated anymore.
>

@@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)

if ((flags & (CHECKPOINT_IS_SHUTDOWN |

CHECKPOINT_END_OF_RECOVERY |

CHECKPOINT_FORCE)) == 0)

{

- if

(prevPtr == ControlFile->checkPointCopy.redo &&

+ if (GetProgressRecPtr() == prevPtr &&

prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)

{

I think such a check won't consider all the WAL-activity happened

during checkpoint (after checkpoint start, till it finishes) which was

not the intention of this code without your patch.

I think both this and previous patch (hs-checkpoints-v1 ) approach

won't fix the issue in all kind of scenario's.

Let me try to explain what I think should fix this issue based on

discussion above, suggestions by Andres and some of my own

thoughts:

Have a new variable in WALInsertLock that would indicate the last

insertion position (last_insert_pos) we write to after holding that lock.

Ensure that we don't update last_insert_pos while logging standby

snapshot (simple way is to pass a flag in XLogInsert or distinguish

it based on type of record (XLOG_RUNNING_XACTS) or if you can

think of a more smarter way). Now both during checkpoint and

in bgwriter, to find out whether there is any activity since last

time, we need to check all the WALInsertLocks for latest insert position

(by referring last_insert_pos) and compare it with the latest position

we have noted during last such check and decide whether to proceed

or not based on comparison result. If you think it is not adequate to

store last_insert_pos in WALInsertLock, then we can think of having

it in PGPROC.

Yet another idea that occurs to me this morning is that why not

have a variable in shared memory in XLogCtlInsert or XLogCtl

similar to CurrBytePos/PrevBytePos which will be updated on

each XLOGInsert apart from the XLOGInsert for standby snapshots

and use that in a patch somewhat close to what you have in

hs-checkpoints-v1.

One related point is don't we need to bother about activity on

unlogged relations for checkpoints to occur, considering the

case when the only activity happened after last checkpoint

is on unlogged relations?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

03 February 2016, 14:19:44

On 2016-02-02 10:12:25 +0530, Amit Kapila wrote:
> @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
>   if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> CHECKPOINT_END_OF_RECOVERY |
>    CHECKPOINT_FORCE)) == 0)
>   {
> - if
> (prevPtr == ControlFile->checkPointCopy.redo &&
> + if (GetProgressRecPtr() == prevPtr &&
> 
> prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>   {
> 
> I think such a check won't consider all the WAL-activity happened
> during checkpoint (after checkpoint start, till it finishes) which was
> not the intention of this code without your patch.

Precisely.

> I think both this and previous patch (hs-checkpoints-v1 ) approach
> won't fix the issue in all kind of scenario's.

Agreed.

> Let me try to explain what I think should fix this issue based on
> discussion above, suggestions by Andres and some of my own
> thoughts:

> Have a new variable in WALInsertLock that would indicate the last
> insertion position (last_insert_pos) we write to after holding that lock.
> Ensure that we don't update last_insert_pos while logging standby
> snapshot (simple way is to pass a flag in XLogInsert or distinguish
> it based on type of record (XLOG_RUNNING_XACTS) or if you can
> think of a more smarter way).  Now both during checkpoint and
> in bgwriter, to find out whether there is any activity since last
> time, we need to check all the WALInsertLocks for latest insert position
> (by referring last_insert_pos) and compare it with the latest position
> we have noted during last such check and decide whether to proceed
> or not based on comparison result.  If you think it is not adequate to
> store last_insert_pos in WALInsertLock, then we can think of having
> it in PGPROC.

Yes, that's pretty much what I was thinking of.

> Yet another idea that occurs to me this morning is that why not
> have a variable in shared memory in XLogCtlInsert or XLogCtl
> similar to CurrBytePos/PrevBytePos which will be updated on
> each XLOGInsert apart from the XLOGInsert for standby snapshots
> and use that in a patch somewhat close to what you have in
> hs-checkpoints-v1.

That'll require holding locks longer...

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

03 February 2016, 14:28:23

On Tue, Feb 2, 2016 at 1:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
>> > Well, to put it short, I am just trying to find a way to make the
>> > backend skip unnecessary checkpoints on an idle system, which results
>> > in the following WAL pattern if system is completely idle:
>> > CHECKPOINT_ONLINE
>> > RUNNING_XACTS
>> > RUNNING_XACTS
>> > [etc..]
>> >
>> > The thing is that I am lost with the meaning of this condition to
>> > decide if a checkpoint should be skipped or not:
>> >         if (prevPtr == ControlFile->checkPointCopy.redo &&
>> >             prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>> >         {
>> >             WALInsertLockRelease();
>> >             LWLockRelease(CheckpointLock);
>> >             return;
>> >         }
>> > As at least one standby snapshot is logged before the checkpoint
>> > record, the redo position is never going to match the previous insert
>> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
>> > Skipping such unnecessary checkpoints is what you would like to
>> > address, no? Because that's what I would like to do here first. And
>> > once we got that right, we could think about addressing the case where
>> > WAL segments are forcibly archived for idle systems.
>>
>> I have put a bit more of brain power into that, and finished with the
>> patch attached. A new field called chkpProgressLSN is attached to
>> XLogCtl, which is to the current insert position of the checkpoint
>> when wal_level <= archive, or the LSN position of the standby snapshot
>> taken after a checkpoint. The bgwriter code is modified as well so as
>> it uses this progress LSN and compares it with the current insert LSN
>> to determine if a standby snapshot should be logged or not. On an
>> instance of Postgres completely idle, no useless checkpoints, and no
>> useless standby snapshots are generated anymore.
>>
>
>
> @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
>   if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> CHECKPOINT_END_OF_RECOVERY |
>    CHECKPOINT_FORCE)) == 0)
>   {
> - if
> (prevPtr == ControlFile->checkPointCopy.redo &&
> + if (GetProgressRecPtr() == prevPtr &&
>
> prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>   {
>
> I think such a check won't consider all the WAL-activity happened
> during checkpoint (after checkpoint start, till it finishes) which was
> not the intention of this code without your patch.

Yes, that's true, in v2 this progress LSN can be updated in two ways:
- At the position where checkpoint has begun
- At the position where standby snapshot was taken after a checkpoint has run.
Honestly, even if we log the snapshot for example before
CheckpointGuts(), that's not going to make it... The main issue here
is that there will be a snapshot after this checkpoint, so the
existing condition is a rather broken concept already when wal_level
>= hot_standby because the redo pointer is never going to match the
previous LSN pointer. Another idea would be to ensure that the
snapshot is logged just after the redo pointer, this would be
reliable.

> I think both this and previous patch (hs-checkpoints-v1) approach
> won't fix the issue in all kind of scenario's.
>
> Let me try to explain what I think should fix this issue based on
> discussion above, suggestions by Andres and some of my own
> thoughts:
>
> Have a new variable in WALInsertLock that would indicate the last
> insertion position (last_insert_pos) we write to after holding that lock.
> Ensure that we don't update last_insert_pos while logging standby
> snapshot (simple way is to pass a flag in XLogInsert or distinguish
> it based on type of record (XLOG_RUNNING_XACTS) or if you can
> think of a more smarter way).

Simplest way is in XLogInsertRecord, the record data is available
there, there is for example some logic related to segment switches.

> Now both during checkpoint and
> in bgwriter, to find out whether there is any activity since last
> time, we need to check all the WALInsertLocks for latest insert position
> (by referring last_insert_pos) and compare it with the latest position
> we have noted during last such check and decide whether to proceed
> or not based on comparison result.  If you think it is not adequate to
> store last_insert_pos in WALInsertLock, then we can think of having
> it in PGPROC.

So the progress check is used when deciding if a checkpoint should be
skipped or not, and when deciding if a standby snapshot should be
taken by the bgwriter? When bgwriter logs a snapshot, it will update
as well the last LSN found (which would be a single variable in for
example XLogCtlData, updated from the data taken from the WAL insert
slots). But there is a problem here: if there is no activity since the
last bgwriter snapshot, the next checkpoint would be skipped as well.
It seems to me that this is not acceptable, a checkpoint generation
would be decided based on if there has been some data activity since
the last snapshot taken, or to put it in other words, no checkpoints
would happen if there has been no activity within 15s after the last
standby snapshot has been logged by the bgwriter.

I have hacked something according to those lines, please see attached
(logs are just here for testing purposes). I may be missing something
obvious regarding the tracking of the last progress position. Code
speaks better than words here I think, so feel free to look at it.

> Yet another idea that occurs to me this morning is that why not
> have a variable in shared memory in XLogCtlInsert or XLogCtl
> similar to CurrBytePos/PrevBytePos which will be updated on
> each XLOGInsert apart from the XLOGInsert for standby snapshots
> and use that in a patch somewhat close to what you have in
> hs-checkpoints-v1.

I have considered that as well, but I do not think it is a good idea
to add more contention in ReserveXLogInsertLocation() while taking
insertpos_lck lock. That's already really hot, and we surely don't
want to make it hotter just to do checks on idle systems. If that's
what you had in mind. Btw, this is basically what I did in the
attached, no? Except that the current insert position is tracked
differently.

>From v2 I sent upthread:
-               if (prevPtr == ControlFile->checkPointCopy.redo &&
+               if (GetProgressRecPtr() == prevPtr &&
                        prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
This is wrong actually. prevPtr should be replaced by the redo LSN.
--
Michael

Attachment

hs-checkpoints-v3.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

04 February 2016, 12:51:45

On Wed, Feb 3, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-02-02 10:12:25 +0530, Amit Kapila wrote:
> > @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
> > if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> > CHECKPOINT_END_OF_RECOVERY |
> > CHECKPOINT_FORCE)) == 0)
> > {
> > - if
> > (prevPtr == ControlFile->checkPointCopy.redo &&
> > + if (GetProgressRecPtr() == prevPtr &&
> >
> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> > {
> >
> > I think such a check won't consider all the WAL-activity happened
> > during checkpoint (after checkpoint start, till it finishes) which was
> > not the intention of this code without your patch.
>
> Precisely.
>
> > I think both this and previous patch (hs-checkpoints-v1 ) approach
> > won't fix the issue in all kind of scenario's.
>
> Agreed.
>
> > Let me try to explain what I think should fix this issue based on
> > discussion above, suggestions by Andres and some of my own
> > thoughts:
>
> > Have a new variable in WALInsertLock that would indicate the last
> > insertion position (last_insert_pos) we write to after holding that lock.
> > Ensure that we don't update last_insert_pos while logging standby
> > snapshot (simple way is to pass a flag in XLogInsert or distinguish
> > it based on type of record (XLOG_RUNNING_XACTS) or if you can
> > think of a more smarter way). Now both during checkpoint and
> > in bgwriter, to find out whether there is any activity since last
> > time, we need to check all the WALInsertLocks for latest insert position
> > (by referring last_insert_pos) and compare it with the latest position
> > we have noted during last such check and decide whether to proceed
> > or not based on comparison result. If you think it is not adequate to
> > store last_insert_pos in WALInsertLock, then we can think of having
> > it in PGPROC.
>
> Yes, that's pretty much what I was thinking of.
>

I think generally it is good idea, but one thing worth a thought is that

by doing so, we need to acquire all WAL Insertion locks every

LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for

every slot, do you think it is matter of concern in any way for write

workloads or it won't effect as we need to do this periodically?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

04 February 2016, 13:10:15

On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> I think generally it is good idea, but one thing worth a thought is that
> by doing so, we need to acquire all WAL Insertion locks every
> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> every slot, do you think it is matter of concern in any way for write
> workloads or it won't effect as we need to do this periodically?

Michael and I just had an in-person discussion, and one of the topics
was that. The plan was basically to adapt the patch to:
1) Store the progress lsn inside the wal insert lock
2) Change the HasActivity API to return an the last LSN at which there  was activity, instead of a boolean.
2) Individually acquire each insert locks's lwlock to get it's progress  LSN, but not the exclusive insert lock. We
needthe lwllock to avoid  a torn 8byte read on some platforms.

I think that mostly should address your concerns?

Andres

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

04 February 2016, 13:19:22

On Thu, Feb 4, 2016 at 6:40 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
> > I think generally it is good idea, but one thing worth a thought is that
> > by doing so, we need to acquire all WAL Insertion locks every
> > LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
> > every slot, do you think it is matter of concern in any way for write
> > workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
> was activity, instead of a boolean.
> 2) Individually acquire each insert locks's lwlock to get it's progress
> LSN, but not the exclusive insert lock. We need the lwllock to avoid
> a torn 8byte read on some platforms.
>
> I think that mostly should address your concerns?
>

Yes, this sounds better and in-anycase we can do some benchmarks

to verify the same once patch is in shape.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

04 February 2016, 15:38:33

On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
>> I think generally it is good idea, but one thing worth a thought is that
>> by doing so, we need to acquire all WAL Insertion locks every
>> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
>> every slot, do you think it is matter of concern in any way for write
>> workloads or it won't effect as we need to do this periodically?
>
> Michael and I just had an in-person discussion, and one of the topics
> was that. The plan was basically to adapt the patch to:
> 1) Store the progress lsn inside the wal insert lock
> 2) Change the HasActivity API to return an the last LSN at which there
>    was activity, instead of a boolean.
> 3) Individually acquire each insert locks's lwlock to get it's progress
>    LSN, but not the exclusive insert lock. We need the lwllock to avoid
>    a torn 8byte read on some platforms.

4) Switch the condition to decide if a checkpoint should be skipped
using the last activity position compared with ProcLastRecPtr in
CreateCheckpoint to see if any activity has occurred since the
checkpoint record was inserted, and do not care anymore if the
previous record and current record are on different segments. This
would basically work.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

06 February 2016, 13:03:21

On Thu, Feb 4, 2016 at 6:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
>>> I think generally it is good idea, but one thing worth a thought is that
>>> by doing so, we need to acquire all WAL Insertion locks every
>>> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
>>> every slot, do you think it is matter of concern in any way for write
>>> workloads or it won't effect as we need to do this periodically?
>>
>> Michael and I just had an in-person discussion, and one of the topics
>> was that. The plan was basically to adapt the patch to:
>> 1) Store the progress lsn inside the wal insert lock
>> 2) Change the HasActivity API to return an the last LSN at which there
>>    was activity, instead of a boolean.
>> 3) Individually acquire each insert locks's lwlock to get it's progress
>>    LSN, but not the exclusive insert lock. We need the lwllock to avoid
>>    a torn 8byte read on some platforms.
>
> 4) Switch the condition to decide if a checkpoint should be skipped
> using the last activity position compared with ProcLastRecPtr in
> CreateCheckpoint to see if any activity has occurred since the
> checkpoint record was inserted, and do not care anymore if the
> previous record and current record are on different segments. This
> would basically work.

OK, attached is a patch that I believe addresses those issues. The
patch still has a couple of LOG entries that we had better remove in
the version that gets pushed, but they have proved to be useful for me
when testing the patch with a low checkpoint_timeout value to see if
checkpoints are properly skipped on an idle system. I found myself
adding another routine called GetLastCheckpointRecPtr() for the
bgwriter because ControlFile is not declared externally even if it is
in shared memory. I think that's better this way.

The original bug report referred to a low archive_timeout causing
standby snapshots to be logged when segments are switched. So we may
want at the end to not update the progress LSN for segment switch
records as well, but I have let that out of the patch for the time
being to address the primary concern of unnecessary checkpoints for
wal_level >= hs. We could address it with a later patch (planning to
do so), let's keep a step-by-step approach.

The patch attached will apply on master, on 9.5 there is one minor
conflict. For older versions we will need another reworked patch. I am
fine to produce those once we are fine with the shape of what gets
into master and 9.5. 9.4 uses WAL insert locks so things are rather
similar. For ~9.3, I think that we are going to need a single variable
in XLogCtl or similar to track the progress, but I have not looked
into that in details yet.
--
Michael

Attachment

hs-checkpoints-v4.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

06 February 2016, 17:49:37

On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
> The patch attached will apply on master, on 9.5 there is one minor
> conflict. For older versions we will need another reworked patch.

FWIW, I don't think we should backpatch this. It'd look noticeably
different in the back branches, and this isn't really a critical
issue. I think it makes sense to see this as an optimization.

> +    /*
> +     * Update the progress LSN positions. At least one WAL insertion lock
> +     * is already taken appropriately before doing that, and it is just more
> +     * simple to do that here where WAL record data and type is at hand.
> +     * The progress is set at the start position of the record tracked that
> +     * is being added, making easier checkpoint progress tracking as the
> +     * control file already saves the start LSN position of the last
> +     * checkpoint run.
> +     */
> +    if (!isStandbySnapshot)
> +    {

I don't like isStandbySnapshot much, it seems better to do this more
generally, via a flag passed down by the inserter.

> +        if (holdingAllLocks)
> +        {
> +            int i;
> +
> +            for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> +                WALInsertLocks[i].l.progressAt = StartPos;

Why update all?

>  /*
> + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
> + * at the last significant WAL activity, or in other words any activity
> + * not referring to standby logging as of now. Finding the last activity
> + * position is done by scanning each WAL insertion lock by taking directly
> + * the light-weight lock associated to it.
> + */
> +XLogRecPtr
> +GetProgressRecPtr(void)
> +{
> +    XLogRecPtr    res = InvalidXLogRecPtr;
> +    int            i;
> +
> +    /*
> +     * Look at the latest LSN position referring to the activity done by
> +     * WAL insertion.
> +     */
> +    for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> +    {
> +        XLogRecPtr    progress_lsn;
> +
> +        LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
> +        progress_lsn = WALInsertLocks[i].l.progressAt;
> +        LWLockRelease(&WALInsertLocks[i].l.lock);

Add a comment explaining that we a) need a lock because of the potential
for "torn reads" on some platforms. b) need an exclusive one, because
the insert lock logic currently only expects exclusive locks.

>      /*
> +     * Fetch the progress position before taking any WAL insert lock. This
> +     * is normally an operation that does not take long, but leaving this
> +     * lookup out of the section taken an exclusive lock saves a couple
> +     * of instructions.
> +     */
> +    progress_lsn = GetProgressRecPtr();

too long for my taste. How about:
/* get progress, before acuiring insert locks to shorten locked section */


Looks much better now.

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

07 February 2016, 00:19:14

On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
>> +     /*
>> +      * Update the progress LSN positions. At least one WAL insertion lock
>> +      * is already taken appropriately before doing that, and it is just more
>> +      * simple to do that here where WAL record data and type is at hand.
>> +      * The progress is set at the start position of the record tracked that
>> +      * is being added, making easier checkpoint progress tracking as the
>> +      * control file already saves the start LSN position of the last
>> +      * checkpoint run.
>> +      */
>> +     if (!isStandbySnapshot)
>> +     {
>
> I don't like isStandbySnapshot much, it seems better to do this more
> generally, via a flag passed down by the inserter.

Instead of updating every single call of XLogInsert() in the system,
what do you think about introducing a new routine XLogInsertExtended()
that would have this optional flag? This would wrap the existing
XLogInsert() and pass the flag to XLogInsertRecord().
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

08 February 2016, 06:58:53

On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
>> The patch attached will apply on master, on 9.5 there is one minor
>> conflict. For older versions we will need another reworked patch.
>
> FWIW, I don't think we should backpatch this. It'd look noticeably
> different in the back branches, and this isn't really a critical
> issue. I think it makes sense to see this as an optimization.

The original bug report of this thread is based on 9.4, which is the
first version where the standby snapshot in the bgwriter has been
introduced. Knowing that most of the systems in the world are actually
let idle. If those are running Postgres, I'd like to think that it is
a waste. Now it is true that this is not a critical issue.

>> +     /*
>> +      * Update the progress LSN positions. At least one WAL insertion lock
>> +      * is already taken appropriately before doing that, and it is just more
>> +      * simple to do that here where WAL record data and type is at hand.
>> +      * The progress is set at the start position of the record tracked that
>> +      * is being added, making easier checkpoint progress tracking as the
>> +      * control file already saves the start LSN position of the last
>> +      * checkpoint run.
>> +      */
>> +     if (!isStandbySnapshot)
>> +     {
>
> I don't like isStandbySnapshot much, it seems better to do this more
> generally, via a flag passed down by the inserter.

Hm, OK. I think that the best way to deal with that is to introduce
XLogInsertExtended which wraps the existing XLogInsert(). By default
the flag is true. This will reduce the footprint of this patch on the
code base and will ease future backpatches on those code paths at
least down to 9.5 where WAL refactoring has been introduced.

>> +             if (holdingAllLocks)
>> +             {
>> +                     int i;
>> +
>> +                     for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
>> +                             WALInsertLocks[i].l.progressAt = StartPos;
>
> Why update all?

For consistency. I agree that updating one, say the first one is
enough. I have added a comment explaining that as well.

>>  /*
>> + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
>> + * at the last significant WAL activity, or in other words any activity
>> + * not referring to standby logging as of now. Finding the last activity
>> + * position is done by scanning each WAL insertion lock by taking directly
>> + * the light-weight lock associated to it.
>> + */
>> +XLogRecPtr
>> +GetProgressRecPtr(void)
>> +{
>> +     XLogRecPtr      res = InvalidXLogRecPtr;
>> +     int                     i;
>> +
>> +     /*
>> +      * Look at the latest LSN position referring to the activity done by
>> +      * WAL insertion.
>> +      */
>> +     for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
>> +     {
>> +             XLogRecPtr      progress_lsn;
>> +
>> +             LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
>> +             progress_lsn = WALInsertLocks[i].l.progressAt;
>> +             LWLockRelease(&WALInsertLocks[i].l.lock);
>
> Add a comment explaining that we a) need a lock because of the potential
> for "torn reads" on some platforms. b) need an exclusive one, because
> the insert lock logic currently only expects exclusive locks.

Check.

>>       /*
>> +      * Fetch the progress position before taking any WAL insert lock. This
>> +      * is normally an operation that does not take long, but leaving this
>> +      * lookup out of the section taken an exclusive lock saves a couple
>> +      * of instructions.
>> +      */
>> +     progress_lsn = GetProgressRecPtr();
>
> too long for my taste. How about:
> /* get progress, before acuiring insert locks to shorten locked section */

Check.

> Looks much better now.

Thanks. Let's wrap that! An updated patch is attached.
--
Michael

Attachment

hs-checkpoints-v5.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

08 February 2016, 09:18:57

On 2016-02-08 15:58:49 +0900, Michael Paquier wrote:
> On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
> >> The patch attached will apply on master, on 9.5 there is one minor
> >> conflict. For older versions we will need another reworked patch.
> >
> > FWIW, I don't think we should backpatch this. It'd look noticeably
> > different in the back branches, and this isn't really a critical
> > issue. I think it makes sense to see this as an optimization.
> 
> The original bug report of this thread is based on 9.4, which is the
> first version where the standby snapshot in the bgwriter has been
> introduced. Knowing that most of the systems in the world are actually
> let idle. If those are running Postgres, I'd like to think that it is
> a waste. Now it is true that this is not a critical issue.

Unconvinced. For one, the equivalent behaviour for checkpoints has
existed since at least 9.0. Secondly, the problem only really appears if
you use archiving, configure a nondefault archive timeout, and that
timeout is significantly smaller than checkpoint timeout. And then the
cost is partially filled segment every now and then.  That just doesn't
stand into relation into adding some nontrivial code into backbranches.

> >> +             if (holdingAllLocks)
> >> +             {
> >> +                     int i;
> >> +
> >> +                     for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> >> +                             WALInsertLocks[i].l.progressAt = StartPos;
> >
> > Why update all?
> 
> For consistency. I agree that updating one, say the first one is
> enough. I have added a comment explaining that as well.

We don't set valid values in WALInsertLockAcquireExclusive for all locks
either. I don't see consistency to what this is going to buy us.

>  /*
> + * XLogInsert
> + *
> + * A shorthand for XLogInsertExtended, to update the progress of WAL
> + * activity by default.
> + */
> +XLogRecPtr
> +XLogInsert(RmgrId rmid, uint8 info)
> +{
> +    return XLogInsertExtended(rmid, info, true);
> +}
> +
> +/*
> + * XLogInsertExtended

I'm not really a fan. I'd rather change existing callers to add a
'flags' bitmask argument. Right now there can't really be XLogInserts()
in extension code, so that's pretty ok to change.

Greetings,

Andres Freund

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

08 February 2016, 11:47:42

On Mon, Feb 8, 2016 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-08 15:58:49 +0900, Michael Paquier wrote:
>> On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund <andres@anarazel.de> wrote:
>>  /*
>> + * XLogInsert
>> + *
>> + * A shorthand for XLogInsertExtended, to update the progress of WAL
>> + * activity by default.
>> + */
>> +XLogRecPtr
>> +XLogInsert(RmgrId rmid, uint8 info)
>> +{
>> +     return XLogInsertExtended(rmid, info, true);
>> +}
>> +
>> +/*
>> + * XLogInsertExtended
>
> I'm not really a fan. I'd rather change existing callers to add a
> 'flags' bitmask argument. Right now there can't really be XLogInserts()
> in extension code, so that's pretty ok to change.

So, you mean to extend directly XLogInsert() with an int flag that has
only one possible value for now, and update *all* the calls of
XLogInsert(). Well, I am detecting 218 calls of XLogInsert() in the
code, so that's not a small change. Honestly, I'd really rather have
the Extended() routine with default taking 0 for the integer flag, 0
meaning in the case of the progress that it gets updated. We have
similar routines for RangeVar, some in bufmgr.c and some other places.
So, which way do you want to take? I don't expect to win this argument
if that's a 1vs1 against a committer, just I would like to be sure
that I am not doing useless things. Time matters.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

08 February 2016, 14:24:40

On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>
> >> /*
> >> + * Fetch the progress position before taking any WAL insert lock. This
> >> + * is normally an operation that does not take long, but leaving this
> >> + * lookup out of the section taken an exclusive lock saves a couple
> >> + * of instructions.
> >> + */
> >> + progress_lsn = GetProgressRecPtr();
> >
> > too long for my taste. How about:
> > /* get progress, before acuiring insert locks to shorten locked section */
>
> Check.
>

What is the need of holding locks one-by-one during checkpoint when

we anyway are going to take lock on all the insertion slots.

+ * to not rely on taking an exclusive lock an all the WAL insertion locks,

/an all/on all

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

09 February 2016, 00:07:39

On Mon, Feb 8, 2016 at 11:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>>
>> >>       /*
>> >> +      * Fetch the progress position before taking any WAL insert lock.
>> >> This
>> >> +      * is normally an operation that does not take long, but leaving
>> >> this
>> >> +      * lookup out of the section taken an exclusive lock saves a
>> >> couple
>> >> +      * of instructions.
>> >> +      */
>> >> +     progress_lsn = GetProgressRecPtr();
>> >
>> > too long for my taste. How about:
>> > /* get progress, before acuiring insert locks to shorten locked section
>> > */
>>
>> Check.
>>
>
> What is the need of holding locks one-by-one during checkpoint when
> we anyway are going to take lock on all the insertion slots.

A couple of records can slip in while scanning the progress LSN
through all the locks.

> + * to not rely on taking an exclusive lock an all the WAL insertion locks,
>
> /an all/on all

Nice catch.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

09 February 2016, 05:27:05

On Tue, Feb 9, 2016 at 5:37 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Feb 8, 2016 at 11:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Feb 8, 2016 at 12:28 PM, Michael Paquier <michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >> >> /*
> >> >> + * Fetch the progress position before taking any WAL insert lock.
> >> >> This
> >> >> + * is normally an operation that does not take long, but leaving
> >> >> this
> >> >> + * lookup out of the section taken an exclusive lock saves a
> >> >> couple
> >> >> + * of instructions.
> >> >> + */
> >> >> + progress_lsn = GetProgressRecPtr();
> >> >
> >> > too long for my taste. How about:
> >> > /* get progress, before acuiring insert locks to shorten locked section
> >> > */
> >>
> >> Check.
> >>
> >
> > What is the need of holding locks one-by-one during checkpoint when
> > we anyway are going to take lock on all the insertion slots.
>
> A couple of records can slip in while scanning the progress LSN
> through all the locks.
>

Do you see any benefit in allowing checkpoints for such cases considering

bgwriter will anyway take care of logging standby snapshot for such

cases?

I don't think there is any reasonable benefit by improving the situation of

idle-system check for checkpoint other than just including

standbysnapshot WAL record. OTOH as checkpoint is not so seldom

operation, so allowing such a change should be okay, but I don't see

the need for same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

09 February 2016, 12:38:11

On Tue, Feb 9, 2016 at 2:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do you see any benefit in allowing checkpoints for such cases considering
> bgwriter will anyway take care of logging standby snapshot for such
> cases?

Well, the idea is to improve the system responsiveness. Imagine that
the call to GetProgressRecPtr() is done within the exclusive lock
portion, but that while scanning the WAL insert lock entries another
backend is waiting to take a lock on them, and this backend is trying
to insert the first record that updates the progress LSN since the
last checkpoint. In this case the checkpoint would be skipped.
However, imagine that such a record is able to get through it while
scanning the progress values in the WAL insert locks, in which case a
checkpoint would be generated. In any case, I think that we should try
to get exclusive lock areas as small as possible if we have ways to do
so.

> I don't think there is any reasonable benefit to improve the situation of
> idle-system check for checkpoint other than just including
> standby snapshot WAL record.  OTOH as checkpoint is not so seldom
> operation, so allowing such a change should be okay, but I don't see
> the need for same.

I am not sure I understand your point here...
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

09 February 2016, 13:42:58

On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Feb 9, 2016 at 2:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do you see any benefit in allowing checkpoints for such cases considering
> > bgwriter will anyway take care of logging standby snapshot for such
> > cases?
>
> Well, the idea is to improve the system responsiveness. Imagine that
> the call to GetProgressRecPtr() is done within the exclusive lock
> portion, but that while scanning the WAL insert lock entries another
> backend is waiting to take a lock on them, and this backend is trying
> to insert the first record that updates the progress LSN since the
> last checkpoint. In this case the checkpoint would be skipped.
> However, imagine that such a record is able to get through it while
> scanning the progress values in the WAL insert locks, in which case a
> checkpoint would be generated.

Such case was not covered without your patch and I don't see the

need of same especially at the cost of taking locks.

> In any case, I think that we should try
> to get exclusive lock areas as small as possible if we have ways to do
> so.
>

Sure, but not when you are already going to take lock for longer

duration.

- last_snapshot_lsn != GetXLogInsertRecPtr())

GetLastCheckpointRecPtr() < GetProgressRecPtr())

How the above check in patch suppose to work?

I think it could so happen once bgwriter logs snapshot, both checkpoint

and progressAt won't get updated any further and I think this check will

allow to log snapshots in such a case as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

10 February 2016, 01:49:34

On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
> On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
>> Well, the idea is to improve the system responsiveness. Imagine that
>> the call to GetProgressRecPtr() is done within the exclusive lock
>> portion, but that while scanning the WAL insert lock entries another
>> backend is waiting to take a lock on them, and this backend is trying
>> to insert the first record that updates the progress LSN since the
>> last checkpoint. In this case the checkpoint would be skipped.
>> However, imagine that such a record is able to get through it while
>> scanning the progress values in the WAL insert locks, in which case a
>> checkpoint would be generated.
>
> Such case was not covered without your patch and I don't see the
> need of same especially at the cost of taking locks.

In this code path that's quite cheap, and it clearly improves the
decision-making when wal_level >= hs which is now rather broken (say
not optimized much).

>>  In any case, I think that we should try
>> to get exclusive lock areas as small as possible if we have ways to do
>> so.
>>
>
> Sure, but not when you are already going to take lock for longer
> duration.

Why would an exclusive lock be taken for a longer duration in the
checkpoint portion? Do you mean that the global time to take exclusive
locks gets increased? For each individual WAL insert lock, yes that's
the case, but that's still far cheaper to have this evaluation logic
here than in ReserveXLogInsertLocation().

> - last_snapshot_lsn != GetXLogInsertRecPtr())
> +
> GetLastCheckpointRecPtr() < GetProgressRecPtr())
>
> How the above check in patch suppose to work?
> I think it could so happen once bgwriter logs snapshot, both checkpoint
> and progressAt won't get updated any further and I think this check will
> allow to log snapshots in such a case as well.

The only purpose of this check is to do the following thing: if no WAL
activity has happened since last checkpoint, there is no need to log a
standby snapshot from the perspective of the bgwriter. In case the
system is idle, we want to skip logging that and avoid unnecessary
checkpoints because those records would have been generated. If the
system is not idle, or to put it in other words there has been at
least one record since the last checkpoint, we would log a standby
snapshot, enforcing as well a checkpoint to happen the next time
CreateCheckpoint() is gone through, and a standby snapshot is logged
as well after the checkpoint contents are flushed. I am not sure I
understand what you are getting at...
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

10 February 2016, 03:41:44

On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
> > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
> >> Well, the idea is to improve the system responsiveness. Imagine that
> >> the call to GetProgressRecPtr() is done within the exclusive lock
> >> portion, but that while scanning the WAL insert lock entries another
> >> backend is waiting to take a lock on them, and this backend is trying
> >> to insert the first record that updates the progress LSN since the
> >> last checkpoint. In this case the checkpoint would be skipped.
> >> However, imagine that such a record is able to get through it while
> >> scanning the progress values in the WAL insert locks, in which case a
> >> checkpoint would be generated.
> >
> > Such case was not covered without your patch and I don't see the
> > need of same especially at the cost of taking locks.
>
> In this code path that's quite cheap, and it clearly improves the
> decision-making when wal_level >= hs which is now rather broken (say
> not optimized much).
>
> >> In any case, I think that we should try
> >> to get exclusive lock areas as small as possible if we have ways to do
> >> so.
> >>
> >
> > Sure, but not when you are already going to take lock for longer
> > duration.
>
> Why would an exclusive lock be taken for a longer duration in the
> checkpoint portion?

Consider below code:

+ * Get progress before acquiring insert locks to shorten the locked

+ * section waiting ahead.

+ */

+ progress_lsn = GetProgressRecPtr();

+ /*

* We must block concurrent insertions while examining insert state to

* determine the checkpoint REDO pointer.

WALInsertLockAcquireExclusive();

In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to

retrieve the latest value of progressAt and then again it will take

all the WALInsertLocks in WALInsertLockAcquireExclusive() and then

do some work. Now I think it is okay to retrieve the latest of progressAt

after WALInsertLockAcquireExclusive() as you don't need to take the

locks again.

>
> > - last_snapshot_lsn != GetXLogInsertRecPtr())
> > +
> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
> >
> > How the above check in patch suppose to work?
> > I think it could so happen once bgwriter logs snapshot, both checkpoint
> > and progressAt won't get updated any further and I think this check will
> > allow to log snapshots in such a case as well.
>
> The only purpose of this check is to do the following thing: if no WAL
> activity has happened since last checkpoint, there is no need to log a
> standby snapshot from the perspective of the bgwriter. In case the
> system is idle, we want to skip logging that and avoid unnecessary
> checkpoints because those records would have been generated. If the
> system is not idle, or to put it in other words there has been at
> least one record since the last checkpoint, we would log a standby
> snapshot, enforcing as well a checkpoint to happen the next time
> CreateCheckpoint() is gone through, and a standby snapshot is logged
> as well after the checkpoint contents are flushed. I am not sure I
> understand what you are getting at...
>

Let me try to say again, suppose ControlFile->checkPoint is at

100 and latest value of progressAt returned by GetProgressRecPtr

is 105, so the first time the above check happens, it will allow

to log standby snapshot which is okay, now assume there is no

activity, neither there is any checkpoint and again after

LOG_SNAPSHOT_INTERVAL_MS interval, when the above check

gets executed, it will pass and log the standby snapshot which is

*not* okay, because we don't want to log standby snapshot when

there is no activity. Am I missing something?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

10 February 2016, 06:48:52

On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Tue, Feb 9, 2016 at 10:42 PM, Amit Kapila wrote:
>> > On Tue, Feb 9, 2016 at 6:08 PM, Michael Paquier wrote:
>> >> Well, the idea is to improve the system responsiveness. Imagine that
>> >> the call to GetProgressRecPtr() is done within the exclusive lock
>> >> portion, but that while scanning the WAL insert lock entries another
>> >> backend is waiting to take a lock on them, and this backend is trying
>> >> to insert the first record that updates the progress LSN since the
>> >> last checkpoint. In this case the checkpoint would be skipped.
>> >> However, imagine that such a record is able to get through it while
>> >> scanning the progress values in the WAL insert locks, in which case a
>> >> checkpoint would be generated.
>> >
>> > Such case was not covered without your patch and I don't see the
>> > need of same especially at the cost of taking locks.
>>
>> In this code path that's quite cheap, and it clearly improves the
>> decision-making when wal_level >= hs which is now rather broken (say
>> not optimized much).
>>
>> >> In any case, I think that we should try
>> >> to get exclusive lock areas as small as possible if we have ways to do
>> >> so.
>> >>
>> >
>> > Sure, but not when you are already going to take lock for longer
>> > duration.
>>
>> Why would an exclusive lock be taken for a longer duration in the
>> checkpoint portion?
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.

Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.

>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?

Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)

That's your point?

--
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

10 February 2016, 07:11:09

On Wed, Feb 10, 2016 at 12:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.

Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.

Taking and releasing locks again and again (which is done in patch)

matters much more than adding few instructions under it and I think

if you would have written the code in-a-way as in patch in some of

the critical path, it would have been regressed badly, but because

checkpoint doesn't happen often, reproducing regression is difficult.

Anyway, there is no-point in too much argument, I think you have

understood what I wanted to say and if you think the way code is

currently written is better, then lets leave as it is for somebody else

to give suggestion on it.

>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?

Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)

Why do you think including checkpoint related check is

required, how about something like below:

(now >= timeout &&
last_progress_ptr < current_progress_ptr)

That's your point?

Yes.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

10 February 2016, 07:31:38

On Wed, Feb 10, 2016 at 4:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 10, 2016 at 12:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Feb 10, 2016 at 12:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 7:17 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>
> Consider below code:
>
> /*
> + * Get progress before acquiring insert locks to shorten the locked
> + * section waiting ahead.
> + */
> + progress_lsn = GetProgressRecPtr();
> +
> + /*
> * We must block concurrent insertions while examining insert state to
> * determine the checkpoint REDO pointer.
> */
> WALInsertLockAcquireExclusive();
>
> In GetProgressRecPtr(), first it take WALInsertLocks one-by-one to
> retrieve the latest value of progressAt and then again it will take
> all the WALInsertLocks in WALInsertLockAcquireExclusive() and then
> do some work. Now I think it is okay to retrieve the latest of progressAt
> after WALInsertLockAcquireExclusive() as you don't need to take the
> locks again.

Sure, that would be OK. Now I am not on board if this means to have the checkpoint take an exclusive locks for a bit longer duration if that's not actually necessary.

Taking and releasing locks again and again (which is done in patch)
matters much more than adding few instructions under it and I think
if you would have written the code in-a-way as in patch in some of
the critical path, it would have been regressed badly, but because
checkpoint doesn't happen often, reproducing regression is difficult.
Anyway, there is no-point in too much argument, I think you have
understood what I wanted to say and if you think the way code is
currently written is better, then lets leave as it is for somebody else
to give suggestion on it.

Okay.

>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?

Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)

Why do you think including checkpoint related check is
required, how about something like below:

(now >= timeout &&
last_progress_ptr < current_progress_ptr)

We need as well to be sure that no standby snapshots are logged just after a checkpoint in this code path when last_progress_ptr is referring to an LSN position *before* the last checkpoint LSN. There is already one snapshot in CreateCheckpoint() that is logged, there is no need of an extra one, explaining why the check on progress since last checkpoint is necessary to me.

Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

10 February 2016, 08:00:38

On Wed, Feb 10, 2016 at 1:01 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Feb 10, 2016 at 4:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> > - last_snapshot_lsn != GetXLogInsertRecPtr())
>> > +
>> > GetLastCheckpointRecPtr() < GetProgressRecPtr())
>> >
>> > How the above check in patch suppose to work?
>> > I think it could so happen once bgwriter logs snapshot, both checkpoint
>> > and progressAt won't get updated any further and I think this check will
>> > allow to log snapshots in such a case as well.
>>
>> The only purpose of this check is to do the following thing: if no WAL
>> activity has happened since last checkpoint, there is no need to log a
>> standby snapshot from the perspective of the bgwriter. In case the
>> system is idle, we want to skip logging that and avoid unnecessary
>> checkpoints because those records would have been generated. If the
>> system is not idle, or to put it in other words there has been at
>> least one record since the last checkpoint, we would log a standby
>> snapshot, enforcing as well a checkpoint to happen the next time
>> CreateCheckpoint() is gone through, and a standby snapshot is logged
>> as well after the checkpoint contents are flushed. I am not sure I
>> understand what you are getting at...
>
> Let me try to say again, suppose ControlFile->checkPoint is at
> 100 and latest value of progressAt returned by GetProgressRecPtr
> is 105, so the first time the above check happens, it will allow
> to log standby snapshot which is okay, now assume there is no
> activity, neither there is any checkpoint and again after
> LOG_SNAPSHOT_INTERVAL_MS interval, when the above check
> gets executed, it will pass and log the standby snapshot which is
> *not* okay, because we don't want to log standby snapshot when
> there is no activity. Am I missing something?

Ah, right. Sorry for not getting your point. OK now I got it. So basically what the code does not is that if there is just one record after a checkpoint we would log every 15s a standby snapshot until the next checkpoint even if nothing happens, but we can save a bunch of standby records. So your point is that we need to save the result of GetProgressRecPtr() at each loop in the bgwriter when a standby snapshot is taken, say in a static XLogRecPtr called last_progress_lsn. And then at the next loop we compare it on the current result of GetProgressRecPtr(), so the condition to check if a standby snapshot should be generated or not in the bgwriter becomes that:
(now >= timeout &&
GetLastCheckpointRecPtr() < current_progress_ptr &&
last_progress_ptr < current_progress_ptr)

Why do you think including checkpoint related check is
required, how about something like below:

(now >= timeout &&
last_progress_ptr < current_progress_ptr)

We need as well to be sure that no standby snapshots are logged just after a checkpoint in this code path when last_progress_ptr is referring to an LSN position *before* the last checkpoint LSN. There is already one snapshot in CreateCheckpoint() that is logged, there is no need of an extra one, explaining why the check on progress since last checkpoint is necessary to me.

Okay, but isn't it better that we remove the snapshot taken

at checkpoint time in the main branch or till where this code is

getting back-patched. Do you see any need of same after

having the logging of snapshot in bgwriter?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

10 February 2016, 08:14:36

On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but isn't it better that we remove the snapshot taken
> at checkpoint time in the main branch or till where this code is
> getting back-patched.   Do you see any need of same after
> having the logging of snapshot in bgwriter?

But this one is necessary as well to allow hot standby faster to
initialize, no? Particularly in the case where a bgwriter snapshot
would have been taken just before the checkpoint, there may be up to
15s until the next one.

And AFAIK, this code would not get a back-patch, as stated by Andres
upthread :( I would think that it is better to actually get a
backpatch, but well...
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

11 February 2016, 03:55:42

On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, but isn't it better that we remove the snapshot taken
> > at checkpoint time in the main branch or till where this code is
> > getting back-patched. Do you see any need of same after
> > having the logging of snapshot in bgwriter?
>
> But this one is necessary as well to allow hot standby faster to
> initialize, no? Particularly in the case where a bgwriter snapshot
> would have been taken just before the checkpoint, there may be up to
> 15s until the next one.
>

It could be helpful if checkpoint is done at shorter intervals, otherwise

we anyway log it at 15s interval and if need faster initialisation

of hot-standby, then it is better to reduce the log snapshot interval

in bgwriter. I think if go by my reasoning then it should have been

removed when bgwriter is changed to log snapshot, so I think you

can take this suggestion if you also feel the same, anyway this

is not a correctness issue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

11 February 2016, 08:20:22

On 2016-02-11 09:25:30 +0530, Amit Kapila wrote:
> On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
> >
> > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > > Okay, but isn't it better that we remove the snapshot taken
> > > at checkpoint time in the main branch or till where this code is
> > > getting back-patched.   Do you see any need of same after
> > > having the logging of snapshot in bgwriter?
> >
> > But this one is necessary as well to allow hot standby faster to
> > initialize, no? Particularly in the case where a bgwriter snapshot
> > would have been taken just before the checkpoint, there may be up to
> > 15s until the next one.
> >
> 
> It could be helpful if checkpoint is done at shorter intervals, otherwise
> we anyway log it at 15s interval and if need faster initialisation
> of hot-standby, then it is better to reduce the log snapshot interval
> in bgwriter.

No. By emitting the first snapshot directly after the determination of
the redo pointer, we can get into STANDBY_SNAPSHOT_READY more
quickly. Especially if some of the snapshots are overflowed. During
startup 15s can be a *long* time; but on the other hand there's not much
benefit at logging snapshots more frequently when the system is up.

I don't think we should tinker with the frequency/logging points, while
fixing the issue here.

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

11 February 2016, 08:46:31

On Thu, Feb 11, 2016 at 5:20 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-11 09:25:30 +0530, Amit Kapila wrote:
>> On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>> >
>> > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > > Okay, but isn't it better that we remove the snapshot taken
>> > > at checkpoint time in the main branch or till where this code is
>> > > getting back-patched.   Do you see any need of same after
>> > > having the logging of snapshot in bgwriter?
>> >
>> > But this one is necessary as well to allow hot standby faster to
>> > initialize, no? Particularly in the case where a bgwriter snapshot
>> > would have been taken just before the checkpoint, there may be up to
>> > 15s until the next one.
>> >
>>
>> It could be helpful if checkpoint is done at shorter intervals, otherwise
>> we anyway log it at 15s interval and if need faster initialisation
>> of hot-standby, then it is better to reduce the log snapshot interval
>> in bgwriter.
>
> No. By emitting the first snapshot directly after the determination of
> the redo pointer, we can get into STANDBY_SNAPSHOT_READY more
> quickly. Especially if some of the snapshots are overflowed. During
> startup 15s can be a *long* time; but on the other hand there's not much
> benefit at logging snapshots more frequently when the system is up.
> I don't think we should tinker with the frequency/logging points, while
> fixing the issue here.

Agreement from here on this point. Andres, what's still remaining on
my side is to know how to we detect in XLoginsert() how we update the
LSN progress. I was willing to add XLogInsertExtended() with a
complementary int/bool flag and let XLogInsert() alone as you don't
like much doing this decision making using just the record type as I
did in one of the patch upthread, however you did not like this idea
either. What exactly do you have in mind?
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

12 February 2016, 13:27:17

On Thu, Feb 11, 2016 at 5:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 11, 2016 at 5:20 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-02-11 09:25:30 +0530, Amit Kapila wrote:
>>> On Wed, Feb 10, 2016 at 1:42 PM, Michael Paquier <michael.paquier@gmail.com>
>>> wrote:
>>> >
>>> > On Wed, Feb 10, 2016 at 5:00 PM, Amit Kapila <amit.kapila16@gmail.com>
>>> wrote:
>>> > > Okay, but isn't it better that we remove the snapshot taken
>>> > > at checkpoint time in the main branch or till where this code is
>>> > > getting back-patched.   Do you see any need of same after
>>> > > having the logging of snapshot in bgwriter?
>>> >
>>> > But this one is necessary as well to allow hot standby faster to
>>> > initialize, no? Particularly in the case where a bgwriter snapshot
>>> > would have been taken just before the checkpoint, there may be up to
>>> > 15s until the next one.
>>> >
>>>
>>> It could be helpful if checkpoint is done at shorter intervals, otherwise
>>> we anyway log it at 15s interval and if need faster initialisation
>>> of hot-standby, then it is better to reduce the log snapshot interval
>>> in bgwriter.
>>
>> No. By emitting the first snapshot directly after the determination of
>> the redo pointer, we can get into STANDBY_SNAPSHOT_READY more
>> quickly. Especially if some of the snapshots are overflowed. During
>> startup 15s can be a *long* time; but on the other hand there's not much
>> benefit at logging snapshots more frequently when the system is up.
>> I don't think we should tinker with the frequency/logging points, while
>> fixing the issue here.
>
> Agreement from here on this point. Andres, what's still remaining on
> my side is to know how to we detect in XLoginsert() how we update the
> LSN progress. I was willing to add XLogInsertExtended() with a
> complementary int/bool flag and let XLogInsert() alone as you don't
> like much doing this decision making using just the record type as I
> did in one of the patch upthread, however you did not like this idea
> either. What exactly do you have in mind?

OK, here is attached a new version that I hope addresses all the
points raised until now. The following things are changed:
- Extend XLogInsert with a new uint8 argument to have flags. As of now
there is one flag: XLOG_INSERT_NO_PROGRESS that can be set to not
update the progress. By default, the progress LSN is updated.
- Add extra check in bgwriter to not log a snapshot to be sure that no
snapshots are logged when there is no activity since last snapshot
taken, and since last checkpoint.
--
Michael

Attachment

hs-checkpoints-v6.patch

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

12 February 2016, 17:37:42

On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm not really a fan. I'd rather change existing callers to add a
> 'flags' bitmask argument. Right now there can't really be XLogInserts()
> in extension code, so that's pretty ok to change.

Yeah, but to what benefit?  You're just turning a smaller patch into a
bigger one and requiring churning a bunch of code that wouldn't
otherwise need to be touched.  I think Michael has a good point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Andres Freund

Date:

12 February 2016, 17:56:41

On 2016-02-12 12:37:35 -0500, Robert Haas wrote:
> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote:
> > I'm not really a fan. I'd rather change existing callers to add a
> > 'flags' bitmask argument. Right now there can't really be XLogInserts()
> > in extension code, so that's pretty ok to change.
> 
> Yeah, but to what benefit?  You're just turning a smaller patch into a
> bigger one and requiring churning a bunch of code that wouldn't
> otherwise need to be touched.  I think Michael has a good point.

It has the advantage of not ending up with an extra interface, that
we're otherwise never going to get rid of? If not now, when would we
remove it? Sure it touches a few more lines, but that's entirely trivial
mechanical changes, so what?

Andres

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

13 February 2016, 04:01:44

On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-12 12:37:35 -0500, Robert Haas wrote:
>> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote:
>> > I'm not really a fan. I'd rather change existing callers to add a
>> > 'flags' bitmask argument. Right now there can't really be XLogInserts()
>> > in extension code, so that's pretty ok to change.
>>
>> Yeah, but to what benefit?  You're just turning a smaller patch into a
>> bigger one and requiring churning a bunch of code that wouldn't
>> otherwise need to be touched.  I think Michael has a good point.
>
> It has the advantage of not ending up with an extra interface, that
> we're otherwise never going to get rid of? If not now, when would we
> remove it? Sure it touches a few more lines, but that's entirely trivial
> mechanical changes, so what?

I don't think that it's a bad thing to have a simple interface for
simple use cases and a complex one for more complex cases.  I don't
feel any need to convert every use of ReadBuffer() in the source code
to ReadBufferExtended(), for example.  Nor do I feel like we should
have changed every call to LockAcquire() instead of introducing
LockAcquireExtended().  This case seems analogous, and there are a few
advantages.  One is that it avoids creating meaningless conflicts when
back-patching.   Another is that when a function gets an extra
argument, that often turns out to be something that happens more than
once.  It's nice not to keep having to whack around every call in the
tree every time the call signature changes.  Also, finding the small
number of callers that need non-default behavior is simplified,
because you can just grep for the Extended version of the function and
there won't be many hits.

I don't feel that there's only one right way to do this, but I think
Michael's position is both reasonable and similar to what we have done
in previous cases of this sort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

13 February 2016, 05:27:03

On Sat, Feb 13, 2016 at 1:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-02-12 12:37:35 -0500, Robert Haas wrote:
>>> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote:
>>> > I'm not really a fan. I'd rather change existing callers to add a
>>> > 'flags' bitmask argument. Right now there can't really be XLogInserts()
>>> > in extension code, so that's pretty ok to change.
>>>
>>> Yeah, but to what benefit?  You're just turning a smaller patch into a
>>> bigger one and requiring churning a bunch of code that wouldn't
>>> otherwise need to be touched.  I think Michael has a good point.
>>
>> It has the advantage of not ending up with an extra interface, that
>> we're otherwise never going to get rid of? If not now, when would we
>> remove it? Sure it touches a few more lines, but that's entirely trivial
>> mechanical changes, so what?

Note: the patch has grown from 15kB to 46kB by switching to the
extended interface to the addition of an argument in XLogInsert().

> I don't feel that there's only one right way to do this, but I think
> Michael's position is both reasonable and similar to what we have done
> in previous cases of this sort.

To be honest, my heart still balances for the Extended() interface.
This reduces the risk of conflicts with back-patching with 9.5.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

19 February 2016, 01:02:23

On Sat, Feb 13, 2016 at 2:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Feb 13, 2016 at 1:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote:
>>> On 2016-02-12 12:37:35 -0500, Robert Haas wrote:
>>>> On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund <andres@anarazel.de> wrote:
>>>> > I'm not really a fan. I'd rather change existing callers to add a
>>>> > 'flags' bitmask argument. Right now there can't really be XLogInserts()
>>>> > in extension code, so that's pretty ok to change.
>>>>
>>>> Yeah, but to what benefit?  You're just turning a smaller patch into a
>>>> bigger one and requiring churning a bunch of code that wouldn't
>>>> otherwise need to be touched.  I think Michael has a good point.
>>>
>>> It has the advantage of not ending up with an extra interface, that
>>> we're otherwise never going to get rid of? If not now, when would we
>>> remove it? Sure it touches a few more lines, but that's entirely trivial
>>> mechanical changes, so what?
>
> Note: the patch has grown from 15kB to 46kB by switching to the
> extended interface to the addition of an argument in XLogInsert().
>
>> I don't feel that there's only one right way to do this, but I think
>> Michael's position is both reasonable and similar to what we have done
>> in previous cases of this sort.
>
> To be honest, my heart still balances for the Extended() interface.
> This reduces the risk of conflicts with back-patching with 9.5.

Andres, others, what else can I do to make this thread move on? I can
produce any version of this patch depending on committer's
requirements.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Amit Kapila

Date:

19 February 2016, 13:33:44

On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>
> OK, here is attached a new version that I hope addresses all the
> points raised until now. The following things are changed:
> - Extend XLogInsert with a new uint8 argument to have flags. As of now
> there is one flag: XLOG_INSERT_NO_PROGRESS that can be set to not
> update the progress. By default, the progress LSN is updated.
> - Add extra check in bgwriter to not log a snapshot to be sure that no
> snapshots are logged when there is no activity since last snapshot
> taken, and since last checkpoint.
>

You doesn't seem to have taken care of below typo in your patch as

pointed out by me earlier.

+ * to not rely on taking an exclusive lock an all the WAL insertion locks,

/an all/on all

Does this in anyway take care of the case when there is an activity

on unlogged tables?

I think avoiding to perform checkpoints in an idle system is introduced

in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 whereas

unlogged relation is introduced by commit

53dbc27c62d8e1b6c5253feba04a5094cb8fe046 which is much later.

Now, I think one might consider it okay to not do anything for

unlogged tables as it is not done previously and this patch is

anyway improving the current situation, but I feel if we agree

that checkpoints will get skipped if the only operations that

are happening in the system are on unlogged relations, then

it is better to add it in comments as an improvement even if we

don't want to address it as part of this patch.

+ elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",

+ (uint32) (progress_lsn >> 32), (uint32) progress_lsn,

+ (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);

Are you proposing to have the newly intorduced elog messages

as part of commit or are these just for debugging purpose? If you

are proposing for commit, then I think it is worth to justify the need

of same and we should discuss what is the appropriate log level,

otherwise, I think you can have these in an additional patch just for

verification as the main patch is now very close to being

ready for committer.

Also, I think it is worth to once take the performance data for

write tests (using pgbench 30 minute run or some other way) with

minimum checkpoint_timeout (i.e 30s) to see if the additional locking

has any impact on performance. I think taking locks at intervals

of 15s or 30s should not matter much, but it is better to be safe.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

21 February 2016, 12:44:51

On Fri, Feb 19, 2016 at 10:33 PM, Amit Kapila wrote:
> On Fri, Feb 12, 2016 at 6:57 PM, Michael Paquier wrote:
> You doesn't seem to have taken care of below typo in your patch as
> pointed out by me earlier.
>
> + * to not rely on taking an exclusive lock an all the WAL insertion locks,
>
> /an all/on all

Sorry about that. Hopefully fixed now.

> Does this in anyway take care of the case when there is an activity
> on unlogged tables?
> I think avoiding to perform checkpoints in an idle system is introduced
> in commit 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 whereas
> unlogged relation is introduced by commit
> 53dbc27c62d8e1b6c5253feba04a5094cb8fe046 which is much later.
> Now, I think one might consider it okay to not do anything for
> unlogged tables as it is not done previously and this patch is
> anyway improving the current situation, but I feel if we agree
> that checkpoints will get skipped if the only operations that
> are happening in the system are on unlogged relations, then
> it is better to add it in comments as an improvement even if we
> don't want to address it as part of this patch.

Nope, nothing is done, just to not complicate the patch more than
necessary. I agree though that we had better not update the progress
LSN when logging something related to INIT_FORKNUM, there is little
point to run non-forced checkpoints in this case as on recovery
unlogged relations are wiped out.

> + elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt
> %X/%X",
> + (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
> + (uint32) (ControlFile->checkPoint >> 32), (uint32)
> ControlFile->checkPoint);
>
> Are you proposing to have the newly introduced elog messages
> as part of commit or are these just for debugging purpose? If you
> are proposing for commit, then I think it is worth to justify the need
> of same and we should discuss what is the appropriate log level,
> otherwise, I think you can have these in an additional patch just for
> verification as the main patch is now very close to being
> ready for committer.

I have never intended to have those logs being part of the patch that
should be committed, and putting them at LOG level avoided to have a
bunch of garbage in the logs during the tests (got lazy to grep on the
entries for those tests). I am no:t against adding a comment, here is
one I just came up with:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -448,6 +448,12 @@ typedef struct XLogwrtResult
  * hence reducing the impact of the activity lookup. This takes also
  * advantage to avoid 8-byte torn reads on some platforms by using the
  * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */

> Also, I think it is worth to once take the performance data for
> write tests (using pgbench 30 minute run or some other way) with
> minimum checkpoint_timeout (i.e 30s) to see if the additional locking
> has any impact on performance.  I think taking locks at intervals
> of 15s or 30s should not matter much, but it is better to be safe.

I don't think performance will be impacted, but there are no reasons
to not do any measurements either. I'll try to get some graphs
tomorrow with runs on my laptop, mainly looking for any effects of
this patch on TPS when checkpoints show up.

For now attached is an updated patch, with a second patch (b)
including the logs for testing.
--
Michael

Attachment

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Robert Haas

Date:

23 February 2016, 13:31:04

On Fri, Feb 19, 2016 at 6:32 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> To be honest, my heart still balances for the Extended() interface.
>> This reduces the risk of conflicts with back-patching with 9.5.
>
> Andres, others, what else can I do to make this thread move on? I can
> produce any version of this patch depending on committer's
> requirements.

FWIW, I don't expect to have time to review this in the level of
detail that would make me confident to commit it any time soon.  I've
said my piece on what I think the final patch should look like, and I
hope that argument was persuasive, but I don't have anything further
to add to what I already said.  I hope some other committer has some
cycles to look at this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

24 February 2016, 05:40:53

On Tue, Feb 23, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 19, 2016 at 6:32 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> To be honest, my heart still balances for the Extended() interface.
>>> This reduces the risk of conflicts with back-patching with 9.5.
>>
>> Andres, others, what else can I do to make this thread move on? I can
>> produce any version of this patch depending on committer's
>> requirements.
>
> FWIW, I don't expect to have time to review this in the level of
> detail that would make me confident to commit it any time soon.  I've
> said my piece on what I think the final patch should look like, and I
> hope that argument was persuasive, but I don't have anything further
> to add to what I already said.  I hope some other committer has some
> cycles to look at this.

This has the merit to be clear, thanks for the input. Whatever the
approach taken at the end we have two candidates:
- Extend XLogInsert() with an extra argument for flags (Andres)
- Introduce XLogInsertExtended with this extra argument and let
XLogInsert() in peace (Robert and I).
Actually, I lied, there was still something I could do for this
thread: attached are two patches implementing both approaches as
respectively a-1 and a-2. Patch b is the set of logs I used for the
tests to show with a low checkpoint_timeout that checkpoints are
getting correctly skipped on an idle system.

Let's see what happens next.
--
Michael

Attachment

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

29 February 2016, 06:43:10

On Wed, Feb 24, 2016 at 2:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> This has the merit to be clear, thanks for the input. Whatever the
> approach taken at the end we have two candidates:
> - Extend XLogInsert() with an extra argument for flags (Andres)
> - Introduce XLogInsertExtended with this extra argument and let
> XLogInsert() in peace (Robert and I).
> Actually, I lied, there was still something I could do for this
> thread: attached are two patches implementing both approaches as
> respectively a-1 and a-2. Patch b is the set of logs I used for the
> tests to show with a low checkpoint_timeout that checkpoints are
> getting correctly skipped on an idle system.
>
> Let's see what happens next.

FWIW, I just made a couple of 5-min pgbench runs on my laptop (scale
100, 24 clients using 4 threads) with checkpoint_timeout = 30s and a
small value of shared_buffers to minimize the work of each checkpoint,
and I am seeing similar spikes with variations that can be assimilated
to noise. See for example the graph attached. Trying to put more
contention on the extra locks taken before entering the checkpoint
section to scan the progress LSN looks kind of difficult :) But that's
not really a surprise.
--
Michael

Attachment

results2.jpeg

Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

David Steele

Date:

14 March 2016, 17:46:48

On 2/24/16 12:40 AM, Michael Paquier wrote:

> This has the merit to be clear, thanks for the input. Whatever the
> approach taken at the end we have two candidates:
> - Extend XLogInsert() with an extra argument for flags (Andres)
> - Introduce XLogInsertExtended with this extra argument and let
> XLogInsert() in peace (Robert and I).
> Actually, I lied, there was still something I could do for this
> thread: attached are two patches implementing both approaches as
> respectively a-1 and a-2. Patch b is the set of logs I used for the
> tests to show with a low checkpoint_timeout that checkpoints are
> getting correctly skipped on an idle system.

Unfortunately neither A nor B apply anymore.

However, since the patches can still be read through I wonder if Robert 
or Andres would care to opine on whether A or B is better now that they 
can see the full implementation?

For my 2c I'm happy with XLogInsertExtended() since it seems to be a 
rare use case where flags are required.  This can always be refactored 
in the future if/when the use of flags spreads.

I think it would be good to make a decision on this before asking 
Michael to rebase.

Thanks,
-- 
-David
david@pgmasters.net

Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

15 March 2016, 17:11:10

On Mon, Mar 14, 2016 at 6:46 PM, David Steele <david@pgmasters.net> wrote:
> On 2/24/16 12:40 AM, Michael Paquier wrote:
>
>> This has the merit to be clear, thanks for the input. Whatever the
>> approach taken at the end we have two candidates:
>> - Extend XLogInsert() with an extra argument for flags (Andres)
>> - Introduce XLogInsertExtended with this extra argument and let
>> XLogInsert() in peace (Robert and I).
>> Actually, I lied, there was still something I could do for this
>> thread: attached are two patches implementing both approaches as
>> respectively a-1 and a-2. Patch b is the set of logs I used for the
>> tests to show with a low checkpoint_timeout that checkpoints are
>> getting correctly skipped on an idle system.
>
>
> Unfortunately neither A nor B apply anymore.
>
> However, since the patches can still be read through I wonder if Robert or
> Andres would care to opine on whether A or B is better now that they can see
> the full implementation?
>
> For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare
> use case where flags are required.  This can always be refactored in the
> future if/when the use of flags spreads.
>
> I think it would be good to make a decision on this before asking Michael to
> rebase.

That's a bit embarrassing, the last versions should be able to apply
cleanly as there have not been changes in this area of the code
lately... But... I did a mistake when generating the patches by
diff'ing them from an incorrect commit number... This explains why
they exploded in size, so attached are the corrected rebased versions.
Too many patches I guess.. And both of them are attached by the way.
--
Michael

Attachment

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Simon Riggs

Date:

03 April 2016, 20:32:39

On 14 March 2016 at 17:46, David Steele <david@pgmasters.net> wrote:

On 2/24/16 12:40 AM, Michael Paquier wrote:

This has the merit to be clear, thanks for the input. Whatever the
approach taken at the end we have two candidates:
- Extend XLogInsert() with an extra argument for flags (Andres)
- Introduce XLogInsertExtended with this extra argument and let
XLogInsert() in peace (Robert and I).
Actually, I lied, there was still something I could do for this
thread: attached are two patches implementing both approaches as
respectively a-1 and a-2. Patch b is the set of logs I used for the
tests to show with a low checkpoint_timeout that checkpoints are
getting correctly skipped on an idle system.

Unfortunately neither A nor B apply anymore.

However, since the patches can still be read through I wonder if Robert or Andres would care to opine on whether A or B is better now that they can see the full implementation?

For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare use case where flags are required. This can always be refactored in the future if/when the use of flags spreads.

XLogInsertExtended() is the one I would commit, if.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Simon Riggs

Date:

03 April 2016, 21:41:58

On 3 April 2016 at 21:32, Simon Riggs <simon@2ndquadrant.com> wrote:

On 14 March 2016 at 17:46, David Steele <david@pgmasters.net> wrote:
On 2/24/16 12:40 AM, Michael Paquier wrote:

This has the merit to be clear, thanks for the input. Whatever the
approach taken at the end we have two candidates:
- Extend XLogInsert() with an extra argument for flags (Andres)
- Introduce XLogInsertExtended with this extra argument and let
XLogInsert() in peace (Robert and I).
Actually, I lied, there was still something I could do for this
thread: attached are two patches implementing both approaches as
respectively a-1 and a-2. Patch b is the set of logs I used for the
tests to show with a low checkpoint_timeout that checkpoints are
getting correctly skipped on an idle system.

Unfortunately neither A nor B apply anymore.

However, since the patches can still be read through I wonder if Robert or Andres would care to opine on whether A or B is better now that they can see the full implementation?

For my 2c I'm happy with XLogInsertExtended() since it seems to be a rare use case where flags are required. This can always be refactored in the future if/when the use of flags spreads.

XLogInsertExtended() is the one I would commit, if.

I'm not very excited about this patch. Too much code for so little benefit and fragile too.

I'm not even sure what definition of "meaningful progress" is useful. If we commit this, a similar bug could be filed for many similar WAL record types.

Since we zero out WAL files specifically to allow them to be compressed in the archive, this patch saves a few bytes in the archive. Seems like we could fake up some WAL files if needed for restore with an external tool, if we really care.

Since we agree it wouldn't be backpatched, its pretty much saying we don't care.

A promising approach would be to just skip logging the RUNNING_XACTS record if there are no xacts running, which is the case we care about here (no progress => no xacts). HS startup can only happen at a checkpoint, so if there is no WAL file there is no checkpoint, so we don't need the RUNNING_XACTS record to be written. That's a short patch.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

04 April 2016, 06:22:27

On Mon, Apr 4, 2016 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm not very excited about this patch. Too much code for so little benefit
> and fragile too.
>
> I'm not even sure what definition of "meaningful progress" is useful. If we
> commit this, a similar bug could be filed for many similar WAL record types.

Up to now, "progress" means the addition of WAL records by backends,
"meaningful" refers to records that impact the decision of skipping
checkpoints or not.

> Since we zero out WAL files specifically to allow them to be compressed in
> the archive, this patch saves a few bytes in the archive. Seems like we
> could fake up some WAL files if needed for restore with an external tool, if
> we really care.

This assumes that most deployments compress the segments when
archiving them for simplicity. I would believe that the reverse is
true, most of archive command processes do not compress them. And I am
pretty sure that some cloud deployments have as well pricing models
depending on the number of accesses done on an instance.

> Since we agree it wouldn't be backpatched, its pretty much saying we don't
> care.

Seeing the flow of this thread, that's not completely true. Andres has
been arguing for no backpatch, Amit and I not, because the checkpoint
skip logic is incorrect. The logic to decide is a checkpoint should be
skipped or not is still broken on back-branches for wal_level =
hot_standby. That's related to the former complain reported here.

> A promising approach would be to just skip logging the RUNNING_XACTS record
> if there are no xacts running, which is the case we care about here (no
> progress => no xacts). HS startup can only happen at a checkpoint, so if
> there is no WAL file there is no checkpoint, so we don't need the
> RUNNING_XACTS record to be written. That's a short patch.

I am not sure if that's actually simpler... Do you mean here that we
should save the latest transaction ID at the moment a snapshot is
taken and skip the next snapshot if TXID has not progressed? Say in
LogStandbySnapshot() when a snapshot is taken, we allocate the last
XID with ReadNewTransactionId() in XlogCtlData, then compare it the
next time a standby snapshot is taken. With such a logic, you need to
do something like that:
- Save the last transaction XID when RUNNING_XACTS has been logged in shmem
- Save in shmem as well if the previous RUNNING_XACTS has overflowed
its subxids or not. If it has overflowed, we should log RUNNING_XACTS
next time LogStandbySnapshot is logged to prevent the case of pending
snapshots at recovery.

AFAIK, standby snapshots with no xacts running are still useful at
recovery for two things to accelerate a hot standby initialization:
- If the previous snapshot overflowed its subxids, the next empty
snapshot can help out clear the situation. That's the case where the
bgwriter snapshot is helpful actually, because there is no need to
wait for the next checkpoint record to be replayed to initialize a hot
standby.
- An empty snapshot after checkpoint is mandatory, even if it is
empty, no? Knowing that, the next checkpoint cannot be skipped because
the logic mentioned above in xlog.c to decide if a checkpoint should
be skipped or not ignores the fact that a standby snapshot has been
logged. It bases its logic on the sole presence of a checkpoint
record, so this will never work. I mean this check of course:       if ((flags & (CHECKPOINT_IS_SHUTDOWN |
CHECKPOINT_END_OF_RECOVERY|                                 CHECKPOINT_FORCE)) == 0)       {               if (prevPtr
==ControlFile->checkPointCopy.redo &&                       prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
       {                       WALInsertLockRelease();                       LWLockRelease(CheckpointLock);
         END_CRIT_SECTION();                       return;               }       }
 
The approach introducing the concept of WAL progress addresses the
problem of unnecessary checkpoints and of unnecessary standby
snapshots. If we take the problem only from the angle of RUNNING_XACTS
the current logic used to check if a checkpoint should be skipped or
not is not addressed.
-- 
Michael

Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From

Michael Paquier

Date:

20 April 2016, 07:53:40

On Mon, Apr 4, 2016 at 3:22 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> The approach introducing the concept of WAL progress addresses the
> problem of unnecessary checkpoints and of unnecessary standby
> snapshots. If we take the problem only from the angle of RUNNING_XACTS
> the current logic used to check if a checkpoint should be skipped or
> not is not addressed.

This discussion has been stalling for a while regarding the main
patches... Attached are rebased versions based on the approach with
XLogInsertExtended(). And I have added as well a couple of days ago an
entry on the 9.6 open item page, not as a 9.6 open item, but as an
older bug to keep at least track of this problem.
--
Michael