Thread: standby apply lag on inactive servers

standby apply lag on inactive servers

From
Alvaro Herrera
Date:
A customer of ours complained that if you have an inactive primary,
monitoring the apply lag on a standby reports monotonically increasing
lag.  The reason for this is that the apply lag is only updated on
COMMIT records, which of course don't occur in inactive servers.
But CHECKPOINT records do occur, so the WAL insert pointer continues to
move forward, which is what causes the spurious lag.

(I think newer releases are protected from this problem because they
don't emit checkpoints during periods of inactivity.  I didn't verify
this.)

This patch fixes the problem by using the checkpoint timestamp to update
the lag tracker in the standby.  This requires a little change in where
this update is invoked, because previously it was done only for the XACT
rmgr; this makes the patch a little bigger than it should.

-- 
Álvaro Herrera                PostgreSQL Expert, https://www.2ndQuadrant.com/

Attachment

Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-10, Alvaro Herrera wrote:

> A customer of ours complained that if you have an inactive primary,
> monitoring the apply lag on a standby reports monotonically increasing
> lag.  The reason for this is that the apply lag is only updated on
> COMMIT records, which of course don't occur in inactive servers.
> But CHECKPOINT records do occur, so the WAL insert pointer continues to
> move forward, which is what causes the spurious lag.
> 
> (I think newer releases are protected from this problem because they
> don't emit checkpoints during periods of inactivity.  I didn't verify
> this.)
> 
> This patch fixes the problem by using the checkpoint timestamp to update
> the lag tracker in the standby.  This requires a little change in where
> this update is invoked, because previously it was done only for the XACT
> rmgr; this makes the patch a little bigger than it should.

Here's a version of the patch that applies to current master.  It does
fix the problem that CHECKPOINT wal records are not considered when
determining time-of-latest-record.

However, it does *not* fix the monitoring problem I mentioned (which
relied on comparing pg_last_xact_replay_timestamp() to 'now()') ...
because commit 6ef2eba3f57f (pg10) made an idle server not emit
checkpoint records anymore.  That is, my parenthical remark was
completely wrong: the new versions not only are "protected", but also
this fix doesn't fix them.  Luckily, the way to fix monitoring for
servers of versions 10 and later is to use the new replay_lag (etc)
columns in pg_stat_replication, commit 6912acc04f0b (also pg10).

I am inclined to apply this to all branches unless there are strong
objections, because the current code seems pretty arbitrary anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
Actually looking again, getRecordTimestamp is looking pretty strange.
It looks much more natural by using nested switch/case blocks, as with
this diff.  I think the compiler does a better job this way too.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Kyotaro Horiguchi
Date:
Hello.

At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> Actually looking again, getRecordTimestamp is looking pretty strange.
> It looks much more natural by using nested switch/case blocks, as with
> this diff.  I think the compiler does a better job this way too.

Agreed.  Anyway I looked the latest version.

The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
could be included for consistency.)

The timestamp of a checkpoint record is the start time of a checkpoint
(and doesn't have subseconds part, but it's not a problem.). That
means that the timestamp is usually far behind the time at the record
has been inserted. As the result the stored timestamp can go back by a
significant internval. I don't think that causes an actual problem but
the movement looks wierd as the result of
pg_last_xact_replay_timestamp().

Asides from the backward movement, a timestamp from other than
commit/abort records in recvoeryLastXTime affects the following code.

xlog.c:7329: (similar code exists at line 9332)
>    ereport(LOG,
>            (errmsg("redo done at %X/%X",
>                    (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
>    xtime = GetLatestXTime();
>    if (xtime)
>        ereport(LOG,
>                (errmsg("last completed transaction was at log time %s",
>                        timestamptz_to_str(xtime))));

This code assumes (and the name GetLatestXTime() suggests, I first
noticed that here..) that the timestamp comes from commit/abort logs,
so otherwise it shows a wrong timestamp.  We shouldn't update the
variable by other than that kind of records.


If (I don't think that comes true..) we set the timestamp from other
than that kind of record, the names and the comments of the functions
should be changed.

> /*
>  * Save timestamp of latest processed commit/abort record.
>  *
>  * We keep this in XLogCtl, not a simple static variable, so that it can be
>  * seen by processes other than the startup process.  Note in particular
>  * that CreateRestartPoint is executed in the checkpointer.
>  */
> static void
> SetLatestXTime(TimestampTz xtime)
...
> /*
>  * Fetch timestamp of latest processed commit/abort record.
>  */
> TimestampTz
> GetLatestXTime(void)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-28, Kyotaro Horiguchi wrote:

> At Mon, 27 Jan 2020 18:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> > Actually looking again, getRecordTimestamp is looking pretty strange.
> > It looks much more natural by using nested switch/case blocks, as with
> > this diff.  I think the compiler does a better job this way too.
> 
> Agreed.  Anyway I looked the latest version.
> 
> The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
> XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
> could be included for consistency.)

Hello, thanks for looking.

> The timestamp of a checkpoint record is the start time of a checkpoint
> (and doesn't have subseconds part, but it's not a problem.). That
> means that the timestamp is usually far behind the time at the record
> has been inserted. As the result the stored timestamp can go back by a
> significant internval. I don't think that causes an actual problem but
> the movement looks wierd as the result of
> pg_last_xact_replay_timestamp().

Ouch ... yeah, it should be set only if it doesn't go backwards.

> xlog.c:7329: (similar code exists at line 9332)
> >    ereport(LOG,
> >            (errmsg("redo done at %X/%X",
> >                    (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
> >    xtime = GetLatestXTime();
> >    if (xtime)
> >        ereport(LOG,
> >                (errmsg("last completed transaction was at log time %s",
> >                        timestamptz_to_str(xtime))));
> 
> This code assumes (and the name GetLatestXTime() suggests, I first
> noticed that here..) that the timestamp comes from commit/abort logs,
> so otherwise it shows a wrong timestamp.  We shouldn't update the
> variable by other than that kind of records.

Hmm, that's terrible.  GetLatestXTime() being displayed user-visibly for
"last transaction completion" but having it include unrelated things
such as restore points is terrible.  One idea is to should split it in
two: one exclusively for transaction commit/abort, and another for all
WAL activity.  That way, the former can be used for that message, and
the latter for standby replay reports.  However, that might be
overengineering, if the only thing that the former is that one LOG
message; instead changing the log message to state that the time is for
other activity, as you suggest, is simpler and has no downside that I
can see.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-27, Alvaro Herrera wrote:

> Actually looking again, getRecordTimestamp is looking pretty strange.
> It looks much more natural by using nested switch/case blocks, as with
> this diff.  I think the compiler does a better job this way too.

I hadn't noticed I forgot to attach the diff here :-(

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: standby apply lag on inactive servers

From
Kyotaro Horiguchi
Date:
At Tue, 28 Jan 2020 11:18:50 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Jan-27, Alvaro Herrera wrote:
> 
> > Actually looking again, getRecordTimestamp is looking pretty strange.
> > It looks much more natural by using nested switch/case blocks, as with
> > this diff.  I think the compiler does a better job this way too.
> 
> I hadn't noticed I forgot to attach the diff here :-(

Yeay, that patch bases the apply-lag patch:) And contains
XLOG_CHECKPOINT_*. But otherwise looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby apply lag on inactive servers

From
Kyotaro Horiguchi
Date:
At Tue, 28 Jan 2020 11:18:14 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> > xlog.c:7329: (similar code exists at line 9332)
> > >    ereport(LOG,
> > >            (errmsg("redo done at %X/%X",
> > >                    (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
> > >    xtime = GetLatestXTime();
> > >    if (xtime)
> > >        ereport(LOG,
> > >                (errmsg("last completed transaction was at log time %s",
> > >                        timestamptz_to_str(xtime))));
> > 
> > This code assumes (and the name GetLatestXTime() suggests, I first
> > noticed that here..) that the timestamp comes from commit/abort logs,
> > so otherwise it shows a wrong timestamp.  We shouldn't update the
> > variable by other than that kind of records.
> 
> Hmm, that's terrible.  GetLatestXTime() being displayed user-visibly for
> "last transaction completion" but having it include unrelated things
> such as restore points is terrible.  One idea is to should split it in
> two: one exclusively for transaction commit/abort, and another for all
> WAL activity.  That way, the former can be used for that message, and
> the latter for standby replay reports.  However, that might be
> overengineering, if the only thing that the former is that one LOG
> message; instead changing the log message to state that the time is for
> other activity, as you suggest, is simpler and has no downside that I
> can see.

Perhaps we can use ControlData->checkPointCopy.time instead. It misses
checkpoint records intermittently but works in general.

But as more significant issue, nowadays PostgreSQL doesn't run a
checkpoint if it is really inactive (that is, if no "important" WAL
records have issued.).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-29, Kyotaro Horiguchi wrote:

> But as more significant issue, nowadays PostgreSQL doesn't run a
> checkpoint if it is really inactive (that is, if no "important" WAL
> records have issued.).

Yeah, I mentioned this in message
<20200127203419.GA15216@alvherre.pgsql>.  The solution for monitoring
purposes is to use the new "lag" columns in pg_stat_replication.  But
that's not available in older releases (prior to 10), so this change is
still useful.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-28, Kyotaro Horiguchi wrote:

> The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
> XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
> could be included for consistency.)

Hmm I think I should definitely include those.

> The timestamp of a checkpoint record is the start time of a checkpoint
> (and doesn't have subseconds part, but it's not a problem.). That
> means that the timestamp is usually far behind the time at the record
> has been inserted. As the result the stored timestamp can go back by a
> significant internval. I don't think that causes an actual problem but
> the movement looks wierd as the result of
> pg_last_xact_replay_timestamp().

A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held.  That seems way too expensive.

A compromise might be to do the compare only when it's done for
checkpoint.  These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).

> Asides from the backward movement, a timestamp from other than
> commit/abort records in recvoeryLastXTime affects the following code.
> 
> xlog.c:7329: (similar code exists at line 9332)
> >    ereport(LOG,
> >            (errmsg("redo done at %X/%X",
> >                    (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
> >    xtime = GetLatestXTime();
> >    if (xtime)
> >        ereport(LOG,
> >                (errmsg("last completed transaction was at log time %s",
> >                        timestamptz_to_str(xtime))));
> 
> This code assumes (and the name GetLatestXTime() suggests, I first
> noticed that here..) that the timestamp comes from commit/abort logs,
> so otherwise it shows a wrong timestamp.  We shouldn't update the
> variable by other than that kind of records.

Thinking about this some more, I think we should do keep the message the
same backbranches (avoid breaking anything that might be reading the log
-- a remote but not inexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, checkpoint.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Kyotaro Horiguchi
Date:
At Wed, 29 Jan 2020 19:11:31 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Jan-28, Kyotaro Horiguchi wrote:
> 
> > The aim of the patch seem reasonable. XLOG_END_OF_RECOVERY and
> > XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
> > could be included for consistency.)
> 
> Hmm I think I should definitely include those.

I agree to that, given the following change of log messages.

> > The timestamp of a checkpoint record is the start time of a checkpoint
> > (and doesn't have subseconds part, but it's not a problem.). That
> > means that the timestamp is usually far behind the time at the record
> > has been inserted. As the result the stored timestamp can go back by a
> > significant internval. I don't think that causes an actual problem but
> > the movement looks wierd as the result of
> > pg_last_xact_replay_timestamp().
> 
> A problem I see with this is that setting the timestamp is done with
> XLogCtl->lock held; since now we need to make the update conditional,
> we'd have to read the current value, compare with the checkpoint time,
> then set, all with the spinlock held.  That seems way too expensive.
> 
> A compromise might be to do the compare only when it's done for
> checkpoint.  These occur seldom enough that it shouldn't be a problem
> (as opposed to commit records, which can be very frequent).

I think we don't need to that, given the following change.

> > Asides from the backward movement, a timestamp from other than
> > commit/abort records in recvoeryLastXTime affects the following code.
> > 
> > xlog.c:7329: (similar code exists at line 9332)
> > >    ereport(LOG,
> > >            (errmsg("redo done at %X/%X",
> > >                    (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
> > >    xtime = GetLatestXTime();
> > >    if (xtime)
> > >        ereport(LOG,
> > >                (errmsg("last completed transaction was at log time %s",
> > >                        timestamptz_to_str(xtime))));
> > 
> > This code assumes (and the name GetLatestXTime() suggests, I first
> > noticed that here..) that the timestamp comes from commit/abort logs,
> > so otherwise it shows a wrong timestamp.  We shouldn't update the
> > variable by other than that kind of records.
> 
> Thinking about this some more, I think we should do keep the message the
> same backbranches (avoid breaking anything that might be reading the log
> -- a remote but not inexistent possibility), and adjust the message in
> master to be something like "last timestamped WAL activity at time %s",
> and document that it means commit, abort, restore label, checkpoint.

Agreed about backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the follows?

"transactions are completed up to log time %s"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-30, Kyotaro Horiguchi wrote:

> Agreed about backbranches. I'd like to preserve the word "transaction"
> as it is more familiar to users. How about something like the follows?
> 
> "transactions are completed up to log time %s"

That's a good point.  I used the phrase "transaction activity", which
seems sufficiently explicit to me.

So, the attached is the one for master; in back branches I would use the
same (plus minor conflict fixes), except that I would drop the message
wording changes.

Thanks for the reviews so far,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: standby apply lag on inactive servers

From
Kyotaro Horiguchi
Date:
At Thu, 30 Jan 2020 17:45:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Jan-30, Kyotaro Horiguchi wrote:
> 
> > Agreed about backbranches. I'd like to preserve the word "transaction"
> > as it is more familiar to users. How about something like the follows?
> > 
> > "transactions are completed up to log time %s"
> 
> That's a good point.  I used the phrase "transaction activity", which
> seems sufficiently explicit to me.
> 
> So, the attached is the one for master; in back branches I would use the
> same (plus minor conflict fixes), except that I would drop the message
> wording changes.
> 
> Thanks for the reviews so far,

My pleasure.

regads.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby apply lag on inactive servers

From
Fujii Masao
Date:

On 2020/01/31 5:45, Alvaro Herrera wrote:
> On 2020-Jan-30, Kyotaro Horiguchi wrote:
> 
>> Agreed about backbranches. I'd like to preserve the word "transaction"
>> as it is more familiar to users. How about something like the follows?
>>
>> "transactions are completed up to log time %s"
> 
> That's a good point.  I used the phrase "transaction activity", which
> seems sufficiently explicit to me.
> 
> So, the attached is the one for master; in back branches I would use the
> same (plus minor conflict fixes), except that I would drop the message
> wording changes.

You're thinking to apply this change to the back branches? Sorry
if my understanding is not right. But I don't think that back-patch
is ok because it changes the documented existing behavior
of pg_last_xact_replay_timestamp(). So it looks like the behavior
change not a bug fix.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-31, Fujii Masao wrote:

> You're thinking to apply this change to the back branches? Sorry
> if my understanding is not right. But I don't think that back-patch
> is ok because it changes the documented existing behavior
> of pg_last_xact_replay_timestamp(). So it looks like the behavior
> change not a bug fix.

Yeah, I am thinking in backpatching it.  The documented behavior is
already not what the code does.  Do you have a situation where this
change would break something?  If so, can you please explain what it is?

I think (and I said it upthread) a 100% complete fix involves tracking
two timestamps rather than one.  I was thinking that that would be too
invasive because it changes XLogCtlData shmem struct ... but that struct
is private to xlog.c, so I think it's fine to change the struct.  The
problem though is that the user-visible change that I want to achieve is
pg_last_xact_replay_timestamp(), and it would be obviously wrong to use
the new XLogCtlData field rather than the existing one, as that would be
a behavior change in the same sense that you're now complaining about.
So I would achieve nothing.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Fujii Masao
Date:

On 2020/01/31 22:40, Alvaro Herrera wrote:
> On 2020-Jan-31, Fujii Masao wrote:
> 
>> You're thinking to apply this change to the back branches? Sorry
>> if my understanding is not right. But I don't think that back-patch
>> is ok because it changes the documented existing behavior
>> of pg_last_xact_replay_timestamp(). So it looks like the behavior
>> change not a bug fix.
> 
> Yeah, I am thinking in backpatching it.  The documented behavior is
> already not what the code does.

Maybe you thought this because getRecordTimestamp() extracts the
timestamp from even WAL record of a restore point? That is, you're
concerned about that pg_last_xact_replay_timestamp() returns the
timestamp of not only commit/abort record but also restore point one.
Right?

As far as I read the code, this problem doesn't occur because
SetLatestXTime() is called only for commit/abort records, in
recoveryStopsAfter(). No?

>  Do you have a situation where this
> change would break something?  If so, can you please explain what it is?

For example, use the return value of pg_last_xact_replay_timestamp()
(and also the timestamp in the log message output at the end of
recovery) as a HINT when setting recovery_target_time later.
Use it to compare with the timestamp retrieved from the master server,
in order to monitor the replication delay.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



Re: standby apply lag on inactive servers

From
Alvaro Herrera
Date:
On 2020-Jan-31, Fujii Masao wrote:
> On 2020/01/31 22:40, Alvaro Herrera wrote:
> > On 2020-Jan-31, Fujii Masao wrote:
> > 
> > > You're thinking to apply this change to the back branches? Sorry
> > > if my understanding is not right. But I don't think that back-patch
> > > is ok because it changes the documented existing behavior
> > > of pg_last_xact_replay_timestamp(). So it looks like the behavior
> > > change not a bug fix.
> > 
> > Yeah, I am thinking in backpatching it.  The documented behavior is
> > already not what the code does.
> 
> Maybe you thought this because getRecordTimestamp() extracts the
> timestamp from even WAL record of a restore point? That is, you're
> concerned about that pg_last_xact_replay_timestamp() returns the
> timestamp of not only commit/abort record but also restore point one.
> Right?

right.

> As far as I read the code, this problem doesn't occur because
> SetLatestXTime() is called only for commit/abort records, in
> recoveryStopsAfter(). No?

... uh, wow, you're right about that too.  IMO this is extremely
fragile, easy to break, and under-documented.  But you're right, there's
no bug there at present.

> >  Do you have a situation where this
> > change would break something?  If so, can you please explain what it is?
> 
> For example, use the return value of pg_last_xact_replay_timestamp()
> (and also the timestamp in the log message output at the end of
> recovery) as a HINT when setting recovery_target_time later.

Hmm.

I'm not sure how you would use it in that way.  I mean, I understand how
it *can* be used that way, but it seems too fragile to be done in
practice, in a scenario that's not just laboratory games.

> Use it to compare with the timestamp retrieved from the master server,
> in order to monitor the replication delay.

That's precisely the use case that I'm aiming at.  The timestamp
currently is not useful because this usage breaks when the primary is
inactive (no COMMIT records occur).  During such periods of inactivity,
CHECKPOINT records would keep the "last xtime" current.  This has
actually happened in a production setting, it's not a thought
experiment.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: standby apply lag on inactive servers

From
Fujii Masao
Date:

On 2020/01/31 23:47, Alvaro Herrera wrote:
> On 2020-Jan-31, Fujii Masao wrote:
>> On 2020/01/31 22:40, Alvaro Herrera wrote:
>>> On 2020-Jan-31, Fujii Masao wrote:
>>>
>>>> You're thinking to apply this change to the back branches? Sorry
>>>> if my understanding is not right. But I don't think that back-patch
>>>> is ok because it changes the documented existing behavior
>>>> of pg_last_xact_replay_timestamp(). So it looks like the behavior
>>>> change not a bug fix.
>>>
>>> Yeah, I am thinking in backpatching it.  The documented behavior is
>>> already not what the code does.
>>
>> Maybe you thought this because getRecordTimestamp() extracts the
>> timestamp from even WAL record of a restore point? That is, you're
>> concerned about that pg_last_xact_replay_timestamp() returns the
>> timestamp of not only commit/abort record but also restore point one.
>> Right?
> 
> right.
> 
>> As far as I read the code, this problem doesn't occur because
>> SetLatestXTime() is called only for commit/abort records, in
>> recoveryStopsAfter(). No?
> 
> ... uh, wow, you're right about that too.  IMO this is extremely
> fragile, easy to break, and under-documented.

Yeah, it's worth improving the code.

> But you're right, there's
> no bug there at present.
> 
>>>   Do you have a situation where this
>>> change would break something?  If so, can you please explain what it is?
>>
>> For example, use the return value of pg_last_xact_replay_timestamp()
>> (and also the timestamp in the log message output at the end of
>> recovery) as a HINT when setting recovery_target_time later.
> 
> Hmm.
> 
> I'm not sure how you would use it in that way.  I mean, I understand how
> it *can* be used that way, but it seems too fragile to be done in
> practice, in a scenario that's not just laboratory games.
> 
>> Use it to compare with the timestamp retrieved from the master server,
>> in order to monitor the replication delay.
> 
> That's precisely the use case that I'm aiming at.  The timestamp
> currently is not useful because this usage breaks when the primary is
> inactive (no COMMIT records occur).  During such periods of inactivity,
> CHECKPOINT records would keep the "last xtime" current.  This has
> actually happened in a production setting, it's not a thought
> experiment.

I've heard that someone periodically generates dummy tiny
transactions (say, every minute), as a band-aid solution,
to avoid inactive primary. Of course, this is not a perfect solution.

The idea that I proposed previously was to introduce
pg_last_xact_insert_timestamp() [1] into core. This function returns
the timestamp of commit / abort records in *primary* side.
So we can retrieve that timestamp from the primary (e.g., by using dblink)
and compare its result with pg_last_xact_replay_timestamp() to
calculate the delay in the standby.

Another idea is to include the commit / abort timestamp in
primary-keepalive-message that periodially sent from the primary
to the standby. Then if we introduce the function returning
that timestamp, in the standby side, we can easily compare
the commit / abort timestamps taken from both primary and
standby, in the standby.

[1] https://www.postgresql.org/message-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters