Thread: Fetching timeline during recovery

Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
Hello,

Fetching the timeline from a standby could be useful in various situations,
either for backup tools [1] or for failover tools during some kind of election
process.

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.

The previous restriction on these functions seems related to ThisTimeLineID not
being safe on a standby. This patch fetches the timeline from
WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand, this
is updated each time some data is flushed to the WAL.

As the SQL function pg_last_wal_receive_lsn() reads WalRcv->receivedUpto, which
is updated at the same time, any tool relying on these functions should be
quite fine. It will just have to parse the TL from the WAL file name.
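As a side note, extracting the TL from a WAL file name is simple, since the first 8 hex digits of the 24-character name encode the timeline. A minimal standalone sketch (the helper name is mine, not part of the patch):

```c
#include <assert.h>
#include <stdio.h>

/* Extract the timeline ID from a 24-character WAL segment file name
 * (TTTTTTTTXXXXXXXXYYYYYYYY): the first 8 hex digits are the timeline.
 * Hypothetical helper, not part of the patch; assumes a well-formed name. */
static unsigned int tli_from_walfile_name(const char *fname)
{
    unsigned int tli = 0;

    /* %8X consumes at most 8 hex digits, i.e. exactly the timeline field */
    sscanf(fname, "%8X", &tli);
    return tli;
}
```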

It doesn't seem perfectly sane though. I suspect a race condition in any SQL
statement that tries to get the LSN and the WAL file name at the same time
if the timeline changes in the meantime. Ideally, a function should be able to
return both the LSN and the TL at the same time, with only one read from WalRcv.
I'm not sure if I should change the result of pg_last_wal_receive_lsn() or add
a brand new admin function. Any advice?

Last, I plan to produce an extension to support this on older releases. Is
this something that could be integrated into the official source tree during a
minor release, or should I publish it on e.g. PGXN?

Regards,

[1]
https://www.postgresql.org/message-id/flat/BF2AD4A8-E7F5-486F-92C8-A6959040DEB6%40yandex-team.ru

Attachment

Re: Fetching timeline during recovery

From
Andrey Borodin
Date:

> On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> Fetching the timeline from a standby could be useful in various situation.
> Either for backup tools [1] or failover tools during some kind of election
> process.
That backup tool is reading timeline from pg_control_checkpoint(). And formats WAL file name itself when necessary.

> Please, find in attachment a first trivial patch to support pg_walfile_name()
> and pg_walfile_name_offset() on a standby.

You just cannot format a WAL file name for an LSN when the timeline changed,
because there are at least three WAL files for that point: previous, new and
partial. However, reading the TLI from the checkpoint seems safe for backup
purposes. The only reason for WAL-G to read that timeline is to mark a backup
invalid: if its name is base_00000001XXXXXXXXYY00000YY and a timeline change
happens, it should be named base_00000002XXXXXXXXYY00000YY (the consistency
point is not on TLI 2), but WAL-G cannot rename a backup during backup-push.
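To make this concrete, here is a standalone sketch of the three candidate segment names around a switch from TLI 1 to TLI 2 (assuming the default 16 MB segment size; `walfile_name` is a simplified mimic of the backend's XLogFileName() macro, not actual PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE    (16 * 1024 * 1024)            /* default segment size */
#define SEGS_PER_XLOGID (0x100000000ULL / WAL_SEG_SIZE)

/* Simplified mimic of XLogFileName(): TLI, then the segment number split
 * into high and low 32-bit halves, all as 8-digit hex fields. */
static void walfile_name(char *buf, size_t len, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEG_SIZE;

    snprintf(buf, len, "%08X%08X%08X", tli,
             (uint32_t) (segno / SEGS_PER_XLOGID),
             (uint32_t) (segno % SEGS_PER_XLOGID));
}

/* For an LSN in segment 3 around a 1 -> 2 timeline switch, up to three
 * files can exist for the same point:
 *   000000010000000000000003          on the previous timeline
 *   000000020000000000000003          on the new timeline
 *   000000010000000000000003.partial  the renamed partial segment
 */
```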

Hope this information is useful. Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/wal-g/wal-g/blob/master/internal/timeline.go#L39


Re: Fetching timeline during recovery

From
David Steele
Date:
On 7/23/19 2:59 PM, Andrey Borodin wrote:
> 
>> On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>>
>> Fetching the timeline from a standby could be useful in various situation.
>> Either for backup tools [1] or failover tools during some kind of election
>> process.
> That backup tool is reading timeline from pg_control_checkpoint(). And formats WAL file name itself when necessary.

We do the same [1].

>> Please, find in attachment a first trivial patch to support pg_walfile_name()
>> and pg_walfile_name_offset() on a standby.
> 
> You just cannot format WAL file name for LSN when timeline changed. Because
> there are at least three WALs for that point: previous, new and partial.
> However, reading TLI from checkpoint seems safe for backup purposes.

> The only reason for WAL-G to read that timeline is to mark backup invalid: if
> its name is base_00000001XXXXXXXXYY00000YY and timeline change happens, it
> should be named base_00000002XXXXXXXXYY00000YY (consistency point is not on
> TLI 2), but WAL-G cannot rename backup during backup-push.

Naming considerations aside, I don't think that a timeline switch during
a standby backup is a good idea, mostly because it is (currently) not
tested.  We don't allow it in pgBackRest.

[1]
https://github.com/pgbackrest/pgbackrest/blob/release/2.15.1/lib/pgBackRest/Db.pm#L1008

-- 
-David
david@pgmasters.net



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Tue, 23 Jul 2019 16:00:29 -0400
David Steele <david@pgmasters.net> wrote:

> On 7/23/19 2:59 PM, Andrey Borodin wrote:
> >
> >> On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> >> wrote:
> >>
> >> Fetching the timeline from a standby could be useful in various situation.
> >> Either for backup tools [1] or failover tools during some kind of election
> >> process.
> > That backup tool is reading timeline from pg_control_checkpoint(). And
> > formats WAL file name itself when necessary.
>
> We do the same [1].

Thank you both for your comments.

OK, so backup tools are fine with reading slightly outdated data from the
controldata file.

Anyway, my use case is mostly about auto failover. During an election, I
currently have to force a checkpoint on standbys to get their real timeline
from the controldata.

However, the forced checkpoint can be very long [1] (considering auto
failover). I need to be able to compare TLs without all the burden of a
CHECKPOINT just for this.

As I wrote, my favorite solution would be a function returning BOTH the
current TL and LSN at the same time. I'll send a patch to the list tomorrow
and bikeshed later depending on the feedback.

In the meantime, the previous patch might still be useful for some other
purposes. Comments are welcome.

Thanks,

[1] this exact use case is actually hiding behind this thread:
https://www.postgresql.org/message-id/flat/CAEkBuzeno6ztiM1g4WdzKRJFgL8b2nfePNU%3Dq3sBiEZUm-D-sQ%40mail.gmail.com



Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:
> Please, find in attachment a first trivial patch to support pg_walfile_name()
> and pg_walfile_name_offset() on a standby.
> Previous restriction on this functions seems related to ThisTimeLineID not
> being safe on standby. This patch is fetching the timeline from
> WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
> this is updated each time some data are flushed to the WAL.

FWIW, I don't have any objections to lifting the restrictions on those
functions a bit if we can make that reliable enough.  Now, during
recovery you cannot rely on ThisTimeLineID as you say, mostly per the
following bit in xlog.c (the comment block a little bit up also has
explanations):
    /*
     * ThisTimeLineID is normally not set when we're still in recovery.
     * However, recycling/preallocating segments above needed ThisTimeLineID
     * to determine which timeline to install the segments on. Reset it now,
     * to restore the normal state of affairs for debugging purposes.
     */
    if (RecoveryInProgress())
        ThisTimeLineID = 0;

Your patch does not account for the case of archive recovery, where
there is no WAL receiver; as the shared memory state of the WAL
receiver is not set, 0 would be returned.  The replay timeline is
something we could use here instead, via GetXLogReplayRecPtr().
CreateRestartPoint actually takes the latest WAL receiver or replayed
point for its end LSN position, whichever is newer.

> Last, I plan to produce an extension to support this on older release. Is
> it something that could be integrated in official source tree during a minor
> release or should I publish it on eg. pgxn?

Unfortunately no.  This is a behavior change, so it cannot find its way
into the back branches.  The WAL receiver state is in shared memory and
published, so that's easy enough to get.  We don't do that for XLogCtl,
unfortunately.  I think that there are arguments for being more
flexible with it, and perhaps for having a system-level view to be able
to look at some of its fields.

There is also a downside with get_controlfile(): it fetches the data
directly from the on-disk pg_control file, and post-recovery this only
gets updated at the first checkpoint.
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
Hello Michael,

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:
> > Please, find in attachment a first trivial patch to support
> > pg_walfile_name() and pg_walfile_name_offset() on a standby.
> > Previous restriction on this functions seems related to ThisTimeLineID not
> > being safe on standby. This patch is fetching the timeline from
> > WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
> > this is updated each time some data are flushed to the WAL.  
[...]
> Your patch does not count for the case of archive recovery, where
> there is no WAL receiver, and as the shared memory state of the WAL
> receiver is not set 0 would be set.

Indeed. I tested this with the following query and was fine with the
NULL result:

  select pg_walfile_name(pg_last_wal_receive_lsn());

I was fine with this result because my use case requires replication anyway. A
NULL result would mean that the node never streamed from the old primary since
its last startup, so a failover should ignore it anyway.

However, the NULL here just comes from pg_last_wal_receive_lsn(). The following
query's result is wrong:

  select pg_walfile_name('0/1');
  000000000000000000000000

I fixed that. See patch 0001-v2-* in attachment.
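The all-zeros name falls out of how the file name is formatted: with the walreceiver TLI still 0 and LSN 0/1 (segment 0), every field prints as zero. A standalone mimic (assuming the default 16 MB segment size; `walfile_name` is a hypothetical stand-in for the backend's XLogFileName() macro):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE    (16 * 1024 * 1024)
#define SEGS_PER_XLOGID (0x100000000ULL / WAL_SEG_SIZE)

/* Simplified mimic of XLogFileName(). */
static void walfile_name(char *buf, size_t len, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEG_SIZE;

    snprintf(buf, len, "%08X%08X%08X", tli,
             (uint32_t) (segno / SEGS_PER_XLOGID),
             (uint32_t) (segno % SEGS_PER_XLOGID));
}
```

With tli = 0 and lsn = 0x1 this formats the bogus all-zeros name above; rejecting an unset (zero) TLI and returning NULL instead avoids it.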


> The replay timeline is something we could use here instead via
> GetXLogReplayRecPtr(). CreateRestartPoint actually takes the latest WAL
> receiver or replayed point for its end LSN position, whichever is newer.

I did consider GetXLogReplayRecPtr() or even XLogCtl->replayEndTLI (which is
updated right before replay). However, both depend on the replay activity on
the standby. That's why I picked WalRcv->receivedTLI, which is updated
regardless of the replay activity on the standby.

> > Last, I plan to produce an extension to support this on older release. Is
> > it something that could be integrated in official source tree during a minor
> > release or should I publish it on eg. pgxn?  
> 
> Unfortunately no.  This is a behavior change so it cannot find its way
> into back branches.

Yes, my patch is a behavior change. But here, I was talking about an
extension, not the core itself, to support this feature in older releases.

> The WAL receiver state is in shared memory and published, so that's easy
> enough to get.  We don't do that for XLogCtl unfortunately.

Both are in shared memory, but WalRcv has a public function to get its
receivedTLI member.

XLogCtl has nothing in public to expose its ThisTimeLineID member. However, from
a module, I'm able to fetch it using:

  XLogCtl = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &found);
  SpinLockAcquire(&XLogCtl->info_lck);
  tl = XLogCtl->ThisTimeLineID;
  SpinLockRelease(&XLogCtl->info_lck);

As the "XLOG Ctl" index entry already exists in shared memory, ShmemInitStruct
returns the correct structure from there. I'm not sure it was supposed to be
used this way though... Adding a public function might be cleaner, but it would
not help for older releases.

> I think that there are arguments for being more flexible with it, and perhaps
> have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion going.

> There is also a downside with get_controlfile(), which is that it
> fetches directly the data from the on-disk pg_control, and
> post-recovery this only gets updated at the first checkpoint.

Indeed, that's why I started this patch and thread.

Thanks,

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
Hello,

On Wed, 24 Jul 2019 14:33:27 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Wed, 24 Jul 2019 09:49:05 +0900
> Michael Paquier <michael@paquier.xyz> wrote:
> 
> > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > wrote:  
[...]
> > I think that there are arguments for being more flexible with it, and
> > perhaps have a system-level view to be able to look at some of its fields.  
> 
> Great idea. I'll give it a try to keep the discussion on.

After some thinking, I did not find enough data to expose to justify the
creation of a system-level view. As I just need the current timeline, I
wrote "pg_current_timeline()". Please find the patch in attachment.

The current behavior is quite simple: 
* if the cluster is in production, return ThisTimeLineID
* else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)

This is a really naive implementation. We should probably add some code around
the startup process to gather and share general recovery stats. This would
allow fetching e.g. the current recovery method, the latest WAL file name
restored from archives or streaming, its timeline, etc.

Any thoughts?

Regards,

Attachment

Re: Fetching timeline during recovery

From
Kyotaro Horiguchi
Date:
Hi.

At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in
<20190725193808.1648ddc8@firost>
> On Wed, 24 Jul 2019 14:33:27 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> 
> > On Wed, 24 Jul 2019 09:49:05 +0900
> > Michael Paquier <michael@paquier.xyz> wrote:
> > 
> > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > wrote:  
> [...]
> > > I think that there are arguments for being more flexible with it, and
> > > perhaps have a system-level view to be able to look at some of its fields.  
> > 
> > Great idea. I'll give it a try to keep the discussion on.
> 
> After some thinking, I did not find enough data to expose to justify the
> creation a system-level view. As I just need the current timeline I
> wrote "pg_current_timeline()". Please, find the patch in attachment.
> 
> The current behavior is quite simple: 
> * if the cluster is in production, return ThisTimeLineID
> * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
> 
> This is really naive implementation. We should probably add some code around
> the startup process to gather and share general recovery stats. This would
> allow to fetch eg. the current recovery method, latest xlog file name restored
> from archives or streaming, its timeline, etc.
> 
> Any thoughts?

If replay is delayed behind a timeline switch point, the replay LSN and
receive/write/flush LSNs are on different timelines.  When the
replica has not reached the new timeline to which the already
received file belongs, the function returns a wrong file name,
specifically a name consisting of the latest segment number and
the older timeline to which the segment doesn't belong.

We have several LSN reporting functions, one for each purpose:

 pg_current_wal_lsn
 pg_current_wal_insert_lsn
 pg_current_wal_flush_lsn
 pg_last_wal_receive_lsn
 pg_last_wal_replay_lsn

But I'm not sure that just adding further pg_last_*_timeline() functions to
this list is a good thing...


The function returns NULL for NULL input (STRICT behavior) but
returns (NULL, NULL) for an undefined timeline. I don't think the
difference is meaningful.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Hi.
> 
> At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in <20190725193808.1648ddc8@firost>
> > On Wed, 24 Jul 2019 14:33:27 +0200
> > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> >   
> > > On Wed, 24 Jul 2019 09:49:05 +0900
> > > Michael Paquier <michael@paquier.xyz> wrote:
> > >   
> > > > On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
> > > > wrote:    
> > [...]  
> > > > I think that there are arguments for being more flexible with it, and
> > > > perhaps have a system-level view to be able to look at some of its
> > > > fields.    
> > > 
> > > Great idea. I'll give it a try to keep the discussion on.  
> > 
> > After some thinking, I did not find enough data to expose to justify the
> > creation a system-level view. As I just need the current timeline I
> > wrote "pg_current_timeline()". Please, find the patch in attachment.
> > 
> > The current behavior is quite simple: 
> > * if the cluster is in production, return ThisTimeLineID
> > * else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)
> > 
> > This is really naive implementation. We should probably add some code around
> > the startup process to gather and share general recovery stats. This would
> > allow to fetch eg. the current recovery method, latest xlog file name
> > restored from archives or streaming, its timeline, etc.
> > 
> > Any thoughts?  
> 
> If replay is delayed behind a timeline switch point, the replay LSN and
> receive/write/flush LSNs are on different timelines.  When the
> replica has not reached the new timeline to which the already
> received file belongs, the function returns a wrong file name,
> specifically a name consisting of the latest segment number and
> the older timeline to which the segment doesn't belong.

Indeed.

> We have an LSN reporting function each for several objectives.
> 
>  pg_current_wal_lsn
>  pg_current_wal_insert_lsn
>  pg_current_wal_flush_lsn
>  pg_last_wal_receive_lsn
>  pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

  pg_current_wal_tl: returns TL on a production cluster
  pg_last_wal_received_tl: returns last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
*flush_tl would be useful, as a cluster in production is not supposed to
change its timeline during its lifetime.

> But, I'm not sure just adding further pg_last_*_timeline() to
> this list is a good thing..

I think this is a much better idea than mixing different cases (production and
standby) in the same function as I did. Moreover, it's much more consistent
with the other existing functions.

> The function returns NULL for NULL input (STRICT behavior) but
> returns (NULL, NULL) for an undefined timeline. I don't think the
> difference is meaningful.

Unless I'm missing something, nothing
returns "(NULL, NULL)" in 0001-v1-Add-function-pg_current_timeline.patch.

Thank you for your feedback!



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]
> > We have an LSN reporting function each for several objectives.
> > 
> >  pg_current_wal_lsn
> >  pg_current_wal_insert_lsn
> >  pg_current_wal_flush_lsn
> >  pg_last_wal_receive_lsn
> >  pg_last_wal_replay_lsn  
> 
> Yes. In fact, my current implementation might be split as:
> 
>   pg_current_wal_tl: returns TL on a production cluster
>   pg_last_wal_received_tl: returns last received TL on a standby
> 
> If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
> *flush_tl would be useful as a cluster in production is not supposed to
> change its timeline during its lifetime.
> 
> > But, I'm not sure just adding further pg_last_*_timeline() to
> > this list is a good thing..  
> 
> I think this is a much better idea than mixing different case (production and
> standby) in the same function as I did. Moreover, it's much more coherent with
> other existing functions.

Please find in attachment a new version of the patch. It now creates two new
functions:

  pg_current_wal_tl()
  pg_last_wal_received_tl()

Regards,

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Fri, 26 Jul 2019 10:02:58 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> 
> > On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> [...]
> > > We have an LSN reporting function each for several objectives.
> > > 
> > >  pg_current_wal_lsn
> > >  pg_current_wal_insert_lsn
> > >  pg_current_wal_flush_lsn
> > >  pg_last_wal_receive_lsn
> > >  pg_last_wal_replay_lsn  
> > 
> > Yes. In fact, my current implementation might be split as:
> > 
> >   pg_current_wal_tl: returns TL on a production cluster
> >   pg_last_wal_received_tl: returns last received TL on a standby
> > 
> > If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
> > *flush_tl would be useful as a cluster in production is not supposed to
> > change its timeline during its lifetime.
> > 
> > > But, I'm not sure just adding further pg_last_*_timeline() to
> > > this list is a good thing..  
> > 
> > I think this is a much better idea than mixing different case (production
> > and standby) in the same function as I did. Moreover, it's much more
> > coherent with other existing functions.
> 
> Please, find in attachment a new version of the patch. It now creates two new
> fonctions: 
> 
>   pg_current_wal_tl()
>   pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Also, TimeLineID is declared as a uint32. So why do we use
PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
See e.g. pg_stat_get_wal_receiver().

Regards,

Attachment

Re: Fetching timeline during recovery

From
Fujii Masao
Date:
On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> On Fri, 26 Jul 2019 18:22:25 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
> > On Fri, 26 Jul 2019 10:02:58 +0200
> > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> >
> > > On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
> > > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > [...]
> > > > We have an LSN reporting function each for several objectives.
> > > >
> > > >  pg_current_wal_lsn
> > > >  pg_current_wal_insert_lsn
> > > >  pg_current_wal_flush_lsn
> > > >  pg_last_wal_receive_lsn
> > > >  pg_last_wal_replay_lsn
> > >
> > > Yes. In fact, my current implementation might be split as:
> > >
> > >   pg_current_wal_tl: returns TL on a production cluster
> > >   pg_last_wal_received_tl: returns last received TL on a standby
> > >
> > > If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
> > > *flush_tl would be useful as a cluster in production is not supposed to
> > > change its timeline during its lifetime.
> > >
> > > > But, I'm not sure just adding further pg_last_*_timeline() to
> > > > this list is a good thing..
> > >
> > > I think this is a much better idea than mixing different case (production
> > > and standby) in the same function as I did. Moreover, it's much more
> > > coherent with other existing functions.
> >
> > Please, find in attachment a new version of the patch. It now creates two new
> > fonctions:
> >
> >   pg_current_wal_tl()
> >   pg_last_wal_received_tl()
>
> I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
> Please find the corrected patch in attachment:
> 0001-v3-Add-functions-to-get-timeline.patch

Thanks for the patch! Here are some comments from me.

You need to write the documentation explaining the functions
that you're thinking of adding.

+/*
+ * Returns the current timeline on a production cluster
+ */
+Datum
+pg_current_wal_tl(PG_FUNCTION_ARGS)

The timeline ID that this function returns seems almost
the same as pg_control_checkpoint().timeline_id,
when the server is in production. So I'm not sure
if it's worth adding that new function.

+ currentTL = GetCurrentTimeLine();
+
+ PG_RETURN_INT32(currentTL);

Is GetCurrentTimeLine() really necessary? It seems ThisTimeLineID can be
returned directly, since it indicates the current timeline ID in production.

+pg_last_wal_received_tl(PG_FUNCTION_ARGS)
+{
+ TimeLineID lastReceivedTL;
+ WalRcvData *walrcv = WalRcv;
+
+ SpinLockAcquire(&walrcv->mutex);
+ lastReceivedTL = walrcv->receivedTLI;
+ SpinLockRelease(&walrcv->mutex);

I think that it's smarter to use GetWalRcvWriteRecPtr() to
get the last received TLI, like pg_last_wal_receive_lsn() does.

The timeline ID that this function returns is the same as
pg_stat_wal_receiver.received_tli while walreceiver is running.
But when walreceiver is not running, pg_stat_wal_receiver returns
no record, and pg_last_wal_received_tl() would be useful for getting
the timeline only in that case. Is my understanding right?

> Also, TimeLineID is declared as a uint32. So why do we use
> PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
> See eg. in pg_stat_get_wal_receiver().

pg_stat_wal_receiver.received_tli is declared as integer.

Regards,

-- 
Fujii Masao



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 4 Sep 2019 00:32:03 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > On Fri, 26 Jul 2019 18:22:25 +0200
> > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> >  
> > > On Fri, 26 Jul 2019 10:02:58 +0200
> > > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> > > Please, find in attachment a new version of the patch. It now creates two
> > > new fonctions:
> > >
> > >   pg_current_wal_tl()
> > >   pg_last_wal_received_tl()  
> >
> > I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
> > Please find the corrected patch in attachment:
> > 0001-v3-Add-functions-to-get-timeline.patch  
> 
> Thanks for the patch! Here are some comments from me.

Thank you for your review!

Please, find in attachment the v4 of the patch:
0001-v4-Add-functions-to-get-timeline.patch

Answers below.

> You need to write the documentation explaining the functions
> that you're thinking to add.

Done.

> +/*
> + * Returns the current timeline on a production cluster
> + */
> +Datum
> +pg_current_wal_tl(PG_FUNCTION_ARGS)
> 
> The timeline ID that this function returns seems almost
> the same as pg_control_checkpoint().timeline_id,
> when the server is in production. So I'm not sure
> if it's worth adding that new function.

pg_control_checkpoint().timeline_id is read from the controldata file on disk,
which is asynchronously updated with the real status of the local cluster.
Right after a promotion, fetching the TL from pg_control_checkpoint() is wrong
and can cause race conditions on the client side.

This is the main reason I am working on this patch.

> + currentTL = GetCurrentTimeLine();
> +
> + PG_RETURN_INT32(currentTL);
> 
> Is GetCurrentTimeLine() really necessary? Seems ThisTimeLineID can be
> returned directly since it indicates the current timeline ID in production.

Indeed. I might have over-focused on the memory state. ThisTimeLineID seems to
be updated soon enough during the promotion; in fact, even before
XLogCtl->ThisTimeLineID:

        if (ArchiveRecoveryRequested)
        {
        [...]
                ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
        [...]
        }

        /* Save the selected TimeLineID in shared memory, too */
        XLogCtl->ThisTimeLineID = ThisTimeLineID;

> +pg_last_wal_received_tl(PG_FUNCTION_ARGS)
> +{
> + TimeLineID lastReceivedTL;
> + WalRcvData *walrcv = WalRcv;
> +
> + SpinLockAcquire(&walrcv->mutex);
> + lastReceivedTL = walrcv->receivedTLI;
> + SpinLockRelease(&walrcv->mutex);
> 
> I think that it's smarter to use GetWalRcvWriteRecPtr() to
> get the last received TLI, like pg_last_wal_receive_lsn() does.

I hesitated between the current implementation and using GetWalRcvWriteRecPtr().
I chose the current implementation to avoid unnecessary operations while
holding the spinlock and to make it as fast as possible.

However, maybe I'm splitting hairs here in comparison to calling
GetWalRcvWriteRecPtr() and avoiding minor code duplication.

Still hesitant, v4 of the patch uses GetWalRcvWriteRecPtr() as suggested.

> The timeline ID that this function returns is the same as
> pg_stat_wal_receiver.received_tli while walreceiver is running.
> But when walreceiver is not running, pg_stat_wal_receiver returns
> no record, and pg_last_wal_received_tl() would be useful to
> get the timeline only in this case. Is this my understanding right?

Exactly.
 
> > Also, TimeLineID is declared as a uint32. So why do we use
> > PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
> > See eg. in pg_stat_get_wal_receiver().  
> 
> pg_stat_wal_receiver.received_tli is declared as integer.

Oh, right. Thank you.

Thanks,

Attachment

Re: Fetching timeline during recovery

From
Fujii Masao
Date:
On Sat, Sep 7, 2019 at 12:06 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> On Wed, 4 Sep 2019 00:32:03 +0900
> Fujii Masao <masao.fujii@gmail.com> wrote:
>
> > On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:
> > >
> > > On Fri, 26 Jul 2019 18:22:25 +0200
> > > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> > >
> > > > On Fri, 26 Jul 2019 10:02:58 +0200
> > > > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> [...]
> > > > Please, find in attachment a new version of the patch. It now creates two
> > > > new fonctions:
> > > >
> > > >   pg_current_wal_tl()
> > > >   pg_last_wal_received_tl()
> > >
> > > I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
> > > Please find the corrected patch in attachment:
> > > 0001-v3-Add-functions-to-get-timeline.patch
> >
> > Thanks for the patch! Here are some comments from me.
>
> Thank you for your review!
>
> Please, find in attachment the v4 of the patch:
> 0001-v4-Add-functions-to-get-timeline.patch

Thanks for updating the patch!

Should we add regression tests for these functions? For example,
what about using these functions to check the timeline switch case,
in src/test/recovery/t/004_timeline_switch.pl?

>
> Answers bellow.
>
> > You need to write the documentation explaining the functions
> > that you're thinking to add.
>
> Done.

Thanks!

+       <entry>Get current write-ahead log timeline</entry>

I'm not sure if "current write-ahead log timeline" is the proper wording.
Isn't "timeline ID of current write-ahead log" more appropriate?

+       <entry><type>int</type></entry>
+       <entry>Get last write-ahead log timeline received and sync to disk by
+        streaming replication.

Same as above. I think that "timeline ID of last write-ahead log received
and sync to disk ..." is better here.

Like pg_last_wal_receive_lsn(), shouldn't something like "If recovery has
completed this will remain static at the value of the last WAL
record received and synced to disk during recovery.
If streaming replication is disabled, or if it has not yet started,
the function returns NULL." be in this description?

>
> > +/*
> > + * Returns the current timeline on a production cluster
> > + */
> > +Datum
> > +pg_current_wal_tl(PG_FUNCTION_ARGS)

I think that "tl" in the function names should be "tli". "tli" is used
for other functions and views related to timelines, e.g.,
pg_stat_wal_receiver.received_tli. Thoughts?

> >
> > The timeline ID that this function returns seems almost
> > the same as pg_control_checkpoint().timeline_id,
> > when the server is in production. So I'm not sure
> > if it's worth adding that new function.
>
> pg_control_checkpoint().timeline_id is read from the controldata file on disk
> which is asynchronously updated with the real status of the local cluster.
> Right after a promotion, fetching the TL from pg_control_checkpoint() is wrong
> and can cause race conditions on client side.

Understood.

> > The timeline ID that this function returns is the same as
> > pg_stat_wal_receiver.received_tli while walreceiver is running.
> > But when walreceiver is not running, pg_stat_wal_receiver returns
> > no record, and pg_last_wal_received_tl() would be useful to
> > get the timeline only in this case. Is this my understanding right?
>
> Exactly.

I'm just imagining that some users want to use pg_last_wal_receive_lsn() and
pg_last_wal_receive_tli() together to, e.g., get the name of the WAL file received
last. But there can be a corner case where the return values of
pg_last_wal_receive_lsn() and of pg_last_wal_receive_tli() are inconsistent.
This can happen because those values are NOT obtained within a single lock.
That is, each function takes its own lock to get its value.

So, to avoid that corner case and get a consistent WAL file name,
we might want to have a function that gets both the LSN and the
timeline ID of the last received WAL record within a single lock
(i.e., just uses GetWalRcvWriteRecPtr()) and returns them.
Thought?
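A minimal sketch of that idea in plain C, using a pthread mutex to stand in for the spinlock protecting PostgreSQL's walreceiver shared state (the struct layout and function name here are illustrative, not the actual WalRcvData or GetWalRcvWriteRecPtr() definitions):

```c
#include <pthread.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Stand-in for the walreceiver shared state: both fields live behind
 * one lock, so readers can take a mutually consistent snapshot. */
typedef struct WalRcvState
{
    pthread_mutex_t mutex;
    XLogRecPtr      receivedUpto;   /* last LSN received and synced */
    TimeLineID      receivedTLI;    /* timeline of that LSN */
} WalRcvState;

/* Read the LSN and its timeline under a single lock acquisition, so
 * the returned pair can never mix values from two different flushes. */
static void
get_last_received(WalRcvState *rcv, XLogRecPtr *lsn, TimeLineID *tli)
{
    pthread_mutex_lock(&rcv->mutex);
    *lsn = rcv->receivedUpto;
    *tli = rcv->receivedTLI;
    pthread_mutex_unlock(&rcv->mutex);
}
```

Two separate SQL functions, each taking the lock on its own, leave a window between the two reads in which the walreceiver can advance both fields; a single function shaped like the one above closes that window.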

Regards,

-- 
Fujii Masao



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Mon, 9 Sep 2019 19:44:10 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Sat, Sep 7, 2019 at 12:06 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > On Wed, 4 Sep 2019 00:32:03 +0900
> > Fujii Masao <masao.fujii@gmail.com> wrote:
> >  
[...]
> Thanks for updating the patch!

Thank you for your review!

Please find in attachment a new version of the patch.

  0001-v5-Add-facilities-to-fetch-real-timeline-from-SQL.patch

> Should we add regression tests for these functions? For example,
> what about using these functions to check the timeline switch case,
> in src/test/recovery/t/004_timeline_switch.pl?

Indeed, I added 6 tests to this file.

> [...] 

Thank you for all other suggestions. They all make sense for v4 of the patch.
However, I removed pg_current_wal_tl() and pg_last_wal_received_tl() to explore
a patch paying attention to your next comment.

> I'm just imaging that some users want to use pg_last_wal_receive_lsn() and
> pg_last_wal_receive_tli() together to, e.g., get the name of WAL file received
> last. But there can be a corner case where the return values of
> pg_last_wal_receive_lsn() and of pg_last_wal_receive_tli() are inconsistent.
> This can happen because those values are NOT gotten within single lock.
> That is, each function takes each lock to get each value.
> 
> So, to avoid that corner case and get consistent WAL file name,
> we might want to have the function that gets both LSN and
> timeline ID of the last received WAL record within single lock
> (i.e., just uses GetWalRcvWriteRecPtr()) and returns them.
> Thought?

You are right.

So either I add some new functions or I overload the existing ones.

I was not convinced by adding two new functions very close to pg_current_wal_lsn
and pg_last_wal_receive_lsn but with slightly different names (e.g. suffixed
with _tli?).

I chose to overload pg_current_wal_lsn and pg_last_wal_receive_lsn as
pg_current_wal_lsn(with_tli bool) and pg_last_wal_receive_lsn(with_tli bool).

Both functions return the record (lsn pg_lsn, timeline int4). If with_tli is
NULL or false, the timeline field is NULL.

Documentation is updated to reflect this.

Thoughts?

If this solution is accepted, some other function of the same family might be
good candidates as well, for the sake of homogeneity:

* pg_current_wal_insert_lsn
* pg_current_wal_flush_lsn
* pg_last_wal_replay_lsn

However, I'm not sure how useful this would be.

Thanks again for your time, suggestions and review!

Regards,

Attachment

Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Thu, Sep 26, 2019 at 07:20:46PM +0200, Jehan-Guillaume de Rorthais wrote:
> If this solution is accepted, some other function of the same family might be
> good candidates as well, for the sake of homogeneity:
>
> * pg_current_wal_insert_lsn
> * pg_current_wal_flush_lsn
> * pg_last_wal_replay_lsn
>
> However, I'm not sure how useful this would be.
>
> Thanks again for your time, suggestions and review!

+{ oid => '3435', descr => 'current wal flush location',
+  proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
proisstrict => 'f',
This description is incorrect.

And please use OIDs in the range of 8000~9999 for patches in
development.  You could just use src/include/catalog/unused_oids which
would point out a random range.

+   if (recptr == 0) {
+       nulls[0] = 1;
+       nulls[1] = 1;
+   }
The indentation of the code is incorrect, these should use actual
booleans and recptr should be InvalidXLogRecPtr (note also the
existence of the macro XLogRecPtrIsInvalid).  Just for the style.
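A sketch of how the fragment might look once adjusted to those remarks, written as standalone C with stand-in typedefs for the macros from access/xlogdefs.h (so this is illustrative, not the actual patch code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-ins for the PostgreSQL macros. */
#define InvalidXLogRecPtr       ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r)  ((r) == InvalidXLogRecPtr)

/* Mark both output columns NULL when no WAL has been received yet,
 * using actual booleans and XLogRecPtrIsInvalid() instead of "== 0",
 * with PostgreSQL's brace-on-its-own-line style. */
static void
set_result_nulls(XLogRecPtr recptr, bool *nulls)
{
    if (XLogRecPtrIsInvalid(recptr))
    {
        nulls[0] = true;
        nulls[1] = true;
    }
    else
    {
        nulls[0] = false;
        nulls[1] = false;
    }
}
```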

As said in the last emails exchanged on this thread, I don't see how
you can avoid using multiple functions with different meanings
depending on whether the cluster is a primary or a standby, knowing that we
have two different concepts of WAL during recovery: the received
LSN and the replayed LSN, and three concepts for primaries (insert,
current, flush).  I agree as well with the point of Fujii-san about
not returning the TLI and the LSN across different functions as this
opens the door for a risk of inconsistency for the data received by
the client.

+ * When the first parameter (variable 'with_tli') is true, returns the current
+ * timeline as second field. If false, second field is null.
I don't see much point in having this input parameter
determine the NULL-ness of one of the result columns, and I think
that you had better use a completely different function name for each
of them instead of overloading the existing functions.  Let's remember that a
lot of tools use the existing functions directly in the SELECT clause
for LSN calculations, which is just a 64-bit integer *without* a
timeline assigned to it.  However your patch mixes both concepts in
pg_current_wal_lsn.

So we could do more with the introduction of five new functions which
allow grabbing the LSN and the TLI in use for the replay, receive, insert,
write and flush positions:
- pg_current_wal_flush_info
- pg_current_wal_insert_info
- pg_current_wal_info
- pg_last_wal_receive_info
- pg_last_wal_replay_info

I would actually be tempted to do the following: one single SRF
function, say pg_wal_info, which takes a text argument in input with
the following values: flush, write, insert, receive, replay.  Thinking
more about it, that would be rather neat, and more extensible than the
rest discussed until now.  See for example PostgresNode::lsn.
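The mode-dispatch part of such a function could be as simple as mapping the text argument onto an enum before grabbing the matching LSN/TLI pair under one lock. A sketch (the names are illustrative, not from any posted patch):

```c
#include <string.h>

/* The five WAL positions a pg_wal_info(text) function would accept. */
typedef enum WalInfoMode
{
    WAL_INFO_FLUSH,
    WAL_INFO_WRITE,
    WAL_INFO_INSERT,
    WAL_INFO_RECEIVE,
    WAL_INFO_REPLAY,
    WAL_INFO_INVALID
} WalInfoMode;

/* Map the function's text argument to a mode.  The real SQL function
 * would raise an ERROR on an unknown string; here we return a sentinel. */
static WalInfoMode
parse_wal_info_mode(const char *arg)
{
    if (strcmp(arg, "flush") == 0)
        return WAL_INFO_FLUSH;
    if (strcmp(arg, "write") == 0)
        return WAL_INFO_WRITE;
    if (strcmp(arg, "insert") == 0)
        return WAL_INFO_INSERT;
    if (strcmp(arg, "receive") == 0)
        return WAL_INFO_RECEIVE;
    if (strcmp(arg, "replay") == 0)
        return WAL_INFO_REPLAY;
    return WAL_INFO_INVALID;
}
```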
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> I would be actually tempted to do the following: one single SRF
> function, say pg_wal_info which takes a text argument in input with
> the following values: flush, write, insert, receive, replay.  Thinking
> more about it that would be rather neat, and more extensible than the
> rest discussed until now.  See for example PostgresNode::lsn.

I've not followed this discussion very closely but I agree entirely that
it's really nice to be able to query the timeline in a more timely
manner than asking through pg_control_checkpoint() gives you.

I'm not sure about adding a text argument to such a function though, I
would think you'd either have multiple rows if it's an SRF that gives
you the information on each row and allows a user to filter with a WHERE
clause, or do something like what pg_stat_replication has and just have
a bunch of columns.

Given that we've already gone with the "bunch of columns" approach
elsewhere, it seems like that approach would be more consistent.

Thanks,

Stephen

Attachment

Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Wed, Dec 11, 2019 at 10:16:29AM -0500, Stephen Frost wrote:
> I've not followed this discussion very closely but I agree entirely that
> it's really nice to have the timeline be able to be queried in a more
> timely manner than asking through pg_control_checkpoint() gives you.
>
> I'm not sure about adding a text argument to such a function though, I
> would think you'd either have multiple rows if it's an SRF that gives
> you the information on each row and allows a user to filter with a WHERE
> clause, or do something like what pg_stat_replication has and just have
> a bunch of columns.

With a NULL added for the values which cannot be defined then, like
trying to use the function on a primary for the fields which can only
show up at recovery?  That would be possible, still my heart tells me
that a function returning one row is a more natural approach for
this stuff.  I may just be too used to what we have in the TAP
tests, though.
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Wed, Dec 11, 2019 at 10:16:29AM -0500, Stephen Frost wrote:
> > I've not followed this discussion very closely but I agree entirely that
> > it's really nice to have the timeline be able to be queried in a more
> > timely manner than asking through pg_control_checkpoint() gives you.
> >
> > I'm not sure about adding a text argument to such a function though, I
> > would think you'd either have multiple rows if it's an SRF that gives
> > you the information on each row and allows a user to filter with a WHERE
> > clause, or do something like what pg_stat_replication has and just have
> > a bunch of columns.
>
> With a NULL added for the values which cannot be defined then, like
> trying to use the function on a primary for the fields which can only
> show up at recovery?

Sure, the function would only return those values that make sense for
the state that the system is in.

> That would be possible, still my heart tells me
> that a function returning one row is a more natural approach for
> this stuff.  I may just be too used to what we have in the TAP
> tests, though.

I'm confused- wouldn't the above approach be a function that's returning
only one row, if you had a bunch of columns and then had NULL values for
those cases that didn't apply..?  Or, if you were thinking about the SRF
approach that you suggested, you could use a WHERE clause to make it
only one row...  Though I can see how it's nicer to just have one row in
some cases which is why I was suggesting the "bunch of columns"
approach.

Thanks,

Stephen

Attachment

Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Wed, Dec 11, 2019 at 10:45:25AM -0500, Stephen Frost wrote:
> I'm confused- wouldn't the above approach be a function that's returning
> only one row, if you had a bunch of columns and then had NULL values for
> those cases that didn't apply..?  Or, if you were thinking about the SRF
> approach that you suggested, you could use a WHERE clause to make it
> only one row...  Though I can see how it's nicer to just have one row in
> some cases which is why I was suggesting the "bunch of columns"
> approach.

Oh, sorry.  I see the confusion now and that's my fault.  In
https://www.postgresql.org/message-id/20191211052002.GK72921@paquier.xyz
I mentioned a SRF function which takes an input argument, but that
makes no sense.  What I would prefer having is just having one
function, returning one row (LSN, TLI), using in input one argument to
extract the WAL information the caller wants with five possible cases
(write, insert, flush, receive, replay).

Then, what you are referring to is one function which returns all
(LSN,TLI) for the five cases (write, insert, etc.), so it would return
one row with 10 columns, with NULL mapping to the values which have no
meaning (like replay on a primary).

And on top of that we have a third possibility: one SRF function
returning 5 rows with three attributes (mode, LSN, TLI), where mode
corresponds to one value in the set {write, insert, etc.}.

I actually prefer the first one, and you mentioned the second.  But
there could be a point in doing the third one.  An advantage of the
second and third ones is that you may be able to get a consistent view
of all the data, but it means holding locks to look at the values a
bit longer.  Let's see what others think.
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 11 Dec 2019 14:20:02 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Sep 26, 2019 at 07:20:46PM +0200, Jehan-Guillaume de Rorthais wrote:
> > If this solution is accepted, some other function of the same family might
> > be good candidates as well, for the sake of homogeneity:
> > 
> > * pg_current_wal_insert_lsn
> > * pg_current_wal_flush_lsn
> > * pg_last_wal_replay_lsn
> > 
> > However, I'm not sure how useful this would be.
> > 
> > Thanks again for your time, suggestions and review!  
> 
> +{ oid => '3435', descr => 'current wal flush location',
> +  proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
> proisstrict => 'f',
> This description is incorrect.

Indeed. And the one for pg_current_wal_lsn(bool) as well.

> And please use OIDs in the range of 8000~9999 for patches in
> development.  You could just use src/include/catalog/unused_oids which
> would point out a random range.

Thank you for this information, I wasn't aware.

> +   if (recptr == 0) {
> +       nulls[0] = 1;
> +       nulls[1] = 1;
> +   }
> The indentation of the code is incorrect, these should use actual
> booleans and recptr should be InvalidXLogRecPtr (note also the
> existence of the macro XLogRecPtrIsInvalid).  Just for the style.

Fixed on my side. Thanks.

> As said in the last emails exchanged on this thread, I don't see how
> you can avoid using multiple functions with different meanings
> depending on whether the cluster is a primary or a standby, knowing
> that we have two different concepts of WAL during recovery: the
> received LSN and the replayed LSN, and three concepts for primaries
> (insert, current, flush).  

As I wrote in my previous email, existing functions could be overloaded
as well for the sake of homogeneity. So the five of them would have similar
behavior/API.

> I agree as well with the point of Fujii-san about
> not returning the TLI and the LSN across different functions as this
> opens the door for a risk of inconsistency for the data received by
> the client.

My last patch fixed that, indeed.

> + * When the first parameter (variable 'with_tli') is true, returns the
> current
> + * timeline as second field. If false, second field is null.
> I don't see much the point of having this input parameter which
> determines the NULL-ness of one of the result columns, and I think
> that you had better use a completely different function name for each
> one of them instead of enforcing the functions.  Let's remember that a
> lot of tools use the existing functions directly in the SELECT clause
> for LSN calculations, which is just a 64-bit integer *without* a
> timeline assigned to it.  However your patch mixes both concepts by
> using pg_current_wal_lsn.

Sorry, I realize I was not clear enough about implementation details.
My latest patch does **not** introduce a regression for existing tools. If you do
not pass any parameter, the behavior is the same: only one column:

  # primary
  $ cat <<EOQ|psql -XAtp 5432
  select * from pg_current_wal_lsn();
  select * from pg_current_wal_lsn(NULL);
  select * from pg_current_wal_lsn(true);
  EOQ
  0/15D5BA0
  0/15D5BA0|
  0/15D5BA0|1

  # secondary
  $ cat <<EOQ|psql -XAtp 5433
  select * from pg_last_wal_receive_lsn();
  select * from pg_last_wal_receive_lsn(NULL);
  select * from pg_last_wal_receive_lsn(true);
  EOQ
  0/15D5BA0
  0/15D5BA0|
  0/15D5BA0|1
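With the LSN and TLI returned together, a client can rebuild the WAL segment file name itself, the same way PostgreSQL's XLogFileName() macro does. A sketch assuming the default 16MB segment size (walfile_name is a hypothetical client-side helper, not a PostgreSQL function):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Build the 24-character WAL segment file name for an LSN on a given
 * timeline: 8 hex digits each for the TLI, the "log" id (high half of
 * the segment number) and the "seg" number (low half). */
static void
walfile_name(char *buf, size_t buflen, TimeLineID tli,
             XLogRecPtr lsn, uint64_t wal_segsz)
{
    uint64_t segno = lsn / wal_segsz;
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_xlogid),
             (unsigned) (segno % segs_per_xlogid));
}
```

For the values shown above (0/15D5BA0 on timeline 1), this yields 000000010000000000000001.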

It's kind of the same approach as when parameters were added to
e.g. pg_stop_backup() to change its behavior between exclusive and
non-exclusive backups. But I admit I know of no function changing its return
type based on the given parameter. I understand your concern.

> So we could do more with the introduction of five new functions which 
> allow to grab the LSN and the TLI in use for replay, received, insert,
> write and flush positions:
> - pg_current_wal_flush_info
> - pg_current_wal_insert_info
> - pg_current_wal_info
> - pg_last_wal_receive_info
> - pg_last_wal_replay_info

I could go this way if you prefer, maybe using _tli as the suffix instead of _info,
as this is the only new info added. I think it feels redundant with the original
funcs, but it might be the simplest solution.

> I would be actually tempted to do the following: one single SRF
> function, say pg_wal_info which takes a text argument in input with
> the following values: flush, write, insert, receive, replay.  Thinking
> more about it that would be rather neat, and more extensible than the
> rest discussed until now.  See for example PostgresNode::lsn.

I'll answer in your other mail that summary other possibilities.

Thanks!



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 13 Dec 2019 16:12:55 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Wed, Dec 11, 2019 at 10:45:25AM -0500, Stephen Frost wrote:
> > I'm confused- wouldn't the above approach be a function that's returning
> > only one row, if you had a bunch of columns and then had NULL values for
> > those cases that didn't apply..?  Or, if you were thinking about the SRF
> > approach that you suggested, you could use a WHERE clause to make it
> > only one row...  Though I can see how it's nicer to just have one row in
> > some cases which is why I was suggesting the "bunch of columns"
> > approach.  
> 
> Oh, sorry.  I see the confusion now and that's my fault.  In
> https://www.postgresql.org/message-id/20191211052002.GK72921@paquier.xyz
> I mentioned a SRF function which takes an input argument, but that
> makes no sense.  What I would prefer having is just having one
> function, returning one row (LSN, TLI), using in input one argument to
> extract the WAL information the caller wants with five possible cases
> (write, insert, flush, receive, replay).

It looks odd next to the other five existing functions of the same family,
which lack the tli. And this user interaction with an admin function is quite
different from what we are used to with other admin funcs. But mostly, when I
think of such a function, I keep thinking this parameter should be a WHERE
clause applied to an SRF function.

-1

> Then, what you are referring to is one function which returns all
> (LSN,TLI) for the five cases (write, insert, etc.), so it would return
> one row with 10 columns, with NULL mapping to the values which have no
> meaning (like replay on a primary).

This would look like some other pg_stat_* functions, e.g. pg_stat_get_archiver.
I'm OK with this. It could even be turned into a catalog view.

However, what's the point of gathering all the values, e.g. from a production
cluster? Is it really useful to compare the current/insert/flush LSNs from the
WAL writer?

It's easier to answer from a standby point of view as the lag between received
and replayed might be interesting to report in various situations.

> And on top of that we have a third possibility: one SRF function
> returning 5 rows with three attributes (mode, LSN, TLI), where mode
> corresponds to one value in the set {write, insert, etc.}.

I prefer the second one. Just select the field(s) you need, no need for a WHERE
clause, similar to some other stats functions.

-1


As a fourth possibility (I explained my last implementation badly, so I still
hope we can keep it in the loop here): just overload the existing functions
with ones that take a boolean as parameter and add the TLI as a second field,
e.g.:

        Name       | Result type  | Argument data types
-------------------+--------------+-------------------------------------------
pg_current_wal_lsn | pg_lsn       |
pg_current_wal_lsn | SETOF record | with_tli bool, OUT lsn pg_lsn, OUT tli int


And the fifth one, implementing brand new functions:

 pg_current_wal_lsn_tli
 pg_current_wal_insert_lsn_tli
 pg_current_wal_flush_lsn_tli
 pg_last_wal_receive_lsn_tli
 pg_last_wal_replay_lsn_tli

> I actually prefer the first one, and you mentioned the second.  But
> there could be a point in doing the third one.  An advantage of the
> second and third ones is that you may be able to get a consistent view
> of all the data, but it means holding locks to look at the values a
> bit longer.  Let's see what others think.

I like the fourth one, but I was not able to return only one field when the given
parameter is false or NULL. Giving false as argument to these funcs has no
meaning compared to the original ones without an arg. I ended up with this
solution because I was worried about adding five more funcs really close to
some existing ones.

Fifth one is more consistent with what we already have.

Thanks again.

Regards,



Re: Fetching timeline during recovery

From
Kyotaro Horiguchi
Date:
At Fri, 20 Dec 2019 00:35:19 +0100, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> On Fri, 13 Dec 2019 16:12:55 +0900
> Michael Paquier <michael@paquier.xyz> wrote:

The first one;

> > I mentioned a SRF function which takes an input argument, but that
> > makes no sense.  What I would prefer having is just having one
> > function, returning one row (LSN, TLI), using in input one argument to
> > extract the WAL information the caller wants with five possible cases
> > (write, insert, flush, receive, replay).
> 
> It looks odd when we look at other five existing functions of the same family
> but without the tli. And this user interaction with admin function is quite
> different of what we are used to with other admin funcs. But mostly, when I
> think of such function, I keep thinking this parameter should be a WHERE
> clause after a SRF function.
> 
> -1

It is related to the third one; it may be annoying that the case names
cannot get the aid of psql completion...


The second one;

> > Then, what you are referring to is one function which returns all
> > (LSN,TLI) for the five cases (write, insert, etc.), so it would return
> > one row with 10 columns, with NULL mapping to the values which have no
> > meaning (like replay on a primary).
> 
> This would looks like some other pg_stat_* functions, eg. pg_stat_get_archiver.
> I'm OK with this. This could even be turned as a catalog view.
> 
> However, what's the point of gathering all the values eg from a production
> cluster? Is it really useful to compare current/insert/flush LSN from wal
> writer?

There is a period where pg_controldata shows the previous TLI after
promotion. It's useful if we can read the up-to-date TLI from a live
standby. I thought that this project was for that case...

> It's easier to answer from a standby point of view as the lag between received
> and replayed might be interesting to report in various situations.


The third one;

> > And on top of that we have a third possibility: one SRF function
> > returning 5 rows with three attributes (mode, LSN, TLI), where mode
> > corresponds to one value in the set {write, insert, etc.}.
> 
> I prefer the second one. Just select the field(s) you need, no need WHERE
> clause, similar to some other stats function.
> 
> -1

It might be clean in a sense, but I can't come up with a case where
the format is useful...

Anyway, the same as with the first one, the case names (write, insert,
flush, receive, replay) come from two different machineries, and
showing them in a row could be confusing.


> As a fourth possibility, as I badly explained my last implementation details, I
> still hope we can keep it in the loop here. Just overload existing functions
> with ones that takes a boolean as parameter and add the TLI as a second field,
> eg.:
> 
>         Name       | Result type  | Argument data types
> -------------------+--------------+-------------------------------------------
> pg_current_wal_lsn | pg_lsn       |
> pg_current_wal_lsn | SETOF record | with_tli bool, OUT lsn pg_lsn, OUT tli int

I prefer this one, in the sense of similarity with existing functions.

> And the fifth one, implementing brand new functions:
> 
>  pg_current_wal_lsn_tli
>  pg_current_wal_insert_lsn_tli
>  pg_current_wal_flush_lsn_tli
>  pg_last_wal_receive_lsn_tli
>  pg_last_wal_replay_lsn_tli

Mmmmm.... We should remove the existing ones instead? (Of course we don't,
though.)

> > I actually prefer the first one, and you mentioned the second.  But
> > there could be a point in doing the third one.  An advantage of the
> > second and third ones is that you may be able to get a consistent view
> > of all the data, but it means holding locks to look at the values a
> > bit longer.  Let's see what others think.
> 
> I like the fourth one, but I was not able to return only one field if given
> parameter is false or NULL. Giving false as argument to these funcs has no
> meaning compared to the original one without arg. I end up with this solution
> because I was worried about adding five more funcs really close to some
> existing one.

Right. It is a restriction of polymorphic functions. It is the same
relationship as between pg_stop_backup() and pg_stop_backup(true).

> Fifth one is more consistent with what we already have.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 20 Dec 2019 13:41:25 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Fri, 20 Dec 2019 00:35:19 +0100, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in 
> > On Fri, 13 Dec 2019 16:12:55 +0900
> > Michael Paquier <michael@paquier.xyz> wrote:  
> 
> The first one;
> 
> > > I mentioned a SRF function which takes an input argument, but that
> > > makes no sense.  What I would prefer having is just having one
> > > function, returning one row (LSN, TLI), using in input one argument to
> > > extract the WAL information the caller wants with five possible cases
> > > (write, insert, flush, receive, replay).  
> > 
> > It looks odd when we look at other five existing functions of the same
> > family but without the tli. And this user interaction with admin function
> > is quite different of what we are used to with other admin funcs. But
> > mostly, when I think of such function, I keep thinking this parameter
> > should be a WHERE clause after a SRF function.
> > 
> > -1  
> 
> It is related to the third one; it may be annoying that the case names
> cannot get the aid of psql completion...

indeed.

> The second one;
> 
> > > Then, what you are referring to is one function which returns all
> > > (LSN,TLI) for the five cases (write, insert, etc.), so it would return
> > > one row with 10 columns, with NULL mapping to the values which have no
> > > meaning (like replay on a primary).  
> > 
> > This would looks like some other pg_stat_* functions, eg.
> > pg_stat_get_archiver. I'm OK with this. This could even be turned as a
> > catalog view.
> > 
> > However, what's the point of gathering all the values eg from a production
> > cluster? Is it really useful to compare current/insert/flush LSN from wal
> > writer?  
> 
> There is a period where pg_controldata shows the previous TLI after
> promotion. It's useful if we can read the up-to-date TLI from live
> standby. I thought that this project is for that case..

I was not asking about the usefulness of LSN+TLI itself.
I was wondering about the use case for gathering all 6 columns (current+tli,
insert+tli and flush+tli) from a production/primary cluster.

[...]
> > As a fourth possibility, as I badly explained my last implementation
> > details, I still hope we can keep it in the loop here. Just overload
> > existing functions with ones that takes a boolean as parameter and add the
> > TLI as a second field, eg.:
> > 
> >         Name       | Result type  | Argument data types
> > -------------------+--------------+-------------------------------------------
> > pg_current_wal_lsn | pg_lsn       |
> > pg_current_wal_lsn | SETOF record | with_tli bool, OUT lsn pg_lsn, OUT tli
> > int  
> 
> I prefer this one, in the sense of similarity with existing functions.

thanks

> > And the fifth one, implementing brand new functions:
> > 
> >  pg_current_wal_lsn_tli
> >  pg_current_wal_insert_lsn_tli
> >  pg_current_wal_flush_lsn_tli
> >  pg_last_wal_receive_lsn_tli
> >  pg_last_wal_replay_lsn_tli  
> 
> Mmmmm.... We should remove the existing ones instead? (Of course we don't,
> though.)

Yes, that would be great but sadly, it would introduce a regression in various
tools relying on them. At least the ones doing "select *" or, most
probably, "select func()".

But anyway, adding 5 funcs is not a big deal either. Too bad they are so close
to existing ones, though.

> > > I actually prefer the first one, and you mentioned the second.  But
> > > there could be a point in doing the third one.  An advantage of the
> > > second and third ones is that you may be able to get a consistent view
> > > of all the data, but it means holding locks to look at the values a
> > > bit longer.  Let's see what others think.  
> > 
> > I like the fourth one, but I was not able to return only one field if given
> > parameter is false or NULL. Giving false as argument to these funcs has no
> > meaning compared to the original one without arg. I end up with this
> > solution because I was worried about adding five more funcs really close to
> > some existing one.  
> 
> Right. It is a restriction of polymorphic functions. It is in the same
> relation with pg_stop_backup() and pg_stop_backup(true).

indeed.




Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Fri, Dec 20, 2019 at 11:14:28AM +0100, Jehan-Guillaume de Rorthais wrote:
> Yes, that would be great but sadly, it would introduce a regression in
> various tools relying on them. At least the ones doing "select *" or most
> probably "select func()".
>
> But anyway, adding 5 funcs is not a big deal either. Too bad they are so close
> to existing ones though.

Consistency of the data matters a lot if we want to build reliable
tools on top of them in case someone would like to compare the various
modes, and using different functions for those fields creates locking
issues (somewhat the point of Fujii-san upthread?).  If nobody likes
the approach of one function, returning one row, taking in input the
mode wanted, then I would not really object to Stephen's idea on the
matter about having a multi-column function returning one row.

>> Right. It is a restriction of polymorphic functions. It is in the same
>> relation with pg_stop_backup() and pg_stop_backup(true).

(pg_current_wal_lsn & co talk about LSNs, not TLIs).
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Mon, 23 Dec 2019 12:36:56 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Dec 20, 2019 at 11:14:28AM +0100, Jehan-Guillaume de Rorthais wrote:
> > Yes, that would be great but sadly, it would introduce a regression on
> > various tools relying on them. At least, the one doing "select *" or most
> > probably "select func()".
> >
> > But anyway, adding 5 funcs is not a big deal neither. Too bad they are so
> > close to existing ones though.
>
> Consistency of the data matters a lot if we want to build reliable
> tools on top of them in case someone would like to compare the various
> modes, and using different functions for those fields creates locking
> issues (somewhat the point of Fujii-san upthread?).

To sum up: the original patch was about fetching the current timeline of a
standby from memory without relying on the asynchronous controlfile or
pg_stat_get_wal_receiver() which only shows data when the wal_receiver is
running.

Fujii-san was pointing out we must fetch both the received LSN and its timeline
with the same lock so they are consistent.

Michael is now discussing fetching multiple LSNs and their timelines,
while keeping them consistent, e.g. received+tli and applied+tli. Thank you for
pointing this out.

I thought about various ways to deal with this concern and would like to
discuss/defend a new option based on the existing pg_stat_get_wal_receiver()
function. The only problem I'm facing with this function is that it returns
an all-NULL record if no wal receiver is active.

My idea would be to return a row from pg_stat_get_wal_receiver() as soon as
a wal receiver has been replicating during the uptime of the standby, no
matter if one is currently working or not. If no wal receiver is active,
the "pid" field would be NULL and the "status" would report e.g. "inactive".
All other fields would report their last known values as they are kept in
the shared memory WalRcv struct.

From the monitoring and HA point of view, we are now able to know that a wal
receiver existed, the LSN at which it stopped and on what timeline, all
consistent under the same lock. That answers my original goal. We could extend
this with two more fields for the replayed LSN and timeline to address
Michael's latest concern if we decide it's really needed (and I think it's a
valid concern for e.g. monitoring tools).

There's some more potential discussion about the pg_stat_wal_receiver view,
which relies on pg_stat_get_wal_receiver(). My proposal does not introduce a
regression there, as the view already filters out NULL data using "WHERE s.pid
IS NOT NULL". But:

 1. we could decide to remove this filter to expose the data even when no wal
    receiver is active. It's the same behavior as the pg_stat_subscription
    view. It could introduce a regression from the tools' point of view, but
    adds some useful information. I would vote 0 for it.
 2. we could extend it with new replayed lsn/tli fields. I would vote +1 for
    it.

On the "dark" side of this proposal, we do not deal with the primary side. We
still have no way to fetch various LSNs+TLIs from the WAL writer. However, I
included pg_current_wal_lsn_tl() in my original patch only for homogeneity
reasons, and the discussion slipped to this side while paying attention to the
user-facing function logic and homogeneity. If this discussion decides this is
a useful feature, I think it could be addressed in another patch (and I
volunteer to deal with it).

Below is a summary of this 6th proposal with examples. When the wal receiver
has never started (same as today):

  -[ RECORD 1 ]---------+--
  pid                   | Ø
  status                | Ø
  receive_start_lsn     | Ø
  receive_start_tli     | Ø
  received_lsn          | Ø
  received_tli          | Ø
  last_msg_send_time    | Ø
  last_msg_receipt_time | Ø
  latest_end_lsn        | Ø
  latest_end_time       | Ø
  slot_name             | Ø
  sender_host           | Ø
  sender_port           | Ø
  conninfo              | Ø

When wal receiver is active:

  $ select * from  pg_stat_get_wal_receiver();
  -[ RECORD 1 ]---------+-----------------------------
  pid                   | 8576
  status                | streaming
  receive_start_lsn     | 0/4000000
  receive_start_tli     | 1
  received_lsn          | 0/4000148
  received_tli          | 1
  last_msg_send_time    | 2019-12-23 12:28:52.588738+01
  last_msg_receipt_time | 2019-12-23 12:28:52.588839+01
  latest_end_lsn        | 0/4000148
  latest_end_time       | 2019-12-23 11:15:43.431657+01
  slot_name             | Ø
  sender_host           | /tmp
  sender_port           | 15441
  conninfo              | port=15441 application_name=s

When wal receiver is not running and shared memory WalRcv is reporting past
activity:

  $ select * from  pg_stat_get_wal_receiver();
  -[ RECORD 1 ]---------+-----------------------------
  pid                   | Ø
  status                | inactive
  receive_start_lsn     | 0/4000000
  receive_start_tli     | 1
  received_lsn          | 0/4000148
  received_tli          | 1
  last_msg_send_time    | 2019-12-23 12:28:52.588738+01
  last_msg_receipt_time | 2019-12-23 12:28:52.588839+01
  latest_end_lsn        | 0/4000148
  latest_end_time       | 2019-12-23 11:15:43.431657+01
  slot_name             | Ø
  sender_host           | /tmp
  sender_port           | 15441
  conninfo              | port=15441 application_name=s

I just have a doubt about whether to include the last three fields or set them
to NULL. Note that the information is present and might still be useful to
understand what the original source of a standby was before disconnection.

Regards,



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
Hi,

On Mon, 23 Dec 2019 15:38:16 +0100
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> My idea would be to return a row from pg_stat_get_wal_receiver() as soon as
> a wal receiver has been replicating during the uptime of the standby, no
> matter if there's one currently working or not. If no wal receiver is active,
> the "pid" field would be NULL and the "status" would reports eg. "inactive".
> All other fields would report their last known value as they are kept in
> shared memory WalRcv struct.

Please, find in attachment a patch implementing the above proposal.

Regards,

Attachment

Re: Fetching timeline during recovery

From
Kyotaro Horiguchi
Date:
At Fri, 3 Jan 2020 16:11:38 +0100, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> Hi,
> 
> On Mon, 23 Dec 2019 15:38:16 +0100
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> [...]
> > My idea would be to return a row from pg_stat_get_wal_receiver() as soon as
> > a wal receiver has been replicating during the uptime of the standby, no
> > matter if there's one currently working or not. If no wal receiver is active,
> > the "pid" field would be NULL and the "status" would reports eg. "inactive".
> > All other fields would report their last known value as they are kept in
> > shared memory WalRcv struct.
> 
> Please, find in attachment a patch implementing the above proposal.

At Mon, 23 Dec 2019 15:38:16 +0100, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
>  1. we could decide to remove this filter to expose the data even when no wal
>     receiver is active. It's the same behavior than pg_stat_subscription view.
>     It could introduce regression from tools point of view, but adds some
>     useful information. I would vote 0 for it.

A subscription exists once it is defined, regardless of whether it is
active. It would be strange to see a line in the view if replication is not
configured, but it is reasonable to show one if it is configured. We could
do that by checking PrimaryConnInfo. (I would vote +0.5 for it.)

>  2. we could extend it with new replayed lsn/tli fields. I would vote +1 for
>     it.

+1. As of now a wal receiver lives for just one timeline, because it exits
upon disconnection from the walsender when the master moves to a new
timeline.  That being said, we already have TLI columns for both the
starting and received-up-to LSNs, so we would also need one for the
replayed LSN for consistency.

The function is going to show "streaming" but conninfo is not shown
until the connection is established. That state is currently hidden by the
PID filtering of the view. We might need to keep the WALRCV_STARTING
state until the connection is established.

sender_host and sender_port have bogus values until the connection is
actually established, which is when conninfo is updated. They, as well as
conninfo, should be hidden until the connection is established too, I
think.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Tue, 07 Jan 2020 15:57:29 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 23 Dec 2019 15:38:16 +0100, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in 
> >  1. we could decide to remove this filter to expose the data even when no
> > wal receiver is active. It's the same behavior than pg_stat_subscription
> > view. It could introduce regression from tools point of view, but adds some
> >     useful information. I would vote 0 for it.  
> 
> A subscription exists since it is defined and regardless whether it is
> active or not. It is strange that we see a line in the view if
> replication is not configured. But it is reasonable to show if it is
> configured.  We could do that by checking PrimaryConnInfo. (I would
> vote +0.5 for it).

Thanks. I put this on hold for now; I'm waiting for some more opinions as
there's no strong position yet.

> >  2. we could extend it with new replayed lsn/tli fields. I would vote +1 for
> >     it.  
> 
> +1. As of now a walsender lives for just one timeline, because it ends
> for disconnection from walsender when the master moves to a new
> timeline.  That being said, we already have the columns for TLI for
> both starting and received-up-to LSN so we would need it also for
> replayed LSN for a consistent looking.

I added applied_lsn and applied_tli to the pg_stat_get_wal_receiver function
output columns.

However, note that applying xlog is the responsibility of the startup process,
not of the wal receiver. Is it OK for pg_stat_get_wal_receiver to return
stats not directly related to the wal receiver?

> The function is going to show "streaming" but conninfo is not shown
> until connection establishes. That state is currently hidden by the
> PID filtering of the view. We might need to keep the WALRCV_STARTING
> state until connection establishes.

Indeed, fixed.

> sender_host and sender_port have bogus values until connection is
> actually established when conninfo is changed. They as well as
> conninfo should be hidden until connection is established, too, I
> think.

Fixed as well.

Please find the new version of the patch in attachment.

Thank you for your review!

Attachment

Re: Fetching timeline during recovery

From
Michael Paquier
Date:
On Thu, Jan 23, 2020 at 05:54:08PM +0100, Jehan-Guillaume de Rorthais wrote:
> Please find the new version of the patch in attachment.

To be honest, I find the concept of this patch confusing.
pg_stat_wal_receiver is just a one-to-one mapping of the shared memory
state of the WAL receiver itself and shows data *if and only if* a WAL
receiver is running and ready to display any data, so I'd rather not
change its nature; it has nothing to do with the state of WAL being
applied by the startup process.  So this gets a -1 from me.

-   /*
-    * No WAL receiver (or not ready yet), just return a tuple with NULL
-    * values
-    */
-   if (pid == 0 || !ready_to_display)
-       PG_RETURN_NULL();
Note that this took a couple of attempts to get right, so I'd rather
not change this part of the logic on security grounds.

Isn't what you are looking for here a different system view which maps
directly to XLogCtl so that you can retrieve the status of the applied
WAL at recovery anytime, say pg_stat_recovery?

It is the end of the CF, I am marking this patch as returned with
feedback for now.
--
Michael

Attachment

Re: Fetching timeline during recovery

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 31 Jan 2020 15:12:30 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Jan 23, 2020 at 05:54:08PM +0100, Jehan-Guillaume de Rorthais wrote:
> > Please find the new version of the patch in attachment.
> 
> To be honest, I find the concept of this patch confusing.
> pg_stat_wal_receiver is just a one-one mapping with the shared memory
> state of the WAL receiver itself and show data *if and only if* a WAL
> receiver is running and iff it is ready to display any data, so I'd
> rather not change its nature

If you are talking about the pg_stat_wal_receiver view, I don't have a strong
opinion on this anyway, as I voted 0 when discussing it. My current patch
doesn't alter its nature.

> and it has nothing to do with the state of WAL being applied by the startup
> process.

Indeed, I felt adding these columns was a bad design, as stated in
my last mail. So I withdraw this.

> So this gets a -1 from me.

OK.

[...]
> Isn't what you are looking for here a different system view which maps
> directly to XLogCtl so as you can retrieve the status of the applied
> WAL at recovery anytime

My main objective is the received LSN/TLI. This is kept by WalRcv for
streaming, which is why pg_stat_wal_receiver seemed like the right place for
my need. But again, you are right, I shouldn't have added the replay-related
bits to it.

> say pg_stat_recovery?

I finally dug into this path. I was hoping we could find something
simpler and lighter, but the other solutions we studied so far (thanks all for
your time) were all discarded [1].

A new pg_stat_get_recovery() view might be useful for various monitoring
purposes. After poking around in the code, it seems the patch would be bigger
than the previous solutions, so I prefer discussing the specs first.

At a first glance, I would imagine the following columns as a minimal patch:

* source: stream, archive or pg_wal
* write/flush/replayed LSN
* write/flush/replayed TLI

This already has some heavy impact on the code. The source might be taken from
xlog.c:currentSource, so it should probably be included in XLogCtl to be
accessible from any backend.

As the replayed LSN/TLI comes from XLogCtl too, we would probably need a new
dedicated function to gather these fields plus currentSource under the same
info_lck.

Next, the write LSN/TLI is not accessible from WalRcv, only the flush
position. So either we do not include it, or we would probably need to replace
WalRcv->receivedUpto with the existing LogstreamResult.

Next, there are no stats about WAL shipping recovery. Restoring a WAL segment
from the archive does not update anything about the write/flush LSN/TLI. I
wonder if the wal receiver stats and the WAL shipping stats might be merged
together in the same refactored structure in shmem, as they might share a fair
number of fields? This would be pretty invasive in the code, but it feels
heavier to add another new struct in shmem just to track WAL shipping stats
when WalRcv already exists there.

Now, I think the following additional fields might be useful for monitoring.
But as this is outside my original scope, I prefer discussing first how useful
they might be:

* start_time: start time of the current source
* restored_count: total number of WAL segments restored. We might want to
  split this counter to track each method individually.
* last_received_time: last time we received something from the current source
* last_fail_time: last failure time, whatever the source

Thanks for reading up to here!

Regards,


[1] even if I still hope the pg_stat_get_wal_receiver approach might gather
some more positive votes :)