Re: Fetching timeline during recovery - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Re: Fetching timeline during recovery
Date
Msg-id 20200211195110.0751aa6b@firost
Whole thread Raw
In response to Re: Fetching timeline during recovery  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Fri, 31 Jan 2020 15:12:30 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Jan 23, 2020 at 05:54:08PM +0100, Jehan-Guillaume de Rorthais wrote:
> > Please find the new version of the patch in attachment.
> 
> To be honest, I find the concept of this patch confusing.
> pg_stat_wal_receiver is just a one-one mapping with the shared memory
> state of the WAL receiver itself and show data *if and only if* a WAL
> receiver is running and iff it is ready to display any data, so I'd
> rather not change its nature

If you are talking about the pg_stat_wal_receiver view, I don't have a strong
opinion on this anyway as I vote 0 when discussing it. My current patch
doesn't alter its nature.

> and it has nothing to do with the state of WAL being applied by the startup
> process.

Indeed, I was feeling this was a bad design to add these columns, as stated in
my last mail. So I withdraw this.

> So this gets a -1 from me.

OK.

[...]
> Isn't what you are looking for here a different system view which maps
> directly to XLogCtl so as you can retrieve the status of the applied
> WAL at recovery anytime

My main objective is received LSN/TLI. This is kept by WalRcv for streaming.
That's why pg_stat_wal_receiver was the very good place for my need. But again,
you are right, I shouldn't have add the replied bits to it.

> say pg_stat_recovery?

I finally dig this path. I was in the hope we could find something
simpler and lighter, but other solutions we studied so far (thanks all for your
time) were all discarded [1].

A new pg_stat_get_recovery() view might be useful for various monitoring
purpose. After poking around in the code, it seems the patch would be bigger
than previous solutions, so I prefer discussing the specs first. 

At a first glance, I would imagine the following columns as a minimal patch:

* source: stream, archive or pg_wal
* write/flush/replayed LSN
* write/flush/replayed TLI

This has already some heavy impact in the code. Source might be taken from
xlog.c:currentSource, so it should probably be included in XLogCtl to be
accessible from any backend.

As replayed LSN/TLI comes from XLogCtl too, we might probably need a new
dedicated function to gather these fields plus currentSource under the same
info_lck.

Next, write lsn/tli is not accessible from WalRcv, only flush. So either we do
not include it, or we would probably need to replace WalRcv->receivedUpto with
existing LogstreamResult.

Next, there's no stats about wal shipping recovery. Restoring a WAL from
archive do not increment anything about write/flush LSN/TLI. I wonder if both
wal_receiver stats and WAL shipping stats might be merged together in the same
refactored structure in shmem, as they might share a fair number of field
together? This would be pretty invasive in the code, but I feel it's heavier to
add another new struct in shmem just to track WAL shipping stats whereas WalRcv
already exists there.

Now, I think the following additional field might be useful for monitoring. But
as this is out my my original scope, I prefer discussing how useful this might
be:

* start_time: start time of the current source
* restored_count: total number of WAL restored. We might want to split this
  counter to track each method individually.
* last_received_time: last time we received something from the current source
* last_fail_time: last failure time, whatever the source

Thanks for reading up to here!

Regards,


[1] even if I still hope the pg_stat_get_wal_receiver might still gather some
more positive vote :)



pgsql-hackers by date:

Previous
From: marcelo zen
Date:
Subject: Re: Just for fun: Postgres 20?
Next
From: Alvaro Herrera
Date:
Subject: Re: Portal->commandTag as an enum