Re: Fetching timeline during recovery - Mailing list pgsql-hackers
From | Jehan-Guillaume de Rorthais |
---|---|
Subject | Re: Fetching timeline during recovery |
Date | |
Msg-id | 20191223153816.78190461@firost Whole thread Raw |
In response to | Re: Fetching timeline during recovery (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Fetching timeline during recovery
Re: Fetching timeline during recovery |
List | pgsql-hackers |
On Mon, 23 Dec 2019 12:36:56 +0900 Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Dec 20, 2019 at 11:14:28AM +0100, Jehan-Guillaume de Rorthais wrote: > > Yes, that would be great but sadly, it would introduce a regression on > > various tools relying on them. At least, the one doing "select *" or most > > probably "select func()". > > > > But anyway, adding 5 funcs is not a big deal neither. Too bad they are so > > close to existing ones though. > > Consistency of the data matters a lot if we want to build reliable > tools on top of them in case someone would like to compare the various > modes, and using different functions for those fields creates locking > issues (somewhat the point of Fujii-san upthread?). To sum up: the original patch was about fetching the current timeline of a standby from memory without relying on the asynchronous controlfile or pg_stat_get_wal_receiver() which only shows data when the wal_receiver is running. Fujii-san was pointing out we must fetch both the received LSN and its timeline with the same lock so they are consistent. Michael is now discussing about fetching multiple LSN and their timeline, while keeping them consistent, eg. received+tli and applied+tli. Thank you for pointing this out. I thought about various way to deal with this concern and would like to discuss/defend a new option based on existing pg_stat_get_wal_receiver() function. The only problem I'm facing with this function is that it returns a full NULL record if no wal receiver is active. My idea would be to return a row from pg_stat_get_wal_receiver() as soon as a wal receiver has been replicating during the uptime of the standby, no matter if there's one currently working or not. If no wal receiver is active, the "pid" field would be NULL and the "status" would reports eg. "inactive". All other fields would report their last known value as they are kept in shared memory WalRcv struct. From the monitoring and HA point of view, we are now able to know that a wal receiver existed, the lsn it has stopped, on what timeline, all consistent with the same lock. That answer my original goal. We could extend this with two more fields about replayed lsn and timeline to address last Michael's concern if we decide it's really needed (and I think it's a valid concern for eg. monitoring tools). There's some more potential discussion about the pg_stat_wal_receiver view which relies on pg_stat_get_wal_receiver(). My proposal do not introduce regression with it as the view already filter out NULL data using "WHERE s.pid IS NOT NULL". But: 1. we could decide to remove this filter to expose the data even when no wal receiver is active. It's the same behavior than pg_stat_subscription view. It could introduce regression from tools point of view, but adds some useful information. I would vote 0 for it. 2. we could extend it with new replayed lsn/tli fields. I would vote +1 for it. On the "dark" side of this proposal, we do not deal with the primary side. We still have no way to fetch various lsn+tli from the WAL Writer. However, I included pg_current_wal_lsn_tl() in my original patch only for homogeneity reason and the discussion slipped on this side while paying attention to the user facing function logic and homogeneity. If this discussion decide this is a useful feature, I think it could be addressed in another patch (and I volunteer to deal with it). Bellow the sum up this 6th proposition with examples. When wal receiver never started (same as today): -[ RECORD 1 ]---------+-- pid | Ø status | Ø receive_start_lsn | Ø receive_start_tli | Ø received_lsn | Ø received_tli | Ø last_msg_send_time | Ø last_msg_receipt_time | Ø latest_end_lsn | Ø latest_end_time | Ø slot_name | Ø sender_host | Ø sender_port | Ø conninfo | Ø When wal receiver is active: $ select * from pg_stat_get_wal_receiver(); -[ RECORD 1 ]---------+----------------------------- pid | 8576 status | streaming receive_start_lsn | 0/4000000 receive_start_tli | 1 received_lsn | 0/4000148 received_tli | 1 last_msg_send_time | 2019-12-23 12:28:52.588738+01 last_msg_receipt_time | 2019-12-23 12:28:52.588839+01 latest_end_lsn | 0/4000148 latest_end_time | 2019-12-23 11:15:43.431657+01 slot_name | Ø sender_host | /tmp sender_port | 15441 conninfo | port=15441 application_name=s When wal receiver is not running and shared memory WalRcv is reporting past activity: $ select * from pg_stat_get_wal_receiver(); -[ RECORD 1 ]---------+----------------------------- pid | Ø status | inactive receive_start_lsn | 0/4000000 receive_start_tli | 1 received_lsn | 0/4000148 received_tli | 1 last_msg_send_time | 2019-12-23 12:28:52.588738+01 last_msg_receipt_time | 2019-12-23 12:28:52.588839+01 latest_end_lsn | 0/4000148 latest_end_time | 2019-12-23 11:15:43.431657+01 slot_name | Ø sender_host | /tmp sender_port | 15441 conninfo | port=15441 application_name=s I just have a doubt about including the last three fields or setting them to NULL. Note that the information is present and might still be useful to understand what was the original source of a standby before disconnection. Regards,
pgsql-hackers by date: