Re: standby apply lag on inactive servers - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: standby apply lag on inactive servers |
Date | |
Msg-id | ee580c97-b552-4a9f-4bf3-2fed965ab3b9@oss.nttdata.com Whole thread Raw |
In response to | Re: standby apply lag on inactive servers (Alvaro Herrera <alvherre@2ndquadrant.com>) |
List | pgsql-hackers |
On 2020/01/31 23:47, Alvaro Herrera wrote: > On 2020-Jan-31, Fujii Masao wrote: >> On 2020/01/31 22:40, Alvaro Herrera wrote: >>> On 2020-Jan-31, Fujii Masao wrote: >>> >>>> You're thinking to apply this change to the back branches? Sorry >>>> if my understanding is not right. But I don't think that back-patch >>>> is ok because it changes the documented existing behavior >>>> of pg_last_xact_replay_timestamp(). So it looks like the behavior >>>> change not a bug fix. >>> >>> Yeah, I am thinking in backpatching it. The documented behavior is >>> already not what the code does. >> >> Maybe you thought this because getRecordTimestamp() extracts the >> timestamp from even WAL record of a restore point? That is, you're >> concerned about that pg_last_xact_replay_timestamp() returns the >> timestamp of not only commit/abort record but also restore point one. >> Right? > > right. > >> As far as I read the code, this problem doesn't occur because >> SetLatestXTime() is called only for commit/abort records, in >> recoveryStopsAfter(). No? > > ... uh, wow, you're right about that too. IMO this is extremely > fragile, easy to break, and under-documented. Yeah, it's worth improving the code. > But you're right, there's > no bug there at present. > >>> Do you have a situation where this >>> change would break something? If so, can you please explain what it is? >> >> For example, use the return value of pg_last_xact_replay_timestamp() >> (and also the timestamp in the log message output at the end of >> recovery) as a HINT when setting recovery_target_time later. > > Hmm. > > I'm not sure how you would use it in that way. I mean, I understand how > it *can* be used that way, but it seems too fragile to be done in > practice, in a scenario that's not just laboratory games. > >> Use it to compare with the timestamp retrieved from the master server, >> in order to monitor the replication delay. > > That's precisely the use case that I'm aiming at. The timestamp > currently is not useful because this usage breaks when the primary is > inactive (no COMMIT records occur). During such periods of inactivity, > CHECKPOINT records would keep the "last xtime" current. This has > actually happened in a production setting, it's not a thought > experiment. I've heard that someone periodically generates dummy tiny transactions (say, every minute), as a band-aid solution, to avoid inactive primary. Of course, this is not a perfect solution. The idea that I proposed previously was to introduce pg_last_xact_insert_timestamp() [1] into core. This function returns the timestamp of commit / abort records in *primary* side. So we can retrieve that timestamp from the primary (e.g., by using dblink) and compare its result with pg_last_xact_replay_timestamp() to calculate the delay in the standby. Another idea is to include the commit / abort timestamp in primary-keepalive-message that periodially sent from the primary to the standby. Then if we introduce the function returning that timestamp, in the standby side, we can easily compare the commit / abort timestamps taken from both primary and standby, in the standby. [1] https://www.postgresql.org/message-id/CAHGQGwF3ZjfuNEj5ka683KU5rQUBtSWtqFq7g1X0g34o+JXWBw@mail.gmail.com Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
pgsql-hackers by date: