Thread: Re: doc: Mention clock synchronization recommendation for hot_standby_feedback
From: Amit Kapila
Date:
On Thu, Dec 5, 2024 at 3:14 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
>
> One of our customers ran into a very odd case, where hot standby feedback backend_xmin propagation stopped working due to major (hours/days) clock time shifts on hypervisor-managed VMs. This happens (and is fully reproducible) e.g. in scenarios where standby connects and its own VM is having time from the future (relative to primary) and then that time goes back to "normal". In such situation "sends hot_standby_feedback xmin" timestamp messages are stopped being transferred, e.g.:
>
> 2024-12-05 02:02:35 UTC [6002]: db=,user=,app=,client= DEBUG: sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
> 2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015230 flush 6/E9015230 apply 6/E9015230
> 2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG: sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
> <-- clock readjustment and no further "sending hot standby feedback"
> ...
>
> I can share reproduction steps if anyone is interested. This basically happens due to usage of TimestampDifferenceExceeds() in XLogWalRcvSendHSFeedback(), but I bet there are other similar scenarios.
>

We started to use a different mechanism in HEAD. See XLogWalRcvSendHSFeedback().

> What I was kind of surprised about was the lack of recommendation for having primary/standby to have clocks synced when using hot_standby_feedback, but such a thing is mentioned for recovery_min_apply_delay. So I would like to add at least one sentence to hot_standby_feedback to warn about this too, patch attached.
>

IIUC, this issue occurred not because the primary and standby clocks are unsynchronized, but because the clock on the standby moved backward. This is quite unlike 'recovery_min_apply_delay', where non-synchronization of clocks between primary and standby can lead to unexpected results.
This is because, for hot_standby_feedback, we don't compare any time on the primary with the time on the standby. If this understanding is correct, then the wording proposed by your patch should be changed accordingly.

-- 
With Regards,
Amit Kapila.
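
To illustrate why a backward clock step on a single machine is enough to stall an interval check of this kind, here is a minimal sketch. It is a simplified model and not the PostgreSQL source: interval_elapsed, feedback_state and maybe_send_feedback are hypothetical names, and timestamps are assumed to be plain microsecond counts read from the local wall clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microseconds since an arbitrary epoch; stands in for a wall-clock reading. */
typedef int64_t ts_usec;

/*
 * Hypothetical helper: true once at least interval_ms milliseconds separate
 * start from stop. A negative difference (stop earlier than start) never
 * satisfies the check.
 */
static bool
interval_elapsed(ts_usec start, ts_usec stop, int interval_ms)
{
    assert(interval_ms >= 0);
    return (stop - start) >= (int64_t) interval_ms * 1000;
}

/* Hypothetical sender state: wall-clock time of the last feedback message. */
typedef struct
{
    ts_usec last_send;
} feedback_state;

/*
 * Returns true, and records the send time, if feedback should go out now.
 * Only local readings are compared; no timestamp from a peer is involved.
 */
static bool
maybe_send_feedback(feedback_state *st, ts_usec now, int interval_ms)
{
    if (!interval_elapsed(st->last_send, now, interval_ms))
        return false;
    st->last_send = now;
    return true;
}
```

In this model, if last_send was recorded while the local clock read two hours ahead and the clock then steps back, every later difference is negative, so nothing is sent until the clock passes the recorded time again. No second machine's clock takes part in the comparison.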