Thread: Re: doc: Mention clock synchronization recommendation for hot_standby_feedback

On Thu, Dec 5, 2024 at 3:14 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> One of our customers ran into a very odd case, where hot standby feedback backend_xmin propagation stopped working
> due to major (hours/days) clock time shifts on hypervisor-managed VMs. This happens (and is fully reproducible) e.g. in
> scenarios where standby connects and its own VM is having time from the future (relative to primary) and then that time
> goes back to "normal". In such situation "sends hot_standby_feedback xmin" timestamp messages are stopped being
> transferred, e.g.:
>
> 2024-12-05 02:02:35 UTC [6002]: db=,user=,app=,client= DEBUG:  sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
> 2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG:  sending write 6/E9015230 flush 6/E9015230 apply 6/E9015230
> 2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG:  sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
> <-- clock readjustment and no further "sending hot standby feedback"
...
>
> I can share reproduction steps if anyone is interested. This basically happens due to usage of
> TimestampDifferenceExceeds() in XLogWalRcvSendHSFeedback(), but I bet there are other similar scenarios.
>

We started to use a different mechanism in HEAD. See XLogWalRcvSendHSFeedback().

> What I was kind of surprised about was the lack of recommendation for having primary/standby to have clocks synced
> when using hot_standby_feedback, but such a thing is mentioned for recovery_min_apply_delay. So I would like to add at
> least one sentence to hot_standby_feedback to warn about this too, patch attached.
>

IIUC, this issue doesn't occur because the primary and standby clocks
are not synchronized. It happened because the clock on the standby moved
backward. This is quite unlike 'recovery_min_apply_delay', where
non-synchronization of clocks between primary and standby can lead to
unexpected results. For hot_standby_feedback, we don't compare any time
on the primary with the time on the standby; only the standby's own
clock is involved. If this understanding is correct, then the wording
proposed by your patch should be changed accordingly.
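
To illustrate the distinction, here is a minimal Python sketch, not
PostgreSQL code. It assumes the interval check reduces to
"stop - start >= interval" and that the sender remembers the timestamp
of its last send; FeedbackSender, maybe_send, and simulate are
hypothetical names used only for this example. Only the standby's own
clock appears in the model, so the stall comes purely from that clock
stepping backward.

```python
SEC = 1_000_000  # microseconds per second


def timestamp_difference_exceeds(start_us, stop_us, msec):
    """Assumed model of an interval check: True once stop - start >= msec."""
    return (stop_us - start_us) >= msec * 1000


class FeedbackSender:
    """Hypothetical stand-in for a periodic feedback sender on the standby."""

    def __init__(self, interval_s):
        self.interval_ms = interval_s * 1000
        self.last_send_us = 0  # timestamp of the last feedback message

    def maybe_send(self, now_us):
        # Skip sending unless a full interval has elapsed since the last send.
        if not timestamp_difference_exceeds(
            self.last_send_us, now_us, self.interval_ms
        ):
            return False
        self.last_send_us = now_us
        return True


def simulate():
    sender = FeedbackSender(interval_s=10)
    base = 1_000_000 * SEC  # arbitrary "correct" time on the standby
    ahead = 3600 * SEC      # standby clock initially one hour in the future

    results = []
    # Clock is ahead: feedback goes out every 10 seconds as expected.
    results.append(sender.maybe_send(base + ahead))
    results.append(sender.maybe_send(base + ahead + 10 * SEC))
    # Clock steps back by one hour: the difference is now negative,
    # so the interval check keeps failing.
    results.append(sender.maybe_send(base + 20 * SEC))
    results.append(sender.maybe_send(base + 30 * SEC))
    # Sending resumes only once the clock passes last send + interval.
    results.append(sender.maybe_send(base + ahead + 20 * SEC))
    return results


if __name__ == "__main__":
    print(simulate())
```

In this model no primary timestamp is ever consulted; the feedback
stops for as long as the standby's clock stays behind the remembered
send time.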

--
With Regards,
Amit Kapila.