On Wed, May 22, 2024, at 8:19 AM, Amit Kapila wrote:
> >
> > v2-0001: not changed
> >
> Shouldn't we modify it as per the suggestion given in the email [1]? I
> am wondering if we can entirely get rid of checking the primary
> business and simply rely on recovery_timeout and keep checking
> server_is_in_recovery(). If so, we can modify the test to use
> non-default recovery_timeout (say 180s or something similar if we have
> used it at any other place). As an additional check we can ensure that
> consistent_lsn is present on standby.

That's exactly what I want to propose. Tomas convinced me off-list that less is
better when we don't have a useful recovery progress reporting mechanism to
confirm that the server is still working on the recovery and that we should
keep waiting.

> > v2-0002: not changed
> >
> We have added more tries to see if the primary_slot_name becomes
> active but I think it is still fragile because it is possible on slow
> machines that the required slot didn't become active even after more
> retries. I have raised the same comment previously [2] and asked an
> additional question but didn't get any response.

Following the same line of simplifying the code, we can: (a) add a loop in
check_subscriber() that waits until the walreceiver is available on the
subscriber or (b) use a timeout. The main advantage of (a) is that the primary
slot is already available, but I'm afraid we need an escape mechanism for the
loop (timeout?).
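
To make the "loop plus escape mechanism" idea concrete, here is a rough sketch
of a polling loop bounded by a timeout. It is not taken from the patch: the
names wait_until(), wait_condition, and fake_condition() are hypothetical, and
it uses plain POSIX nanosleep() rather than any PostgreSQL helper. In the real
code the condition would be something like "is the walreceiver present?" or
"is the server still in recovery?"; here a stand-in counter plays that role.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns true once the awaited condition holds. */
typedef bool (*wait_condition) (void *arg);

/*
 * Poll "cond" every interval_ms milliseconds. Return true as soon as it
 * holds, or false once timeout_ms milliseconds have been spent sleeping.
 * The timeout is the escape mechanism: the loop cannot spin forever.
 */
static bool
wait_until(wait_condition cond, void *arg, long timeout_ms, long interval_ms)
{
	long		elapsed_ms = 0;

	/* A zero interval would never advance elapsed_ms. */
	assert(interval_ms > 0);

	for (;;)
	{
		struct timespec ts;

		if (cond(arg))
			return true;		/* condition met, stop waiting */

		if (elapsed_ms >= timeout_ms)
			return false;		/* give up: timeout reached */

		ts.tv_sec = interval_ms / 1000;
		ts.tv_nsec = (interval_ms % 1000) * 1000000L;
		nanosleep(&ts, NULL);
		elapsed_ms += interval_ms;
	}
}

/* Stand-in condition for illustration: becomes true on the N-th call. */
struct fake_state
{
	int			calls;
	int			ready_on_call;
};

static bool
fake_condition(void *arg)
{
	struct fake_state *st = arg;

	st->calls++;
	return st->calls >= st->ready_on_call;
}
```

The elapsed time only counts the sleeps, so a slow check makes the real wait
longer than timeout_ms. That seems acceptable for a sketch, but a real
implementation would probably compare against a wall-clock deadline instead.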
I'll summarize all issues as soon as I finish the review of sync slot support. I
think we should avoid new development if we judge that the item can be
documented as a limitation for this version. Nevertheless, I will share patches
so you can give your opinion on whether it is an open item or new development.