On Thu, May 23, 2024, at 5:54 AM, Amit Kapila wrote:
>
> Following the same line that simplifies the code, we can: (a) add a loop in
> check_subscriber() that waits until walreceiver is available on subscriber or
> (b) use a timeout. The main advantage of (a) is that the primary slot is already
> available but I'm afraid we need a escape mechanism for the loop (timeout?).
>
Sorry, it is not clear to me why we need any additional loop in
check_subscriber(), aren't we speaking about the problem in
check_publisher() function?
The idea is to use check_subscriber() to check pg_stat_walreceiver. Once this
view returns a row and primary_slot_name is set on standby, the referred
replication slot name should be active on primary. Hence, the query on
check_publisher() make sure that the referred replication slot is in use on
primary.
Why in the first place do we need to ensure that primary_slot_name is
active on the primary? You mentioned something related to WAL
retention but I don't know how that is related to this tool's
functionality. If at all, we are bothered about WAL retention on the
primary that should be the WAL corresponding to consistent_lsn
computed by setup_publisher() but this check doesn't seem to ensure
that.
Maybe it is a lot of checks. I'm afraid there isn't a simple way to get and
make sure the replication slot is used by the physical replication. I mean if
there is primary_slot_name = 'foo' on standby, there is no guarantee that the
replication slot 'foo' exists on primary. The idea is to get the exact
replication slot name used by physical replication to drop it. Once I posted a
patch it should be clear. (Another idea is to relax this check and rely only on
primary_slot_name to drop this replication slot on primary. The replication slot
might not exist and it shouldn't return an error in this case.)