On Tue, Dec 5, 2023 at 7:38 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 12/5/23 12:32 PM, Amit Kapila wrote:
> > On Tue, Dec 5, 2023 at 10:38 AM shveta malik <shveta.malik@gmail.com> wrote:
> >>
> >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
> >> <bertranddrouvot.pg@gmail.com> wrote:
> >>>>
> >>>
> >>> Maybe another option could be to give the walreceiver a way to let the slot sync
> >>> worker know that it (the walreceiver) was not able to start due to a non-existent
> >>> replication slot on the primary? (that way we'd avoid the slot sync worker having
> >>> to talk to the primary).
> >>
> >> A few points:
> >> 1) I think if we do it, we should do it in a generic way i.e. slotsync
> >> worker should go to no-op if walreceiver is not able to start due to
> >> any reason and not only due to invalid primary_slot_name.
> >> 2) Secondly, slotsync worker needs to make sure it has synced the
> >> slots so far i.e. worker should not go to no-op immediately on seeing
> >> missing WalRcv process if there are pending slots to be synced.
> >>
> >
> > Won't it be better to just ping and check the validity of
> > 'primary_slot_name' at the start of slot-sync and whenever it is
> > changed? I think it would be better to avoid adding a dependency on
> > walreceiver state as that sounds like needless complexity.
>
> I think the overall extra complexity is linked to the fact that we first
> want to ensure that the slots are in sync before shutting down the
> sync slot worker.
>
> I think that talking to the primary or relying on the walreceiver state
> is "just" what would trigger the decision to shut down the sync slot worker.
>
> Relying on the walreceiver state looks better to me (as it avoids possibly
> useless round trips with the primary).
>
But the round trip will happen only once at the beginning, and again
if the user changes the GUC primary_slot_name, which shouldn't be that
often.
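To make the cost concrete, here is a toy model (plain Python, not the
patch; SlotSyncWorker, reload_config and check_on_primary are made-up
names for illustration). It only assumes that the worker remembers the
last primary_slot_name it validated and contacts the primary again only
when that value differs:

```python
_UNSET = object()  # sentinel: no slot name has been validated yet


class SlotSyncWorker:
    """Toy model: validate primary_slot_name at start and on change only."""

    def __init__(self, slot_name, check_on_primary):
        # check_on_primary is a hypothetical callable standing in for one
        # round trip to the primary; it returns True if the slot exists.
        self._check = check_on_primary
        self.round_trips = 0
        self.slot_name = _UNSET
        self.valid = False
        self.reload_config(slot_name)  # the one check at startup

    def reload_config(self, slot_name):
        """Called on every config reload; re-checks only if the name changed."""
        if slot_name == self.slot_name:
            return self.valid  # unchanged GUC: no round trip
        self.slot_name = slot_name
        self.round_trips += 1
        self.valid = self._check(slot_name)
        return self.valid
```

With this shape, repeated reloads that leave primary_slot_name untouched
never reach the primary; only the initial check and real changes do.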
> Also the walreceiver could be down for multiple reasons, and I think there
> is no point in having a sync slot worker running if the slots are in sync and
> there is no walreceiver running (even if primary_slot_name is a valid one).
>
I feel that is indirectly relying on the fact that the primary won't
advance logical slots unless the physical standby has consumed data.
Now, it is possible that the slot-sync worker lags behind and still
needs to sync more data for slots, in which case it makes sense for
the slot-sync worker to stay alive. I think we could avoid checking
the walreceiver status for as long as we can get more data, to avoid
the problem I mentioned, but that doesn't sound like a clean way to
achieve our purpose.
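To illustrate the lagging case, a toy sketch (plain Python, not the
patch; the function names are made up and LSNs are simplified to plain
integers). It models the rule "keep the worker alive if the walreceiver
is running or any slot is still behind the primary":

```python
def slots_pending_sync(remote_lsns, local_lsns):
    """Return the names of slots whose local copy is behind the primary's.

    Both arguments map slot name -> LSN (an int here, for simplicity).
    A slot missing locally counts as pending.
    """
    return sorted(
        name
        for name, remote_lsn in remote_lsns.items()
        if local_lsns.get(name, -1) < remote_lsn
    )


def worker_should_stay_alive(walreceiver_running, remote_lsns, local_lsns):
    """Stay alive if the walreceiver runs or some slot still needs syncing."""
    return walreceiver_running or bool(slots_pending_sync(remote_lsns, local_lsns))
```

Note that in this sketch the worker still needs the primary's view of the
slots (remote_lsns) to know whether it is caught up, so the walreceiver
state alone does not settle the question.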
--
With Regards,
Amit Kapila.