Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Synchronizing slots from primary to standby
Msg-id 84a63890-b2fe-4826-8c77-aa6ad9bcd460@gmail.com
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
Hi,

On 10/4/23 6:26 AM, shveta malik wrote:
> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Oct 3, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote:
>>>
>>> On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 10/3/23 12:54 PM, Amit Kapila wrote:
>>>>> On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand
>>>>> <bertranddrouvot.pg@gmail.com> wrote:
>>>>>>
>>>>>> On 9/29/23 1:33 PM, Amit Kapila wrote:
>>>>>>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
>>>>>>> <bertranddrouvot.pg@gmail.com> wrote:
>>>>>>>>
>>>>>>>
>>>>>>>> - probably open corner cases like: what if a standby is down? would that mean
>>>>>>>> that synchronize_slot_names not being sent to the primary would allow the decoding
>>>>>>>> on the primary to go ahead?
>>>>>>>>
>>>>>>>
>>>>>>> Good question. BTW, irrespective of whether we have
>>>>>>> 'standby_slot_names' parameters or not, how should we behave if
>>>>>>> standby is down? Say, if 'synchronize_slot_names' is only specified on
>>>>>>> standby then in such a situation primary won't be even aware that some
>>>>>>> of the logical walsenders need to wait.
>>>>>>
>>>>>> Exactly, that's why I was thinking of keeping standby_slot_names to address
>>>>>> this scenario. In such a case one could simply decide to keep or remove
>>>>>> the associated physical replication slot from standby_slot_names. Keeping it would
>>>>>> mean "wait" and removing it would mean "allow decoding on the primary".
>>>>>>
>>>>>>> OTOH, one can say that users
>>>>>>> should configure 'synchronize_slot_names' on both primary and standby
>>>>>>> but note that this value could be different for different standbys,
>>>>>>> so we can't configure it on primary.
>>>>>>>
>>>>>>
>>>>>> Yeah, I think that's a good use case for standby_slot_names, what do you think?
>>>>>>
>>>>>
>>>>> But, even if we keep 'standby_slot_names' for this purpose, the
>>>>> primary doesn't know the value of 'synchronize_slot_names' once the
>>>>> standby is down and/or the primary is restarted. So, how will we know
>>>>> which logical WAL senders need to wait for 'standby_slot_names'?
>>>>>
>>>>
>>>> Yeah right, I also think we'd need:
>>>>
>>>> - synchronize_slot_names on both primary and standby
>>>>
>>>> But now we would need to take care of different standbys having different values
>>>> (as you said up-thread)...
>>>>
>>>> Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor
>>>> synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say
>>>> "logical_slot_name:physical_slot".
>>>>
>>>> I think this GUC would help us define each walsender behavior (should the standby(s)
>>>> be up or down):
>>>>
>>>
>>> It may help in defining the walsender's behaviour better for sure. But
>>> the problem I see once we start defining sync-slot-names on primary
>>> (in any form whether as independent GUC or as above mapping GUC) is
>>> that it needs to be then in sync with standbys, as each standby for
>>> sure needs to maintain its own sync-slot-names GUC to make it aware of
>>> what all it needs to sync.
>>
>> Yes, I also think so. Also, defining such a GUC when the user wants to
>> sync all the slots, which would normally be the case, would be a
>> nightmare for the users.
>>
>>>
>>> This brings us to the original question of
>>> how do we actually keep these configurations in sync between primary
>>> and standby if we plan to maintain it on both?
>>>
>>>
>>>> - don't wait if its associated logical_slot is not listed in this GUC
>>>> - or wait based on its associated "list" of mapped physical slots (would probably
>>>> have to deal with the min restart_lsn for all the corresponding mapped ones).
>>>>
>>>> I don't think we can avoid having to define at least one GUC on the primary (at least to
>>>> handle the case of standby(s) being down).
>>>>
>>
>> How about an alternate scheme where we define sync_slot_names on
>> standby but then store the physical_slot_name in the corresponding
>> logical slot (ReplicationSlotPersistentData) to be synced? So, the
>> standby will send the list of 'sync_slot_names' and the primary will
>> add the physical standby's slot_name in each of the corresponding
>> sync_slot. Now, if we do this then even after restart, we should be
>> able to know for which physical slot each logical slot needs to wait.
>> We can even provide an SQL API to reset the value of
>> standby_slot_names in logical slots as a way to unblock decoding in
>> case of emergency (for example, when the corresponding physical standby
>> never comes up).
>>
> 
> 
> Looks like a better approach to me. It solves most of the pain points like:
> 1) Avoids the need for multiple GUCs
> 2) Primary and standby need not worry about staying in sync, as they
> would if we maintained the sync-slot-names GUC on both
> 3) User still gets the flexibility to remove a standby from the wait-list
> of the primary's logical walsenders using the reset SQL API.
> 

Fully agree.
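
To make the waiting rule concrete, here is a minimal sketch of the decision a logical walsender would make under this scheme, using the "min restart_lsn of the mapped physical slots" idea mentioned up-thread. It is only an illustration, not patch code: the two function names are hypothetical, and XLogRecPtr / InvalidXLogRecPtr are redefined locally as stand-ins for the PostgreSQL definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (an LSN is a 64-bit integer). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: smallest restart_lsn among the physical slots that a
 * logical slot is recorded as waiting for. Returns InvalidXLogRecPtr if the
 * list is empty or if any slot has no valid restart_lsn yet.
 */
static XLogRecPtr
min_standby_restart_lsn(const XLogRecPtr *restart_lsns, size_t n)
{
    XLogRecPtr  min = InvalidXLogRecPtr;

    assert(n == 0 || restart_lsns != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (restart_lsns[i] == InvalidXLogRecPtr)
            return InvalidXLogRecPtr;
        if (min == InvalidXLogRecPtr || restart_lsns[i] < min)
            min = restart_lsns[i];
    }
    return min;
}

/*
 * Hypothetical decision function: may the logical walsender send changes up
 * to target_lsn? A slot with no recorded physical slots never waits, which is
 * also the state after the proposed "reset" API has been called.
 */
static bool
logical_decoding_may_proceed(XLogRecPtr target_lsn,
                             const XLogRecPtr *restart_lsns, size_t n)
{
    XLogRecPtr  min;

    if (n == 0)
        return true;

    min = min_standby_restart_lsn(restart_lsns, n);
    return min != InvalidXLogRecPtr && min >= target_lsn;
}
```

With this rule a standby that is down simply stops advancing its physical slot, so the logical walsender keeps waiting until the standby catches up or the list is reset.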

> Now some initial thoughts:
> 1) Since each logical slot may need to be synced by multiple
> physical standbys, ReplicationSlotPersistentData needs to
> hold a list of standby names. This raises the question of how
> much we should allocate initially in shared memory. Should it be room for
> max_replication_slots names (worst-case scenario) in each
> ReplicationSlotPersistentData to hold the physical-standby names?
> 

Yeah, and even if we do the opposite, meaning add the 'to-sync'
logical replication slots to the ReplicationSlotPersistentData of the physical
slot(s), the questions still remain (as a physical standby could want to
sync multiple slots).
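
For a rough feel of the worst-case sizing, here is a small sketch of a fixed-size list of physical slot names per logical slot. Everything in it is an assumption for illustration: SlotWaitList and its functions are hypothetical and not part of ReplicationSlotPersistentData, MAX_WAIT_STANDBYS is set to 10 to mirror the default max_replication_slots, and NAMEDATALEN is redefined locally with its default value of 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64          /* local copy of PostgreSQL's default */
#define MAX_WAIT_STANDBYS 10    /* assumption: mirrors default max_replication_slots */

/* Hypothetical per-logical-slot list of physical slots to wait for. */
typedef struct SlotWaitList
{
    int         nstandbys;
    char        standby_slots[MAX_WAIT_STANDBYS][NAMEDATALEN];
} SlotWaitList;

/* Emergency unblock: forget every physical slot this logical slot waits for. */
static void
wait_list_reset(SlotWaitList *wl)
{
    assert(wl != NULL);
    memset(wl, 0, sizeof(*wl));
}

/*
 * Record a physical slot name. Returns true if the name is stored (or was
 * already present), false if the list is full or the name is too long.
 */
static bool
wait_list_add(SlotWaitList *wl, const char *physical_slot)
{
    assert(wl != NULL && physical_slot != NULL);

    if (strlen(physical_slot) >= NAMEDATALEN)
        return false;

    for (int i = 0; i < wl->nstandbys; i++)
    {
        if (strcmp(wl->standby_slots[i], physical_slot) == 0)
            return true;
    }

    if (wl->nstandbys >= MAX_WAIT_STANDBYS)
        return false;

    strcpy(wl->standby_slots[wl->nstandbys++], physical_slot);
    return true;
}

/* Shared memory needed if every one of nslots slots carries a full list. */
static size_t
wait_list_shmem_bytes(size_t nslots)
{
    return nslots * sizeof(SlotWaitList);
}
```

With these assumed values each slot carries at least 10 * 64 = 640 bytes of names, so 10 slots need a bit over 6400 bytes. The cost grows quadratically with max_replication_slots, and it is the same whichever side (logical or physical slot) holds the list.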

> 2) If standby sends '*', then we need to update each logical-slot with
> that standby-name. Or do we have a better way to deal with '*'? Need to
> think more on this.
> 
> JFYI, along similar lines, currently in ReplicationSlotPersistentData,
> we are maintaining a flag for slot-sync feature which is:
> 
>          bool            synced; /* Is this a slot created by a sync-slot worker? */
> 
> This flag currently holds significance only on physical-standby. This
> has been added to distinguish between a slot created by user for
> logical decoding purpose and the ones being synced from primary. 

BTW, what about having this "user visible" through pg_replication_slots?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


