Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uDv01ctC3z7fV3cbvVw8o+micn7zkD+VBxFm4TnsQh3OQ@mail.gmail.com |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Fri, Oct 27, 2023 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > On 10/25/23 5:00 AM, shveta malik wrote:
> > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
> > > <bertranddrouvot.pg@gmail.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 10/23/23 2:56 PM, shveta malik wrote:
> > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> > >>> <bertranddrouvot.pg@gmail.com> wrote:
> > >>
> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there
> > >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior
> > >>>> for V1?
> > >>>>
> > >>>
> > >>> I think for the slotsync workers case, we should reduce the naptime in
> > >>> the launcher to say 30sec and retain the default one of 3mins for
> > >>> subscription apply workers. Thoughts?
> > >>>
> > >>
> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> > >> API on the standby that would refresh the list of sync slot at wish, thoughts?
> > >>
> > >
> > > Do you mean API to refresh list of DBIDs rather than sync-slots?
> > > As per current design, launcher gets DBID lists for all the failover
> > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
> >
> > I mean an API to get a newly created slot on the primary being created/synced on
> > the standby at wish.
> >
> > Also let's imagine this scenario:
> >
> > - create logical_slot1 on the primary (and don't start using it)
> >
> > Then on the standby we'll get things like:
> >
> > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin(752) to pass local slot LSN (0/C0049530) and and catalog xmin (754)
> >
> > That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot
> > restart_lsn to a value < at the corresponding restart_lsn slot on the primary.
> >
> > - create logical_slot2 on the primary (and start using it)
> >
> > Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary
> > that would produce things like:
> > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754)
> >
>
> Slight correction to above. As soon as we start activity on
> logical_slot2, it will impact all the slots on primary, as the WALs
> are consumed by all the slots. So even if there is activity on
> logical_slot2, logical_slot1 creation on standby will be unblocked and
> it will then move to logical_slot2 creation. eg:
>
> --on standby:
> 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot
> "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local
> slot LSN (0/3C979A8) and and catalog xmin (756)
>
> on primary:
> newdb1=# select now();
>                now
> ----------------------------------
>  2023-10-27 15:15:51.504835+05:30
> (1 row)
>
> --activity on mysubnew1_3
> newdb1=# insert into tab1_3 values(1);
> INSERT 0 1
> newdb1=# select now();
>                now
> ----------------------------------
>  2023-10-27 15:15:54.651406+05:30
>
>
> --on standby, mysubnew1_1 is unblocked.
> 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot
> "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now
> passed local slot LSN (0/3C979A8) and catalog xmin (756)
>
> My Setup:
> mysubnew1_1 -->mypubnew1_1 -->tab1_1
> mysubnew1_3 -->mypubnew1_3-->tab1_3
>
> thanks
> Shveta

PFA v26 patches. The changes are:

1) 'Failover' in the main slot is now set only when the table
synchronization phase is finished. So even when failover is enabled
for a subscription, the internal failover state temporarily remains
"pending" until the initialization phase completes.

2) If a standby is down but its slot is listed in standby_slot_names,
we now emit a warning while waiting for that standby.

3) Fixed a bug where pg_logical_slot_get_changes() was resetting the
failover property of the slot. Thanks Ajin for providing the fix.

4) Fixed a bug where standby_slot_names_list was not initialized for
non-walsender cases, which let pg_logical_slot_get_changes() proceed
without waiting for standbys.

5) Fixed a bug where standby_slot_names_list was freed (due to the
free of the per_query context in non-walsender cases) but not set to
NULL, so the next call used the freed pointer and crashed.

6) Improved wait_for_primary_slot_catchup(): we now also fetch the
remote slot's conflicting (invalidation) state, and abort the wait
and the slot creation if the slot on the primary is invalidated.

7) Slot-sync workers now wait for the cascading standby's
confirmation before updating logical synced slots on the first
standby.

The first 5 changes are in patch001, the 6th one is in patch002. For
the 7th, I have created a new patch (003) to separate out the
additional changes needed for cascading standbys.

==========

Open questions regarding the change in pt 1 above:

a) I think we should restrict 'alter-sub set failover' while the
failover-state is 'p' (pending), i.e. while table-sync is still in
progress. Once table-sync is over, toggling 'failover' using
alter-subscription should be allowed.

b) Currently I have restricted 'alter subscription.. refresh
publication with copy=true' when failover=true (along the same lines
as two-phase). The reason is that refresh with copy=true will go
through table-sync again, and since failover is set in the main slot
only after table-sync is done, the main slot would need to go through
the same transition from 'p' to 'e', making it unsyncable for that
time. Should it be allowed?

Currently:
newdb1=# ALTER SUBSCRIPTION mysubnew1_1 REFRESH PUBLICATION WITH (copy_data=true);
ERROR: ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when failover is enabled
HINT: Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false, or use DROP/CREATE SUBSCRIPTION.

Thoughts on the above questions?

thanks
Shveta
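
As a reading aid for the "waiting for remote slot" / "wait over for remote slot" log lines quoted above, here is a minimal Python sketch of the comparison those messages appear to describe: the remote (primary) slot's LSN and catalog xmin must both have reached the local (standby) slot's values. The function names parse_lsn, xid_precedes and remote_has_caught_up are hypothetical and are not taken from the patch; the exact condition used in the patch is an assumption here. The only facts relied on are that an LSN prints as two hexadecimal halves of a 64-bit position and that transaction IDs are compared modulo 2^32.

```python
# Sketch only: the names and the exact condition are illustrative assumptions,
# not code from the v26 patches.

def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex halves) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def xid_precedes(a: int, b: int) -> bool:
    """Wraparound-aware 'a is older than b' for normal 32-bit transaction IDs."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000  # same as: signed 32-bit difference is negative


def remote_has_caught_up(remote_lsn: str, remote_xmin: int,
                         local_lsn: str, local_xmin: int) -> bool:
    """True once the remote slot is no longer behind the local slot."""
    lsn_ok = parse_lsn(remote_lsn) >= parse_lsn(local_lsn)
    xmin_ok = not xid_precedes(remote_xmin, local_xmin)
    return lsn_ok and xmin_ok


# Values copied from the two standby log lines for "mysubnew1_1" above.
print(remote_has_caught_up("0/3C97970", 756, "0/3C979A8", 756))  # prints False
print(remote_has_caught_up("0/3C97A18", 757, "0/3C979A8", 756))  # prints True
```

With the values from the 15:15:46 log line the remote LSN is still behind the local one, so the check fails; with the values from the 15:15:56 line both conditions hold, which matches the "wait over" message.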