Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Synchronizing slots from primary to standby
Date:
Msg-id: CAJpy0uCwEr6c5MGzvir1sM1fORFUHPqyX6fMz3LE2_TNr_hw0g@mail.gmail.com
In response to: Re: Synchronizing slots from primary to standby ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses: Re: Synchronizing slots from primary to standby
List: pgsql-hackers
On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 10/27/23 11:56 AM, shveta malik wrote:
> > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 10/25/23 5:00 AM, shveta malik wrote:
> >>> On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
> >>> <bertranddrouvot.pg@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 10/23/23 2:56 PM, shveta malik wrote:
> >>>>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> >>>>> <bertranddrouvot.pg@gmail.com> wrote:
> >>>>
> >>>>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there
> >>>>>> are new synced slot(s) to be created on the standby. Do we want to keep this behavior
> >>>>>> for V1?
> >>>>>>
> >>>>>
> >>>>> I think for the slotsync workers case, we should reduce the naptime in
> >>>>> the launcher to say 30sec and retain the default one of 3mins for
> >>>>> subscription apply workers. Thoughts?
> >>>>>
> >>>>
> >>>> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> >>>> API on the standby that would refresh the list of sync slots at wish, thoughts?
> >>>>
> >>>
> >>> Do you mean an API to refresh the list of DBIDs rather than sync-slots?
> >>> As per the current design, the launcher gets the DBID lists for all the failover
> >>> slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
> >>
> >> I mean an API to get a newly created slot on the primary being created/synced on
> >> the standby at wish.
> >>
> >> Also let's imagine this scenario:
> >>
> >> - create logical_slot1 on the primary (and don't start using it)
> >>
> >> Then on the standby we'll get things like:
> >>
> >> 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot LSN (0/C0049530) and catalog xmin (754)
> >>
> >> That's expected and due to the fact that ReplicationSlotReserveWal() sets the slot's
> >> restart_lsn to a value < the corresponding slot's restart_lsn on the primary.
> >>
> >> - create logical_slot2 on the primary (and start using it)
> >>
> >> Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary
> >> that would produce things like:
> >>
> >> 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754)
> >
> > Slight correction to the above. As soon as we start activity on
> > logical_slot2, it will impact all the slots on the primary, as the WALs
> > are consumed by all the slots. So even if the activity is on
> > logical_slot2, logical_slot1 creation on the standby will be unblocked and
> > it will then move on to logical_slot2 creation. E.g.:
> >
> > --on standby:
> > 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot
> > "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local
> > slot LSN (0/3C979A8) and catalog xmin (756)
> >
> > --on primary:
> > newdb1=# select now();
> >                now
> > ----------------------------------
> >  2023-10-27 15:15:51.504835+05:30
> > (1 row)
> >
> > --activity on mysubnew1_3:
> > newdb1=# insert into tab1_3 values(1);
> > INSERT 0 1
> > newdb1=# select now();
> >                now
> > ----------------------------------
> >  2023-10-27 15:15:54.651406+05:30
> >
> > --on standby, mysubnew1_1 is unblocked:
> > 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot
> > "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now
> > passed local slot LSN (0/3C979A8) and catalog xmin (756)
> >
> > My setup:
> > mysubnew1_1 --> mypubnew1_1 --> tab1_1
> > mysubnew1_3 --> mypubnew1_3 --> tab1_3
>
> I agree with your test case, but in my case I was not using pub/sub.
>
> I was not clear, so when I said:
>
> >> - create logical_slot1 on the primary (and don't start using it)
>
> I meant don't start decoding from it (like using pg_recvlogical or
> pg_logical_slot_get_changes()).
>
> By using pub/sub, the "don't start using it" condition is not satisfied.
>
> My test case is:
>
> "
> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true);
> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true);
> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
> "

Okay, I am able to reproduce it now. Thanks for the clarification.

I have tried to change the algorithm as per the suggestion by Amit in [1].

[1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com

This is not a foolproof solution, but it is an optimization over the first
one. Now, in any sync-cycle, we take two attempts at slot creation (if any
slots are available to be created). In the first attempt, we do not wait
indefinitely on inactive slots; we wait only for a fixed amount of time,
and if the remote slot is still behind, we add it to the pending list and
move on to the next slot. Once we are done with the first attempt, in the
second attempt we go through the pending ones, and this time we wait on
each of them until the primary catches up.

> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
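The two-attempt cycle described above can be sketched roughly as follows. This is Python pseudocode for illustration only, not the patch's actual C implementation: the names (`sync_cycle`, `wait_for_catchup`, `remote_slot_caught_up`), the 10-second first-attempt bound, and the callback shape are all assumptions. The catch-up predicate mirrors the "waiting for remote slot ... LSN ... and catalog xmin ... to pass local slot ..." log messages quoted earlier in the thread.

```python
def remote_slot_caught_up(remote_lsn, remote_catalog_xmin,
                          local_lsn, local_catalog_xmin):
    # Mirrors the log messages above: both the remote slot's LSN and its
    # catalog xmin must have passed the local slot's values before the
    # synced slot can be created on the standby.
    return (remote_lsn >= local_lsn
            and remote_catalog_xmin >= local_catalog_xmin)

def sync_cycle(remote_slots, wait_for_catchup, first_attempt_timeout_ms=10_000):
    """Two-pass slot creation: bounded waits first, then indefinite waits
    on whatever is still pending. wait_for_catchup(slot, timeout_ms) is a
    hypothetical callback returning True once the remote slot has caught
    up (or False if the bounded wait expired first)."""
    pending, created = [], []

    # Attempt 1: do not block indefinitely on inactive remote slots.
    for slot in remote_slots:
        if wait_for_catchup(slot, timeout_ms=first_attempt_timeout_ms):
            created.append(slot)
        else:
            pending.append(slot)  # remote slot still behind; retry below

    # Attempt 2: wait on each pending slot until the primary catches up.
    for slot in pending:
        wait_for_catchup(slot, timeout_ms=None)  # no timeout this time
        created.append(slot)

    return created
```

The point of the bounded first pass is exactly the scenario discussed above: an inactive slot such as logical_slot1 no longer blocks the creation of an active one such as logical_slot2 for the whole cycle; it is simply deferred to the second pass.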