Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uCexeM-9dVe4eWgjj7G5MuNqc6CFRcrHu8fjmPmtbSZ-Q@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>) |
List | pgsql-hackers |
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > >> } > >> > >> + /* set failover in the slot, as requested */ > >> + slot->data.failover = ctx->failover; > >> + > >> > >> I think we can get rid of this change in CreateDecodingContext(). > >> > > Yes, I too noticed this in my testing, however just removing this from > > CreateDecodingContext will not allow us to change the slot's failover flag > > using Alter subscription. > > Oh right. > > > I am thinking of moving this change to > > StartLogicalReplication prior to calling CreateDecodingContext by > > parsing the command options in StartReplicationCmd > > without adding it to the LogicalDecodingContext. > > > > Yeah, that looks like a good place to update "failover". > > Doing more testing and I have a couple of remarks about he current behavior. > > 1) Let's imagine that: > > - there is no standby > - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby) > - then a create subscription on a subscriber WITH (failover = true) would start the > synchronisation but never finish (means leaving a "synchronisation" slot like "pg_32811_sync_24576_7293415241672430356" > in place coming from ReplicationSlotNameForTablesync()). > > That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there is > a slot part of standby_slot_names which is not active/does not have an active_pid attached to it? > Agreed, Will do that. > 2) When we create a subscription, another slot is created during the subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). > > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and andcatalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840)and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to passlocal slot LSN (0/C00368E8) and and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription instead? > 'Failover' property of subscription is carried to the tablesync-slot in recent versions only with the intent that if failover happens during create-sub during table-sync of large tables, then users should be able to start from that point onward on the new primary. But yes, the above scenario is highly probable where-in no activity is happening on primary and thus the table-sync slot is waiting for its creation during sync on standby. So I agree, to simplify the stuff we can skip table-sync slot syncing on standby and document this behaviour. thanks Shveta
pgsql-hackers by date: