Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From shveta malik
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAJpy0uCexeM-9dVe4eWgjj7G5MuNqc6CFRcrHu8fjmPmtbSZ-Q@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
List pgsql-hackers
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 10/24/23 7:44 AM, Ajin Cherian wrote:
> > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> >>           SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
> >>       }
> >>
> >> +   /* set failover in the slot, as requested */
> >> +   slot->data.failover = ctx->failover;
> >> +
> >>
> >> I think we can get rid of this change in CreateDecodingContext().
> >>
> > Yes, I too noticed this in my testing, however just removing this from
> > CreateDecodingContext will not allow us to change the slot's failover flag
> > using Alter subscription.
>
> Oh right.
>
> > I am thinking of moving this change to
> > StartLogicalReplication prior to calling CreateDecodingContext by
> > parsing the command options in StartReplicationCmd
> > without adding it to the LogicalDecodingContext.
> >
>
> Yeah, that looks like a good place to update "failover".
>
> Doing more testing and I have a couple of remarks about he current behavior.
>
> 1) Let's imagine that:
>
> - there is no standby
> - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby)
> - then a create subscription on a subscriber WITH (failover = true) would start the
> synchronisation but never finish (means leaving a "synchronisation" slot like
"pg_32811_sync_24576_7293415241672430356"
> in place coming from ReplicationSlotNameForTablesync()).
>
> That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there
is
> a slot part of standby_slot_names which is not active/does not have an active_pid attached to it?
>

Agreed, Will do that.

> 2) When we create a subscription, another slot is created during the subscription synchronization, namely
> like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()).
>
> This extra slot appears to have failover also set to true.
>
> So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like
> on the standby:
>
> LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and
andcatalog xmin (756) 
> LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN
(0/C0034840)and catalog xmin (756) 
> LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to
passlocal slot LSN (0/C00368E8) and and catalog xmin (756) 
> WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation
>
> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover
> during the subscription creation, better to re-launch the subscription instead?
>

'Failover' property of subscription is carried to the tablesync-slot
in recent versions only with the intent that if failover happens
during create-sub during table-sync of large tables, then users should
be able to start from that point onward on the new primary. But yes,
the above scenario is highly probable where-in no activity is
happening on primary and thus the table-sync slot is waiting for its
creation during sync on standby. So I agree, to simplify the stuff we
can skip table-sync slot syncing on standby and document this
behaviour.

thanks
Shveta



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Next
From: Peter Geoghegan
Date:
Subject: Re: post-recovery amcheck expectations