Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id da2d3264-7049-48b1-914a-9c8631c8e384@gmail.com
In response to Re: Synchronizing slots from primary to standby  (Ajin Cherian <itsajin@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby
List pgsql-hackers
Hi,

On 10/24/23 7:44 AM, Ajin Cherian wrote:
> On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>           SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
>>       }
>>
>> +   /* set failover in the slot, as requested */
>> +   slot->data.failover = ctx->failover;
>> +
>>
>> I think we can get rid of this change in CreateDecodingContext().
>>
> Yes, I too noticed this in my testing, however just removing this from
> CreateDecodingContext will not allow us to change the slot's failover flag
> using Alter subscription.

Oh right.

> I am thinking of moving this change to
> StartLogicalReplication prior to calling CreateDecodingContext by
> parsing the command options in StartReplicationCmd
> without adding it to the LogicalDecodingContext.
> 

Yeah, that looks like a good place to update "failover".

I did some more testing and have a couple of remarks about the current behavior.

1) Let's imagine that:

- there is no standby
- standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby)
- then a CREATE SUBSCRIPTION on a subscriber WITH (failover = true) would start the
synchronisation but never finish, leaving a "synchronisation" slot such as
"pg_32811_sync_24576_7293415241672430356" (coming from ReplicationSlotNameForTablesync())
in place.

That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when a slot
listed in standby_slot_names is not active (i.e. does not have an active_pid attached to it)?
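
To make the proposed check concrete, here is a minimal standalone sketch. It is not the patch's code: SlotInfo and count_inactive_standby_slots() are hypothetical names, and the slot table is a plain array rather than the shared-memory slot array. It only shows the condition under which a warning would be raised.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a replication slot entry. */
typedef struct SlotInfo
{
    const char *name;
    int         active_pid;     /* 0 means no walsender is attached */
} SlotInfo;

/*
 * Count the names in standby_slots[] that either do not exist in slots[]
 * or exist but have no active_pid.  A real implementation would emit one
 * WARNING per such slot instead of just counting them.
 */
static int
count_inactive_standby_slots(const char *const *standby_slots, size_t nstandby,
                             const SlotInfo *slots, size_t nslots)
{
    int         inactive = 0;

    for (size_t i = 0; i < nstandby; i++)
    {
        bool        active = false;

        for (size_t j = 0; j < nslots; j++)
        {
            if (strcmp(standby_slots[i], slots[j].name) == 0)
            {
                active = (slots[j].active_pid != 0);
                break;
            }
        }

        if (!active)
            inactive++;
    }

    return inactive;
}
```

In scenario 1) above, the single slot listed in standby_slot_names exists but has no active_pid, so the count would be 1 and the warning would fire.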

2) When we create a subscription, another slot is created during the subscription synchronization, named something
like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()).

This extra slot appears to have failover also set to true.

So, if the standby refreshes the list of slots to sync while the subscription is still synchronizing, we'd see things
like this on the standby:

LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756)

LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)

LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756)
 
WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation

I'm not sure this "pg_16397_sync_16388_7293447291374081805" slot should have failover set to true. If a failover
happens during the subscription creation, wouldn't it be better to re-launch the subscription instead?
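
For illustration only, this is a sketch of how such tablesync slots could be recognized by name, assuming the "pg_<suboid>_sync_<relid>_<sysid>" shape seen in the names above. looks_like_tablesync_slot() is a hypothetical helper, not something in the patch, and name matching is just a heuristic since a user could create a slot with a similar name. Not setting failover when the tablesync slot is created would be the more robust approach.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: return true if "name" has the shape
 * pg_<unsigned>_sync_<unsigned>_<unsigned 64-bit>, with nothing after it.
 */
static bool
looks_like_tablesync_slot(const char *name)
{
    unsigned int suboid;
    unsigned int relid;
    unsigned long long sysid;
    int         consumed = 0;

    /* %n records how many characters were consumed; it is not counted. */
    if (sscanf(name, "pg_%u_sync_%u_%llu%n",
               &suboid, &relid, &sysid, &consumed) != 3)
        return false;

    /* Reject names with trailing characters after the system identifier. */
    return name[consumed] == '\0';
}
```

With the names from the log above, "pg_16397_sync_16388_7293447291374081805" would match and "mysub" would not.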

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


