Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Synchronizing slots from primary to standby
Msg-id ZYWdSIeAMQQcLmVT@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
Hi,

On Fri, Dec 22, 2023 at 04:02:21PM +0530, shveta malik wrote:
> PFA v53. Changes are:

Thanks!

> patch002:
> 2) Addressed comments in [2] for v52-002.
> 3) Fixed CFBot failure. The failure was caused by an assert in
> wait_for_primary_slot_catchup() for null confirmed_lsn received. In
> wait_for_primary_slot_catchup(), we had an assumption that if
> restart_lsn is valid and 'conflicting' is also false, then we must
> have non-null confirmed_lsn. But this is not true. It is possible to
> get null values for confirmed_lsn and catalog_xmin if on the primary
> server the slot is just created with a valid restart_lsn and slot-sync
> worker has fetched the slot before the primary server could set valid
> confirmed_lsn and catalog_xmin. In
> pg_create_logical_replication_slot(), there is a small window between
> CreateInitDecodingContext-->ReplicationSlotReserveWal() which sets
> restart_lsn and DecodingContextFindStartpoint() which sets
> confirmed_lsn. If the slot-sync worker fetches the slot in this
> window, confirmed_lsn received will be NULL. Corrected the code to
> remove assert and added one additional condition that confirmed_lsn
> should be valid before moving the slot to 'r'.
> 

Looking at the v53-0002 commit message, it states:

"
If a logical slot on the primary is valid but is invalidated on the standby,
then that slot is dropped and recreated on the standby in next sync-cycle.
"

and one of the reasons mentioned is:

"
    - The primary changes wal_level to a level lower than logical.
"

I think that as long as there is still a logical replication slot on the primary,
that should not be possible. The primary should fail to start with a message like:

"
2023-12-22 14:06:09.281 UTC [31824] FATAL:  logical replication slot "logical_slot" exists, but wal_level < logical
"
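
To make that startup behavior concrete, here is a minimal Python model of the
check. It is only a sketch, not PostgreSQL code: the function name and the
numeric ordering of the levels are assumptions made for illustration, and only
the message text is taken from the log line above.

```python
# Illustrative model only; the real check is done by the server at startup.
# The numeric ranks are an assumption used to express "wal_level < logical".
WAL_LEVELS = {"minimal": 0, "replica": 1, "logical": 2}


def check_slots_at_startup(wal_level, logical_slots):
    """Return the FATAL message if a logical slot exists while
    wal_level is below logical, otherwise None (startup proceeds)."""
    if WAL_LEVELS[wal_level] < WAL_LEVELS["logical"]:
        for name in logical_slots:
            # The first offending slot is enough to refuse startup.
            return (
                f'FATAL:  logical replication slot "{name}" exists, '
                "but wal_level < logical"
            )
    return None
```

In this model, lowering wal_level only succeeds once the list of logical slots
is empty, which is the point made above.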

Now, if:

- The standby is shut down
- All the logical replication slots are removed on the primary
- wal_level is set below logical on the primary and the primary is restarted

Then, when the standby starts, the "synced" slots will be invalidated and later
removed, but not re-created in the next sync-cycle (because they no longer exist
on the primary).
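
The difference between the two cases can be sketched with a small Python model
of one sync-cycle. This is a simplification under stated assumptions, not the
patch's code: sync_cycle and its data shapes are hypothetical, and the only
rules encoded are the ones quoted from the commit message and the scenario
described above.

```python
def sync_cycle(primary_slots, standby_slots):
    """Model one sync-cycle on the standby (hypothetical helper).

    primary_slots: set of logical slot names that exist on the primary.
    standby_slots: dict mapping synced slot name -> True if the slot
                   is invalidated on the standby.
    Returns (standby slots after the cycle, list of actions taken).
    """
    result = {}
    actions = []
    for name, invalidated in standby_slots.items():
        if name not in primary_slots:
            # Slot is gone on the primary: drop it, nothing to recreate.
            actions.append(("drop", name))
            continue
        if invalidated:
            # Valid on the primary but invalidated locally:
            # drop and recreate, as the commit message describes.
            actions.append(("drop", name))
            actions.append(("recreate", name))
        result[name] = False
    return result, actions


# Scenario above: every logical slot was removed on the primary first.
print(sync_cycle(set(), {"logical_slot": True}))
# Commit-message case: the slot still exists on the primary.
print(sync_cycle({"logical_slot"}, {"logical_slot": True}))
```

In the first call the slot is only dropped; in the second it is dropped and
recreated. Given the startup failure shown earlier, only the first path seems
reachable when wal_level is lowered on the primary.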

Would it be worth rewording that part a bit?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


