Re: BUG #18789: logical replication slots are deleted after failovers - Mailing list pgsql-bugs

From Sachin Konde-Deshmukh
Subject Re: BUG #18789: logical replication slots are deleted after failovers
Date
Msg-id CAHkC50uxcgmTDL30J8j+i_uFGtK2=0qb2idkMB2dY4U6taeL9Q@mail.gmail.com
Whole thread Raw
In response to BUG #18789: logical replication slots are deleted after failovers  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
Hi  Hayato,

For better understanding, can you clarify 1) network configuration you created
and 2) actual nodes queries were run?
Four instances are needed to do a failover third time, but not sure how they connected.

Only 2 nodes are in cluster,

I am using Patroni on these two nodes.

Below is the configuration in patroni.yml file
-------------------------------------------------
patrony.yml --> 

namespace: PostgreSQL_Cluster
scope: postgresHA
name: psoel89pgcluster01

restapi:
   listen: 0.0.0.0:8008
   connect_address: 10.3.82.14:8008

etcd3:
   hosts: 10.3.82.14:2379,10.3.82.15:2379,10.3.82.16:2379

bootstrap:
# this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
dcs:
     ttl: 30
     loop_wait: 10
     retry_timeout: 10
     maximum_lag_on_failover: 1048576

     postgresql:
         use_pg_rewind: true
         use_slots: true
         slots:
           mysub:
             type: logical
             database: postgres
             plugin: pgoutput
             failover: true
             synced: true
         parameters:
             wal_level: logical
             hot_standby: 'on'
             wal_keep_segments: 10
             max_wal_senders: 5
             max_replication_slots: 10
             wal_log_hints: 'on'
             sync_replication_slots: 'on'
             hot_standby_feedback: 'on'
             max_logical_replication_workers: 20
             logging_collector: 'on'
             max_wal_size: '10GB'
             archive_mode: 'on'
             archive_timeout: 600s
             archive_command: cp -f %p /home/postgres/archived/%f
             shared_preload_libraries: 'pg_failover_slots'
# some desired options for 'initdb'
initdb: # Note: It needs to be a list (some options need values, others are switches)
     - encoding: UTF8
     - data-checksums

pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
     - host replication replicator 127.0.0.1/32 trust
     - host replication replicator 0.0.0.0/0 trust
     - host all all 0.0.0.0/0 trust
     - host all all ::0/0 trust

# Some additional users which needs to be created after initializing new cluster
users:
     admin:
         password: ******
         options:
             - createrole
             - createdb
     splex:
         password: *****
         options:
             - createrole
             - createdb
postgresql:
   listen: 0.0.0.0:5432
   connect_address: 10.3.82.14:5432
   data_dir: /u02/app/postgres/17/data
   bin_dir: /usr/pgsql-17/bin
   pgpass: /tmp/pgpass0
   authentication:
       replication:
           username: replicator
           password: **********
       superuser:
           username: postgres
           password: ********
   parameters:
       unix_socket_directories: /var/run/postgresql/
       primary_conninfo: 'host=10.3.82.15 port=5432 user=replicator password=******* dbname=postgres application_name=%p'
   create_replica_methods:
       - basebackup
   basebackup:
       checkpoint: 'fast'
#    replication:
   slots:
     mysub:
       type: logical
       database: postgres
       plugin: pgoutput
       failover: true
       synced: true
tags:
   nofailover: false
   noloadbalance: false
   clonefrom: false
   nosync: false
hooks:
on_failover: "/etc/patroni/set_all_logical_slot_status.sh"
on_switchover: "/etc/patroni/set_all_logical_slot_status.sh"
on_start: "/etc/patroni/set_all_logical_slot_status.sh"
on_stop: "/etc/patroni/set_all_logical_slot_status.sh"
-------------------------------------
/etc/patroni/set_all_logical_slot_status.sh
#!/bin/bash

# PostgreSQL host and credentials (if needed)

PG_HOST="10.3.82.17"
PG_PORT="5000"
PG_USER="postgres"

# Connect to PostgreSQL and list all logical replication slots
SLOT_LIST=$(psql -U $PG_USER -h $PG_HOST -p $PG_PORT -t -c "SELECT slot_name, active, pg_is_in_recovery() FROM pg_replication_slots;")

# Loop through each slot and update the failover and synced status if necessary
while IFS='|' read -r slot_name active is_in_recovery; do
   # Check if the slot is active and if the failover and synced status needs to be updated
   if [[ "$active" == "t" && "$is_in_recovery" == "f" ]]; then
       echo "Checking slot: $slot_name"
       # Check if the failover and synced status is false (we are assuming you have custom logic for this)
       # Update the status if necessary, example query to update
       psql -U $PG_USER -h $PG_HOST -p $PG_PORT -c "SELECT pg_create_logical_replication_slot('$slot_name', 'pgoutput', true);"
       echo "Updated slot $slot_name to TRUE for failover and synced"
   fi
done <<< "$SLOT_LIST"

exit 0

--------------------------------------------------
I have added the script set_all_logical_slot_status.sh later to update the status of slot but its not helping as well.


Hi Amit,

This data appears suspicious to me because we shouldn't be changing
the 'failover' property of the slot. Also, note that physical slot
names are different in the above two results. Either the user has
changed the 'failover' property of the slot or the above data is from
completely two different clusters one where the 'failover' property
for the slot is enabled and another where it is not enabled.


We are not changing the property explicitly. This is getting changed after 2/3 failover operations.
For very first attempt it works as expected.

In my cluster setup we only have two nodes. Primary and Standby.
The other VM which is subscriber is not part of cluster setup.





-------------------------------------------------------------------
Sachin Konde

9762777853


On Sat, Feb 1, 2025 at 11:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 30, 2025 at 12:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 29, 2025 at 7:01 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference:      18789
> > Logged by:          Sachin Konde-Deshmukh
> > Email address:      sachinkonde3@gmail.com
> > PostgreSQL version: 17.2
> > Operating system:   Oracle Linux 8.9
> > Description:
> >
> > We are using 2 node PostgreSQL 17 HA setup using Patroni 4.0.4.
> > When I do failover 2nd or third time or more than once, it fails to transfer
> > or move logical replication slot to new Primary.
> > postgres=# select slot_name,slot_type, failover,
> > synced,confirmed_flush_lsn,active from pg_replication_slots;
> >      slot_name      | slot_type | failover | synced | confirmed_flush_lsn |
> > active
> > --------------------+-----------+----------+--------+---------------------+--------
> > psoel89pgcluster01 | physical  | f        | f      |                     |
> > t
> > mysub              | logical   | t        | t      | 0/4000AB8           |
> > t
> > (2 rows)
>
> I guess that this is the list of slots on the primary.
>
> > After First Failover -->
> > postgres=# select slot_name,slot_type, failover,
> > synced,confirmed_flush_lsn,active from pg_replication_slots;
> >      slot_name      | slot_type | failover | synced | confirmed_flush_lsn |
> > active
> > --------------------+-----------+----------+--------+---------------------+--------
> > psoel89pgcluster02 | physical  | f        | f      |                     |
> > t
> > mysub              | logical   | f        | f      | 0/50001E0           |
> > t
> > (2 rows)
>
> I guess that this is the list of slots on the new primary after a
> failover.
>

This data appears suspicious to me because we shouldn't be changing
the 'failover' property of the slot. Also, note that physical slot
names are different in the above two results. Either the user has
changed the 'failover' property of the slot or the above data is from
completely two different clusters one where the 'failover' property
for the slot is enabled and another where it is not enabled.

--
With Regards,
Amit Kapila.

pgsql-bugs by date:

Previous
From: Álvaro Herrera
Date:
Subject: Re: Bug in psql
Next
From: PG Bug reporting form
Date:
Subject: BUG #18792: Segmentaion Fault error when changing new parameter synchronized_standby_slots