Thread: Re: Replication slot is not able to sync up

Re: Replication slot is not able to sync up

From
Amit Kapila
Date:
On Fri, May 23, 2025 at 9:57 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
Hi,

Noticed below behaviour where replication slot is not able to sync up if any catalog changes happened after the creation.
Getting below LOG when trying to sync replication slots using pg_sync_replication_slots() function. 
The newly created slot does not appear on the standby after this LOG -

2025-05-23 07:57:12.453 IST [4178805] LOG:  could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL:  The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT:  SELECT pg_sync_replication_slots();

Below is the test case tried on latest master branch -
=========
- Create the Primary and start the server.
wal_level = logical

- Create the physical slot on Primary.
SELECT pg_create_physical_replication_slot('slot1');

- Setup the standby using pg_basebackup.
bin/pg_basebackup -D data1 -p 5418 -d "dbname=postgres" -R

primary_slot_name = 'slot1'
hot_standby_feedback = on
port = 5419

-- Start the standby.

-- Connect to Primary and create a logical replication slot.
SELECT pg_create_logical_replication_slot('failover_slot', 'pgoutput', false, false, true);

postgres@4177929=#select xmin,* from pg_replication_slots ;
 xmin |   slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phas
e | two_phase_at |          inactive_since          | conflicting | invalidation_reason | failover | synced
------+---------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+---------
--+--------------+----------------------------------+-------------+---------------------+----------+--------
  765 | slot1         |          | physical  |        |          | f         | t      |    4177898 |  765 |              | 0/B018B00   |                     | reserved   |               | f      
  |              |                                  |             |                     | f        | f
      | failover_slot | pgoutput | logical   |      5 | postgres | f         | f      |            |      |          764 | 0/B000060   | 0/B000098           | reserved   |               | f      
  |              | 2025-05-23 07:55:31.277584+05:30 | f           |                     | t        | f
(2 rows)

-- Perform some catalog changes. e.g.:
create table abc(id int);
postgres@4179034=#select xmin from pg_class where relname='abc';
 xmin
------
  764
(1 row)

-- Connect to the standby and try to sync the replication slots.
SELECT pg_sync_replication_slots();

In the logfile, can see below LOG -
2025-05-23 07:57:12.453 IST [4178805] LOG:  could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL:  The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT:  SELECT pg_sync_replication_slots();

select xmin,* from pg_replication_slots ;
no rows

Primary -
postgres@4179034=#select xmin,* from pg_replication_slots ;
 xmin |   slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phas
e | two_phase_at |          inactive_since          | conflicting | invalidation_reason | failover | synced
------+---------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+---------
--+--------------+----------------------------------+-------------+---------------------+----------+--------
  765 | slot1         |          | physical  |        |          | f         | t      |    4177898 |  765 |              | 0/B018C08   |                     | reserved   |               | f      
  |              |                                  |             |                     | f        | f
      | failover_slot | pgoutput | logical   |      5 | postgres | f         | f      |            |      |          764 | 0/B000060   | 0/B000098           | reserved   |               | f      
  |              | 2025-05-23 07:55:31.277584+05:30 | f           |                     | t        | f
(2 rows)
=========
 
Is there any way to sync up the replication slot after the catalog changes have been made after creation?

The remote_slot (slot on primary) should be advanced before you invoke sync_slot. Can you do pg_logical_slot_get_changes() API before performing sync? You can check the xmin of the logical slot after get_changes to ensure that xmin has moved to 765 in your case.

--
With Regards,
Amit Kapila.

Re: Replication slot is not able to sync up

From
Robert Haas
Date:
On Fri, May 23, 2025 at 12:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The remote_slot (slot on primary) should be advanced before you invoke sync_slot. Can you do
pg_logical_slot_get_changes()API before performing sync? You can check the xmin of the logical slot after get_changes
toensure that xmin has moved to 765 in your case. 

I'm fairly dismayed by this example. I hope I'm misunderstanding
something, because otherwise I have difficulty understanding how we
thought it was OK to ship this feature in this condition.

At the moment that pg_sync_replication_slots() is executed, a slot
named failover_slot exists on only one of the two servers. How can you
justify emitting an error message complaining that "remote slot
precedes local slot"? There's only one slot! I understand that, under
the hood, we probably created an additional slot on the standby and
then tried to fast-forward it, and this error occurred in the second
step. But a user shouldn't have to understand those kinds of internal
implementation details to make sense of the error message. If the
problem is that we're not able to create a slot on the standby at an
old enough LSN or XID position to permit its use with the
corresponding slot on the master, it should be reported that way.

It also seems like having to execute a manual step like
pg_logical_slot_get_changes() in order for things to work is really
missing the point of the feature. I mean, it seems like the intention
of the feature was that someone can just periodically call
pg_sync_replication_slots() on each standby and the right things will
happen -- creating slots or fast-forwarding them or dropping them, as
required. But if that sometimes requires manual fiddling like having
to consume changes from a slot then basically the feature just doesn't
work, because now the user will have to somehow understand when that
is required and what they need to do to fix it. This doesn't even seem
like a particularly obscure case.

To be honest, even after spending quite a bit of time on this, I still
don't really understand what's happening with the xmins here. Just
after creating the logical slot on the primary, it has xmin 764 on one
slot and xmin 765 on the other, and I don't understand why that's the
case, nor why the extra DDL command is needed to trigger the problem.
But I also can't shake the feeling that I shouldn't *need* to
understand that stuff to use the feature. Isn't that the whole point?

--
Robert Haas
EDB: http://www.enterprisedb.com