Re: Local replication "slot does not exist" after initial sync - Mailing list pgsql-general

From Justin
Subject Re: Local replication "slot does not exist" after initial sync
Date
Msg-id CALL-XePzMZt3G-5A4kJivq9tvEjVsZT_0VSC1bB_U81yvThdiQ@mail.gmail.com
Whole thread Raw
In response to Re: Local replication "slot does not exist" after initial sync  (Mike Lissner <mlissner@michaeljaylissner.com>)
List pgsql-general

On Sun, Feb 25, 2024 at 1:11 PM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
Sorry, two more little things here. The publisher logs add much, but here's what we see:

STATEMENT: START_REPLICATION SLOT "pg_20031_sync_17418_7324846428853951375" LOGICAL F1D0/346C6508 (proto_version '2', publication_names '"compass_publication2"')
ERROR: replication slot "pg_20031_sync_17402_7324846428853951375" does not exist

And I thought that maybe there'd be some magic in the REFRESH command on the subscriber, so I tried that:

alter subscription xyz refresh publication;

To nobody's surprise, that didn't help. :)


On Sun, Feb 25, 2024 at 10:00 AM Mike Lissner <mlissner@michaeljaylissner.com> wrote:
Hi, I set up logical replication a few days ago, but it's throwing some weird log lines that have me worried. Does anybody have experience with lines like the following on a subscriber:

LOG: logical replication table synchronization worker for subscription "compass_subscription", table "search_opinionscitedbyrecapdocument" has started
ERROR: could not start WAL streaming: ERROR: replication slot "pg_20031_sync_17418_7324846428853951375" does not exist
LOG: background worker "logical replication worker" (PID 1014) exited with exit code 1

Slots with this kind of name (pg_xyz_sync_*) are created during the initial sync, but it seems like the subscription is working based on a quick look in a few tables.

I thought this might be related to running out of slots on the publisher, so I increased both max_replication_slots and max_wal_senders to 50 and rebooted so those would take effect. No luck.

I thought rebooting the subscriber might help. No luck.

When I look in the publisher to see the slots we have...

SELECT * FROM pg_replication_slots;

...I do not see the one that's missing according to the log lines.

So it seems like the initial sync might have worked properly (tables have content), but that I have an errant process on the subscriber that might be stuck in a retry loop.

I haven't been able to fix this, and I think my last attempt might be a new subscription with copy_data=false, but I'd rather avoid that if I can.

Is there a way to fix or understand this so that I don't get the log lines forever and so that I can be confident the replication is in good shape?

Thank you!


Mike

Hi Mike
This happens when there is an error with another sync worker,  PG has issues when this happens, the Logical Replication Main Worker  attempts to sync another table, creates a slots, then tries to sync that table,  then errors out as it is out of sync workers,  then attempts to sync another table. the Subscriber is then stuck in a loop  using up all the slots on the publisher,.  The subscriber continues to cycle through all the tables that are not status r in the pg_subscription_rel table.  Once there no slots available on the publisher the subscriber attempts to create slots, fails and returns the above errors.

These are not critical errors however it is difficult to clean   up as every time starting and stopping the subscriber it immediately retries to initial sync the tables it has slots for.  So it keeps using  up all the publisher slots.

You need to look through the logs looking for the initial error that kicked this off,  it can be a duplicate key or the subscriber or publisher disconnected due to wal_reciever_timeout or wal_sender_timerout.   

I typical disable all the timeouts as it does not take much to cause the subscriber and publisher to think the connection has timed out.

A tell tale sign of this is the pg_subscription_rel status will be d for a bunch of tables.  There should only be few tables that are status d,

I only attempt to clean this up when working on big databases and a lot of tables have already been sync.   

Need to figure out the initial cause of the problem, correct that error and attempt an initial sync again.

  . 

pgsql-general by date:

Previous
From: Mike Lissner
Date:
Subject: Re: Local replication "slot does not exist" after initial sync
Next
From: jian he
Date:
Subject: Re: Documentation diff