RE: Excessive number of replication slots for 12->14 logical replication - Mailing list pgsql-bugs

From Zhijie Hou (Fujitsu)
Subject RE: Excessive number of replication slots for 12->14 logical replication
Date
Msg-id OS0PR01MB57165FF10C478BFBB837696A94762@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Excessive number of replication slots for 12->14 logical replication  (vignesh C <vignesh21@gmail.com>)
Responses Re: Excessive number of replication slots for 12->14 logical replication  (Bowen Shi <zxwsbg12138@gmail.com>)
List pgsql-bugs
On Saturday, January 20, 2024 12:40 AM vignesh C <vignesh21@gmail.com> wrote:

Hi,

> 
> On Thu, 18 Jan 2024 at 13:00, Bowen Shi <zxwsbg12138@gmail.com> wrote:
> >
> > Dears,
> >
> > I encountered a similar problem when I used logical replication to replicate
> > databases from pg 16 to pg 16.
> >
> > I started 3 subscriptions in parallel, and the subscriber's postgresql.conf is
> > as follows:
> > max_replication_slots = 10
> > max_sync_workers_per_subscription = 2
> >
> > However, after 3 minutes, I found three COPY errors in subscriber:
> > "error while shutting down streaming COPY: ERROR:  could not find record
> > while sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx"
> > Then, the subscriber began to print a large number of errors: "could not find
> > free replication state slot for replication origin with ID 11, Increase
> > max_replication_slots and try again."
> >
> > And the publisher was full of pg_xxx_sync_xxxxxxx slots, printing lots of "all
> > replication slots are in use, Free one or increase max_replication_slots."
> >
> > This question is very similar to
> > https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com .
> > When the table sync worker encounters an error and exits while copying
> > a table, the replication origin will not be deleted. And new table sync workers
> > would create sync slots in the publisher and then exit without dropping them.
> 
> I tried various tests with the suggested configuration, but I did not hit this
> scenario. I was able to simulate this problem with a lower value of
> max_replication_slots, but the behavior is as expected in that case.
> If you have a test case or logs for this, could you please share them? It will be
> easier to reconstruct the sequence of events and get a clear picture
> of what is happening.

I think the reason for these origin/slot ERRORs could be that the table sync worker
does not drop the origin and slot on ERROR (the table sync worker only drops these
after finishing the sync, in process_syncing_tables_for_sync).

So, if one table sync worker exits due to an ERROR, the apply worker may try
to start more workers while the origin of the previously failed table sync
worker has not yet been dropped, causing a bunch of origin/slot ERRORs.

If the above reason is correct, maybe we could somehow drop the origin and
slots on ERROR exit as well, although it needs some analysis.
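
To make the suspected sequence concrete, here is a toy model of it. This is only
a sketch of the hypothesis above, not PostgreSQL code: OriginTable, run_sync,
simulate and the drop_on_error flag are hypothetical names, and the single
capacity counter is an assumed stand-in for the max_replication_slots limit.

```python
class SlotsExhausted(Exception):
    """Toy stand-in for the 'could not find free replication state slot' ERROR."""


class OriginTable:
    """Hypothetical model: a fixed number of entries shared by all sync workers."""

    def __init__(self, max_replication_slots):
        self.capacity = max_replication_slots
        self.in_use = set()

    def acquire(self, table):
        if table in self.in_use:
            return  # a retry for the same table reuses its existing entry
        if len(self.in_use) >= self.capacity:
            raise SlotsExhausted(table)
        self.in_use.add(table)

    def release(self, table):
        self.in_use.discard(table)


def run_sync(origins, table, fails, drop_on_error):
    """One table sync attempt; returns True if the table was synced."""
    origins.acquire(table)
    if fails:
        # Assumed current behaviour: nothing is dropped on the ERROR path.
        if drop_on_error:
            origins.release(table)  # the idea floated above, not an existing option
        return False
    origins.release(table)  # cleanup happens only after a finished sync
    return True


def simulate(tables, failing, max_replication_slots, drop_on_error):
    """Start one sync per table in order; stop at the first exhaustion error."""
    origins = OriginTable(max_replication_slots)
    synced = []
    for table in tables:
        try:
            if run_sync(origins, table, table in failing, drop_on_error):
                synced.append(table)
        except SlotsExhausted:
            break
    return synced, len(origins.in_use)


tables = ["t0", "t1", "t2", "t3", "t4", "t5"]
print(simulate(tables, {"t0", "t1"}, 2, drop_on_error=False))
print(simulate(tables, {"t0", "t1"}, 2, drop_on_error=True))
```

In this model, two failed syncs with a capacity of 2 are enough to block every
later table, while releasing on the error path lets the remaining tables go
through. Whether the real workers behave this way is exactly what needs analysis.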

BTW, as for the initial root ERROR ("COPY: ERROR:  could not find record while
sending logically-decoded data: missing contrecord at xxxx/xxxxxxxxx"), which
leads to the subsequent slot/origin ERRORs, I am not sure what would cause it.

As Vignesh mentioned, it would be better to provide the log files from both the
publisher and the subscriber for further analysis.

Best Regards,
Hou zj
