Re: Runaway Initial Table Syncs in Logical Replication? - Mailing list pgsql-general

From Don Seiler
Subject Re: Runaway Initial Table Syncs in Logical Replication?
Date
Msg-id CAHJZqBAkg27MFv9xbMTuJvcL25EhS2mzxynDr7bLFPcdqZZ=ew@mail.gmail.com
Whole thread Raw
In response to RE: Runaway Initial Table Syncs in Logical Replication?  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List pgsql-general
On Thu, Aug 3, 2023 at 9:12 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

Just to confirm, which subversion of PG15 are you using on the subscriber side ?
Is it 15.3 ? I am confirming because I recall we fixed similar bug before in commit [1].

Yes. Specifically on the target/subscriber:

# select version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.3 (Ubuntu 15.3-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
(1 row)


Installed on Ubuntu 22.04 via the PGDG apt repo. 

The source/publisher side is:

# select version();
                                                              version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.15 (Ubuntu 12.15-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
(1 row)


Are there any ERRORs(including erros other than the slots number) in the log of publisher and subscriber.
It would be helpful to see the log in case there are some other ERRORs that could cause problem.

On the subscriber I see a series of errors similar to this:

2023-08-03 18:22:50.196 UTC [239889] LOG:  logical replication table synchronization worker for subscription "sub01", table "users" has started
2023-08-03 18:22:50.209 UTC [239889] ERROR:  could not create replication slot "pg_19742_sync_18602_7263122209699118815": ERROR:  all replication slots are in use
        HINT:  Free one or increase max_replication_slots.

on the publisher I see this corresponding message:

2023-08-03 18:22:50.199 UTC [13884] [unknown]@[unknown] - [unknown] LOG:  connection received: host=10.228.88.74 port=47338
2023-08-03 18:22:50.207 UTC [13884] migrator@foo - [unknown] LOG:  replication connection authorized: user=migrator application_name=pg_19742_sync_18602_7263122209699118815 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 ERROR:  all replication slots are in use
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 HINT:  Free one or increase max_replication_slots.
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 STATEMENT:  CREATE_REPLICATION_SLOT "pg_19742_sync_18602_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:22:50.209 UTC [13884] migrator@foo - pg_19742_sync_18602_7263122209699118815 LOG:  disconnection: session time: 0:00:00.009 user=migrator database=foo host=10.228.88.74 port=47338


and later they include this message about trying to drop them first on the subscriber:

2023-08-03 18:47:39.485 UTC [263502] LOG:  logical replication table synchronization worker for subscription "sub01", table "foo_accounts" has started
2023-08-03 18:47:39.496 UTC [263502] LOG:  could not drop replication slot "pg_19742_sync_17238_7263122209699118815" on publisher: ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.497 UTC [263502] ERROR:  could not create replication slot "pg_19742_sync_17238_7263122209699118815": ERROR:  all replication slots are in use
        HINT:  Free one or increase max_replication_slots.


and on the publisher side for that:

2023-08-03 18:47:39.487 UTC [3232] [unknown]@[unknown] - [unknown] LOG:  connection received: host=10.228.88.74 port=17390
2023-08-03 18:47:39.494 UTC [3232] migrator@foo - [unknown] LOG:  replication connection authorized: user=migrator application_name=pg_19742_sync_17238_7263122209699118815 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:47:39.495 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.495 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 STATEMENT:  DROP_REPLICATION_SLOT pg_19742_sync_17238_7263122209699118815 WAIT
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 ERROR:  all replication slots are in use
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 HINT:  Free one or increase max_replication_slots.
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 STATEMENT:  CREATE_REPLICATION_SLOT "pg_19742_sync_17238_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:47:39.496 UTC [3232] migrator@
foo - pg_19742_sync_17238_7263122209699118815 LOG:  disconnection: session time: 0:00:00.009 user=migrator datab
ase=
foo host=10.228.88.74 port=17390

and then this from a different pid, maybe some kind of cleanup process? It was doing this for what seemed like all of the sync slots:

2023-08-03 18:47:47.975 UTC [3303] migrator@foo - sub01 ERROR:  replication slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:47.975 UTC [3303] migrator@
foo - sub01 STATEMENT:  DROP_REPLICATION_SLOT pg_19742_sync_17238_7263122209699118815 WAIT

--
Don Seiler
www.seiler.us

pgsql-general by date:

Previous
From: Karsten Hilbert
Date:
Subject: Re: question on auto_explain
Next
From: jacktby jacktby
Date:
Subject: How does pg parse 'select '(1,2)''