Thread: Logical replication stops dropping used initial-sync replication slots

Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski
Hi,
I have the following case:

Primary: Ubuntu Bionic with Pg 12.9, and ~ 60k tables.
Replica: Ubuntu Focal with Pg 14.2.

Settings are:
 max_wal_senders                   │ 50
 wal_sender_timeout                │ 60000
 max_sync_workers_per_subscription │ 10
 max_worker_processes              │ 50
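
As a generic sanity check (standard catalogs, nothing specific to this setup), slot usage can be compared against max_replication_slots roughly like this:

select count(*) as slots_in_use,
       current_setting('max_replication_slots') as max_replication_slots
  from pg_replication_slots;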

I made publication using:

create publication focal14;

select format('ALTER PUBLICATION focal14 ADD TABLE ONLY %I.%I;', n.nspname, c.relname)
  from pg_class c
  join pg_namespace n on c.relnamespace = n.oid
 where n.nspname ~ 'shard'
   and c.relkind in ('p', 'r')
 order by n.nspname, c.relname \gexec
 

Worked.
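
(To double-check that the publication really picked the tables up, a generic query against pg_publication_tables can be used; the count should match the number of tables added:)

select count(*) from pg_publication_tables where pubname = 'focal14';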

Then on the replica I did:

create subscription focal14
    connection 'host=${source_ip} port=5432 user=lrep_user dbname=${dbname}'
    publication focal14;

This started the initial sync.
It was happily working with 10 concurrent workers, but after roughly an
hour I started to get this in the logs on the replica:

2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could not find free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try again.",,,,,,,"","logical replication worker",,0

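If I read the hint right, this error is about replication origins on the subscriber (I believe on Pg 14 each table sync keeps its own origin), and those are capped by max_replication_slots on the replica side. A generic way to see how many origins are in use there:

select count(*) as origins_in_use,
       current_setting('max_replication_slots') as max_replication_slots
  from pg_replication_origin;
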
And sure enough, there are MANY replication slots on the primary:

                  slot_name                  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319873 | 8B7/89D938C0 | 8B7/93FD2380
 pg_4962101_sync_4663369_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14139 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4794836_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160320058 | 8B8/27FE6668 | 8B8/27FE66A0
 pg_4962101_sync_4613771_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14129 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4681288_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160320058 | 8B8/27FE6630 | 8B8/27FE6668
 pg_4962101_sync_4596929_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160320058 | 8B8/27FE65F8 | 8B8/27FE6630
 pg_4962101_sync_4394969_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14131 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4402927_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14130 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
 pg_4962101_sync_4458658_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160320058 | 8B8/27FE66A0 | 8B8/27FE66D8
 pg_4962101_sync_4374836_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14132 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
 pg_4962101_sync_4562514_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14137 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4588090_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14138 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
 pg_4962101_sync_4640031_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4188 | 8B7/940D41C0
 pg_4962101_sync_4602213_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D41C0 | 8B7/940D41F8
 pg_4962101_sync_4530451_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D41F8 | 8B7/940D4230
 pg_4962101_sync_4667494_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4230 | 8B7/940D4268
 pg_4962101_sync_4452706_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4268 | 8B7/940D42A0
 pg_4962101_sync_4553748_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D42A0 | 8B7/940D42D8
 pg_4962101_sync_4592679_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D42D8 | 8B7/940D4310
 pg_4962101_sync_4600185_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4310 | 8B7/940D4348
 pg_4962101_sync_4659756_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4348 | 8B7/940D4380
 pg_4962101_sync_4389357_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4380 | 8B7/940D43B8
 pg_4962101_sync_4773549_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D43B8 | 8B7/940D43F0
 pg_4962101_sync_4784350_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D43F0 | 8B7/940D4428
 pg_4962101_sync_4794832_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4428 | 8B7/940D4460
 pg_4962101_sync_4678346_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4460 | 8B7/940D4498
 pg_4962101_sync_4703039_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4498 | 8B7/940D44D0
 pg_4962101_sync_4496803_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D44D0 | 8B7/940D4508
 pg_4962101_sync_4612692_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4508 | 8B7/940D4540
 pg_4962101_sync_4779832_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4540 | 8B7/940D4578
 pg_4962101_sync_4569154_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4578 | 8B7/940D45B0
 pg_4962101_sync_4561772_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D45B0 | 8B7/940D45E8
 pg_4962101_sync_4452829_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D45E8 | 8B7/940D4620
 pg_4962101_sync_4490381_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4620 | 8B7/940D4658
 pg_4962101_sync_4585733_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4658 | 8B7/940D4690
 pg_4962101_sync_4459745_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14140 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4508354_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14141 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4451993_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      14142 |      |    160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
 pg_4962101_sync_4491715_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4770 | 8B7/940D47A8
 pg_4962101_sync_4404697_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D47A8 | 8B7/940D47E0
 pg_4962101_sync_4520525_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D47E0 | 8B7/940D4818
 pg_4962101_sync_4508992_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4818 | 8B7/940D4850
 pg_4962101_sync_4474811_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4850 | 8B7/940D4888
 pg_4962101_sync_4514152_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4888 | 8B7/940D48C0
 pg_4962101_sync_4379697_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D48C0 | 8B7/940D48F8
 pg_4962101_sync_4677962_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D48F8 | 8B7/940D4930
 pg_4962101_sync_4554683_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4930 | 8B7/940D4968
 pg_4962101_sync_4450272_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D4968 | 8B7/940D49A0
 pg_4962101_sync_4717161_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D49A0 | 8B7/940D49D8
 pg_4962101_sync_4784928_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160319884 | 8B7/940D49D8 | 8B7/940D4A10
(50 rows)

For some reason there are MANY inactive slots. But why?
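A quick way to see the split (generic query):

select active, count(*) from pg_replication_slots group by active;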

I checked the logs for a slot that worked well, and found:

2022-03-17 12:59:58.069 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,2,"authentication",2022-03-17 12:59:58 UTC,13/587955,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4512642_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 12:59:58.071 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,3,"SELECT",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"duration: 0.381 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 12:59:58.072 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,4,"BEGIN",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 0.043 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 12:59:58.085 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,5,"idle in transaction",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"logical decoding found consistent point at 8AF/6FDC8110","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.126 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,6,"SELECT",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 1.402 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM pg_catalog.pg_class c  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname = 'aaa_shard_bbb'  AND c.relname = 'some_table'",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.130 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,7,"SELECT",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 3.495 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum = ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid = pg_get_replica_identity_index(163890290)) WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''   AND a.attrelid = 163890290 ORDER BY a.attnum",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.134 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,8,"COPY",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 3.525 ms  statement: COPY aaa_shard_bbb.some_table TO STDOUT",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.135 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,9,"COMMIT",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"duration: 0.039 ms  statement: COMMIT",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:01.162 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,10,"idle",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"starting logical decoding for slot ""pg_4962101_sync_4512642_7073527418723003846""","Streaming transactions committing after 8AF/6FDC8148, reading WAL from 8AF/6FDC8110.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL 8AF/6FDC8148 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:01.162 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,11,"idle",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"logical decoding found consistent point at 8AF/6FDC8110","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL 8AF/6FDC8148 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4512642_7073527418723003846"
 

Not sure when this slot got removed, but it's no longer there.

But, a slot that currently exists and is inactive:

2022-03-17 13:32:48.401 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,2,"authentication",2022-03-17 13:32:48 UTC,11/150184,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4474811_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:32:48.403 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,3,"SELECT",2022-03-17 13:32:48 UTC,11/0,0,LOG,00000,"duration: 0.408 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4474811_7073527418723003846"
2022-03-17 13:32:48.405 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,4,"BEGIN",2022-03-17 13:32:48 UTC,11/150186,0,LOG,00000,"duration: 0.061 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4474811_7073527418723003846"
2022-03-17 13:32:48.419 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,5,"idle in transaction",2022-03-17 13:32:48 UTC,11/150186,0,LOG,00000,"logical decoding found consistent point at 8B7/940D4850","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4474811_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4474811_7073527418723003846"

And here is one that is currently active:

2022-03-17 13:32:48.349 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,2,"authentication",2022-03-17 13:32:48 UTC,11/150181,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4663369_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:32:48.354 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,3,"SELECT",2022-03-17 13:32:48 UTC,11/0,0,LOG,00000,"duration: 0.397 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:32:48.356 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,4,"BEGIN",2022-03-17 13:32:48 UTC,11/150183,0,LOG,00000,"duration: 0.049 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:32:48.381 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,5,"idle in transaction",2022-03-17 13:32:48 UTC,11/150183,0,LOG,00000,"logical decoding found consistent point at 8B7/940D4690","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.788 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,2,"authentication",2022-03-17 13:35:11 UTC,11/150814,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4663369_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:35:11.790 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,3,"SELECT",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"duration: 0.414 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.793 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,4,"BEGIN",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 0.058 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.805 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,5,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found initial starting point at 8B8/2D0B3DB0","Waiting for transactions (approximately 1) older than 160320063 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.915 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,6,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found initial consistent point at 8B8/2D1859D0","Waiting for transactions (approximately 1) older than 160320064 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.035 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,7,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found consistent point at 8B8/2D1874C0","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.050 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,8,"SELECT",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 1.056 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM pg_catalog.pg_class c  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname = 'aaa_shard_ccc'  AND c.relname = 'OTHER_TABLE'",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.055 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,9,"SELECT",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 3.441 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum = ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid = pg_get_replica_identity_index(308468275)) WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''   AND a.attrelid = 308468275 ORDER BY a.attnum",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.063 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,10,"COPY",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 7.618 ms  statement: COPY aaa_shard_ccc.OTHER_TABLE TO STDOUT",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,11,"COMMIT",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"duration: 0.051 ms  statement: COMMIT",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,12,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"starting logical decoding for slot ""pg_4962101_sync_4663369_7073527418723003846""","Streaming transactions committing after 8B8/2D1874F8, reading WAL from 8B8/2D0AB818.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,13,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found initial starting point at 8B8/2D0B3DB0","Waiting for transactions (approximately 1) older than 160320063 to end.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.083 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,14,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found initial consistent point at 8B8/2D1859D0","Waiting for transactions (approximately 1) older than 160320064 to end.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.090 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,15,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found consistent point at 8B8/2D1874C0","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
 

It looks like, for some reason, it doesn't get told which table to copy
next, and is kind of stuck in the process...
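
On the replica side, the per-table sync state can be summarized with a generic query like this ('r' means ready; anything else is still syncing):

select srsubstate, count(*)
  from pg_subscription_rel
 group by srsubstate;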

What could it be? How can I fix that?

We need to use logical replication because we want to upgrade the OS, and
there is a locale problem that rules out physical replication.

Best regards,

depesz




Re: Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski
On Thu, Mar 17, 2022 at 03:24:26PM +0100, hubert depesz lubaczewski wrote:
> Hi,
> I have following case:
>
> Primary: Ubuntu Bionic with Pg 12.9, and ~ 60k tables.
> Replica: Ubuntu Focal with Pg 14.2.

Upgraded primary to 12.10, redid everything. Same problem.

Interestingly, now when I look at pg_replication_slots, I see:

#v+
                  slot_name                  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417199 | 8FA/F601EF68 | 8FA/F87F13E8
 pg_6169636_sync_5600099_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17696 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5762839_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17697 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5953088_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17698 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5735455_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17699 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5873494_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3008 | 8FA/F88F3040
 pg_6169636_sync_5913846_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17700 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5664472_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17701 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5944971_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17702 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5887578_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17707 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5597814_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17703 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5706390_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | t      |      17706 |      |    160478361 | 9C5/E5C65BD0 | 9C5/FF470D58
 pg_6169636_sync_5900789_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3158 | 8FA/F88F3190
 pg_6169636_sync_5719614_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3190 | 8FA/F88F31C8
 pg_6169636_sync_5754127_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F31C8 | 8FA/F88F3200
 pg_6169636_sync_5667313_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3200 | 8FA/F88F3238
 pg_6169636_sync_5904579_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3238 | 8FA/F88F3270
 pg_6169636_sync_5922710_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3270 | 8FA/F88F32A8
 pg_6169636_sync_5698143_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F32A8 | 8FA/F88F32E0
 pg_6169636_sync_5579916_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F32E0 | 8FA/F88F3318
 pg_6169636_sync_5938883_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3318 | 8FA/F88F3350
 pg_6169636_sync_5654015_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3350 | 8FA/F88F3388
 pg_6169636_sync_5762125_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3388 | 8FA/F88F33C0
 pg_6169636_sync_5760797_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F33C0 | 8FA/F88F33F8
 pg_6169636_sync_5851313_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F33F8 | 8FA/F88F3430
 pg_6169636_sync_5916213_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3430 | 8FA/F88F3468
 pg_6169636_sync_5999341_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3468 | 8FA/F88F34A0
 pg_6169636_sync_5896729_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F34A0 | 8FA/F88F34D8
 pg_6169636_sync_5748694_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F34A0 | 8FA/F88F34D8
 pg_6169636_sync_5603063_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/782067B8 | 8FB/782067F0
 pg_6169636_sync_5859189_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3510 | 8FA/F88F3548
 pg_6169636_sync_5740988_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/782067F0 | 8FB/78206828
 pg_6169636_sync_5780890_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/78206828 | 8FB/78206860
 pg_6169636_sync_5900300_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/78206860 | 8FB/78206898
 pg_6169636_sync_5983861_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/78609D40 | 8FB/7A9D6028
 pg_6169636_sync_5704266_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417359 | 8FB/7A5C90B0 | 8FB/7A9D6028
 pg_6169636_sync_5950291_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD60C0 | 8FB/7AAD60F8
 pg_6169636_sync_5687359_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD60F8 | 8FB/7AAD6130
 pg_6169636_sync_5591454_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F37B8 | 8FA/F88F37F0
 pg_6169636_sync_5695446_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F37F0 | 8FA/F88F3828
 pg_6169636_sync_5654295_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3828 | 8FA/F88F3860
 pg_6169636_sync_5959324_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3828 | 8FA/F88F3860
 pg_6169636_sync_5653706_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F3898 | 8FA/F88F38D0
 pg_6169636_sync_5625296_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417208 | 8FA/F88F38D0 | 8FA/F88F3908
 pg_6169636_sync_5599947_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD6130 | 8FB/7AAD6168
 pg_6169636_sync_5900341_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD6168 | 8FB/7AAD61A0
 pg_6169636_sync_5657835_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD61A0 | 8FB/7AAD61D8
 pg_6169636_sync_5929106_7073527418723003846 | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    160417362 | 8FB/7AAD61D8 | 8FB/7AAD6210
(48 rows)
#v-
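
Out of curiosity, how far the slots are holding back WAL can be seen with a generic query like:

#v+
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as behind
  from pg_replication_slots
 order by restart_lsn;
#v-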

There are 10 "active" slots. When I checked their data in pg_stat_activity, it looks like they are stuck in
"START_REPLICATION SLOT":

#v+
=# select pid, application_name, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event,
          state, backend_xid, backend_xmin, query, backend_type
     from pg_stat_activity
    where pid in (select active_pid from pg_replication_slots where active);
  pid  │              application_name               │         backend_start         │ xact_start │          query_start          │         state_change          │ wait_event_type │     wait_event      │ state  │ backend_xid │ backend_xmin │ query │ backend_type
 17696 │ pg_6169636_sync_5600099_7073527418723003846 │ 2022-03-17 18:28:53.83822+00  │ [null]     │ 2022-03-17 18:28:54.178221+00 │ 2022-03-17 18:28:54.178229+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5600099_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17697 │ pg_6169636_sync_5762839_7073527418723003846 │ 2022-03-17 18:28:53.840978+00 │ [null]     │ 2022-03-17 18:28:54.182595+00 │ 2022-03-17 18:28:54.182605+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5762839_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17698 │ pg_6169636_sync_5953088_7073527418723003846 │ 2022-03-17 18:28:53.851774+00 │ [null]     │ 2022-03-17 18:28:54.182806+00 │ 2022-03-17 18:28:54.182813+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5953088_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17699 │ pg_6169636_sync_5735455_7073527418723003846 │ 2022-03-17 18:28:53.862408+00 │ [null]     │ 2022-03-17 18:28:54.181037+00 │ 2022-03-17 18:28:54.181046+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5735455_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17700 │ pg_6169636_sync_5913846_7073527418723003846 │ 2022-03-17 18:28:53.872853+00 │ [null]     │ 2022-03-17 18:28:54.179817+00 │ 2022-03-17 18:28:54.179825+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5913846_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17701 │ pg_6169636_sync_5664472_7073527418723003846 │ 2022-03-17 18:28:53.883557+00 │ [null]     │ 2022-03-17 18:28:54.175043+00 │ 2022-03-17 18:28:54.175053+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5664472_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17707 │ pg_6169636_sync_5887578_7073527418723003846 │ 2022-03-17 18:28:53.948718+00 │ [null]     │ 2022-03-17 18:28:54.176391+00 │ 2022-03-17 18:28:54.1764+00   │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5887578_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17702 │ pg_6169636_sync_5944971_7073527418723003846 │ 2022-03-17 18:28:53.894247+00 │ [null]     │ 2022-03-17 18:28:54.177971+00 │ 2022-03-17 18:28:54.177979+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5944971_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17703 │ pg_6169636_sync_5597814_7073527418723003846 │ 2022-03-17 18:28:53.90505+00  │ [null]     │ 2022-03-17 18:28:54.183205+00 │ 2022-03-17 18:28:54.183213+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5597814_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17706 │ pg_6169636_sync_5706390_7073527418723003846 │ 2022-03-17 18:28:53.915736+00 │ [null]     │ 2022-03-17 18:28:54.186649+00 │ 2022-03-17 18:28:54.186658+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │       [null] │ START_REPLICATION SLOT "pg_6169636_sync_5706390_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
(10 rows)
#v-

Please note that now() when I run these queries is ~ 2022-03-18 12:19:36.902819+00

Which means that these backends are stuck for ~ 18 hours now.
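
The same can be computed directly on the primary (generic query):

#v+
select pid, application_name, now() - query_start as stuck_for, wait_event
  from pg_stat_activity
 where backend_type = 'walsender'
   and wait_event = 'WalSenderWaitForWAL';
#v-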

The description of the wait event "WalSenderWaitForWAL" ("Waiting for WAL to be
flushed in WAL sender process.") doesn't seem right here, since WAL writes are working OK:

#v+
=# select *, pg_current_wal_lsn() from pg_control_checkpoint() \gx
─[ RECORD 1 ]────────┬─────────────────────────
checkpoint_lsn       │ 9C7/3E1568B0
redo_lsn             │ 9C7/1F7988D0
redo_wal_file        │ 00000002000009C70000001F
timeline_id          │ 2
prev_timeline_id     │ 2
full_page_writes     │ t
next_xid             │ 0:160478730
next_oid             │ 310866725
next_multixact_id    │ 167477
next_multi_offset    │ 501565
oldest_xid           │ 479
oldest_xid_dbid      │ 13398
oldest_active_xid    │ 160478730
oldest_multi_xid     │ 1
oldest_multi_dbid    │ 16606
oldest_commit_ts_xid │ 602
newest_commit_ts_xid │ 160478729
checkpoint_time      │ 2022-03-18 12:17:03+00
pg_current_wal_lsn   │ 9C8/4EFC31E0
#v-

Archiving works, all seems to be well. But the replication slots froze, and then
the replica started a bunch more.
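
For completeness, the positions the walsenders report can be checked with a generic query like:

#v+
select pid, application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
  from pg_stat_replication;
#v-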

If it would help, I can find everything related to these pids (of active
slots), but it might be too much. For now, one random pid:

#v+
2022-03-17 18:16:54.684 UTC,,,17706,"10.2.210.180:44612",62337b16.452a,1,"",2022-03-17 18:16:54 UTC,,0,LOG,00000,"connection received: host=10.2.210.180 port=44612",,,,,,,,,""
2022-03-17 18:16:54.691 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,2,"authentication",2022-03-17 18:16:54 UTC,11/57844,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_6169636_sync_5581717_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 18:16:54.693 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,3,"SELECT",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"duration: 0.395 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.694 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,4,"BEGIN",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 0.050 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.710 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,5,"idle in transaction",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"logical decoding found consistent point at 8F8/8CAB5958","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.715 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,6,"SELECT",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 1.130 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM pg_catalog.pg_class c  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname = 'shard_abcde'  AND c.relname = 'alerts_data'",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.720 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,7,"SELECT",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 4.753 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum = ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid = pg_get_replica_identity_index(164329186)) WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''   AND a.attrelid = 164329186 ORDER BY a.attnum",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,8,"COPY",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 1.878 ms  statement: COPY shard_abcde.alerts_data TO STDOUT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,9,"COMMIT",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"duration: 0.037 ms  statement: COMMIT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,10,"idle",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"starting logical decoding for slot ""pg_6169636_sync_5581717_7073527418723003846""","Streaming transactions committing after 8F8/8CAB5990, reading WAL from 8F8/8CAB5958.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,11,"idle",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"logical decoding found consistent point at 8F8/8CAB5958","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:28:53.915 UTC,,,17706,"10.2.210.180:35794",62337de5.452a,1,"",2022-03-17 18:28:53 UTC,,0,LOG,00000,"connection received: host=10.2.210.180 port=35794",,,,,,,,,""
2022-03-17 18:28:53.923 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,2,"authentication",2022-03-17 18:28:53 UTC,16/17576,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_6169636_sync_5706390_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,3,"SELECT",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"duration: 0.421 ms  statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,4,"idle",2022-03-17 18:28:53 UTC,16/0,0,ERROR,42704,"replication slot ""pg_6169636_sync_5706390_7073527418723003846"" does not exist",,,,,,"DROP_REPLICATION_SLOT pg_6169636_sync_5706390_7073527418723003846 WAIT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.926 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,5,"BEGIN",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 0.039 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.958 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,6,"idle in transaction",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.154 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,7,"idle in transaction",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found consistent point at 8FB/84D20960","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.169 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,8,"SELECT",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 1.184 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM pg_catalog.pg_class c  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname = 'shard_qwerty'  AND c.relname = 'sessions'",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.173 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,9,"SELECT",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 2.354 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum = ANY(i.indkey)  FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid = pg_get_replica_identity_index(163887255)) WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND a.attgenerated = ''   AND a.attrelid = 163887255 ORDER BY a.attnum",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.185 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,10,"COPY",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 11.738 ms  statement: COPY shard_qwerty.sessions TO STDOUT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,11,"COMMIT",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"duration: 0.039 ms  statement: COMMIT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,12,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"starting logical decoding for slot ""pg_6169636_sync_5706390_7073527418723003846""","Streaming transactions committing after 8FB/84D20998, reading WAL from 8FB/83EB4658.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.251 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,14,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found consistent point at 8FB/84D20960","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
#v-

What can I do about it?
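
If it comes down to cleaning up and retrying, the leftover inactive sync slots on the primary could be dropped with something like this (the name pattern is just based on what the slots above look like):

#v+
select pg_drop_replication_slot(slot_name)
  from pg_replication_slots
 where not active
   and slot_name ~ '^pg_[0-9]+_sync_[0-9]+_';
#v-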

Best regards,

depesz




18:28:54.186649+00│ 2022-03-17 18:28:54.186658+00 │ Client          │ WalSenderWaitForWAL │ active │      [null] │
[null] │ START_REPLICATION SLOT "pg_6169636_sync_5706390_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1',
publication_names'"focal14"') │ walsender 
> (10 rows)
> #v-
>
> Please note that now() when I run these queries is ~ 2022-03-18 12:19:36.902819+00
>
> Which means that these backends are stuck for ~ 18 hours now.
>
> Description of wait event "WalSenderWaitForWAL" "Waiting for WAL to be flushed
> in WAL sender process." doesn't seem to be OK, WAL writes are working OK:
>
> #v+
> =# select *, pg_current_wal_lsn() from pg_control_checkpoint() \gx
> ─[ RECORD 1 ]────────┬─────────────────────────
> checkpoint_lsn       │ 9C7/3E1568B0
> redo_lsn             │ 9C7/1F7988D0
> redo_wal_file        │ 00000002000009C70000001F
> timeline_id          │ 2
> prev_timeline_id     │ 2
> full_page_writes     │ t
> next_xid             │ 0:160478730
> next_oid             │ 310866725
> next_multixact_id    │ 167477
> next_multi_offset    │ 501565
> oldest_xid           │ 479
> oldest_xid_dbid      │ 13398
> oldest_active_xid    │ 160478730
> oldest_multi_xid     │ 1
> oldest_multi_dbid    │ 16606
> oldest_commit_ts_xid │ 602
> newest_commit_ts_xid │ 160478729
> checkpoint_time      │ 2022-03-18 12:17:03+00
> pg_current_wal_lsn   │ 9C8/4EFC31E0
> #v-
>
> Archiving works, all seems to be well. But replication slots froze, and then
> replica started bunch more.
>
> If that would help I can find all things related to these pids (of active
> slots), but it might be too much. For now, one, random, pid:
>
> #v+
> 2022-03-17 18:16:54.684 UTC,,,17706,"10.2.210.180:44612",62337b16.452a,1,"",2022-03-17 18:16:54
UTC,,0,LOG,00000,"connectionreceived: host=10.2.210.180 port=44612",,,,,,,,,"" 
> 2022-03-17 18:16:54.691
UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,2,"authentication",2022-03-1718:16:54
UTC,11/57844,0,LOG,00000,"replicationconnection authorized: user=lrep_user
application_name=pg_6169636_sync_5581717_7073527418723003846SSL enabled (protocol=TLSv1.3,
cipher=TLS_AES_256_GCM_SHA384,bits=256, compression=off)",,,,,,,,,"" 
> 2022-03-17 18:16:54.693 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,3,"SELECT",2022-03-17
18:16:54UTC,11/0,0,LOG,00000,"duration: 0.395 ms  statement: SELECT pg_catalog.set_config('search_path', '',
false);",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.694 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,4,"BEGIN",2022-03-17
18:16:54UTC,11/57846,0,LOG,00000,"duration: 0.050 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE
READ",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.710 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,5,"idle in
transaction",2022-03-1718:16:54 UTC,11/57846,0,LOG,00000,"logical decoding found consistent point at
8F8/8CAB5958","Thereare no running transactions.",,,,,"CREATE_REPLICATION_SLOT
""pg_6169636_sync_5581717_7073527418723003846""LOGICAL pgoutput
USE_SNAPSHOT",,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.715 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,6,"SELECT",2022-03-17
18:16:54UTC,11/57846,0,LOG,00000,"duration: 1.130 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM
pg_catalog.pg_classc  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname =
'shard_abcde'  AND c.relname = 'alerts_data'",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846" 
> 2022-03-17 18:16:54.720 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,7,"SELECT",2022-03-17
18:16:54UTC,11/57846,0,LOG,00000,"duration: 4.753 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum =
ANY(i.indkey) FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid =
pg_get_replica_identity_index(164329186))WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND
a.attgenerated= ''   AND a.attrelid = 164329186 ORDER BY
a.attnum",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,8,"COPY",2022-03-17
18:16:54UTC,11/57846,0,LOG,00000,"duration: 1.878 ms  statement: COPY shard_abcde.alerts_data TO
STDOUT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,9,"COMMIT",2022-03-17
18:16:54UTC,11/0,0,LOG,00000,"duration: 0.037 ms  statement:
COMMIT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
> 2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,10,"idle",2022-03-17
18:16:54UTC,11/0,0,LOG,00000,"starting logical decoding for slot
""pg_6169636_sync_5581717_7073527418723003846""","Streamingtransactions committing after 8F8/8CAB5990, reading WAL from
8F8/8CAB5958.",,,,,"START_REPLICATIONSLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990
(proto_version'1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846" 
> 2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,11,"idle",2022-03-17
18:16:54UTC,11/0,0,LOG,00000,"logical decoding found consistent point at 8F8/8CAB5958","There are no running
transactions.",,,,,"START_REPLICATIONSLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990
(proto_version'1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846" 
> 2022-03-17 18:28:53.915 UTC,,,17706,"10.2.210.180:35794",62337de5.452a,1,"",2022-03-17 18:28:53
UTC,,0,LOG,00000,"connectionreceived: host=10.2.210.180 port=35794",,,,,,,,,"" 
> 2022-03-17 18:28:53.923
UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,2,"authentication",2022-03-1718:28:53
UTC,16/17576,0,LOG,00000,"replicationconnection authorized: user=lrep_user
application_name=pg_6169636_sync_5706390_7073527418723003846SSL enabled (protocol=TLSv1.3,
cipher=TLS_AES_256_GCM_SHA384,bits=256, compression=off)",,,,,,,,,"" 
> 2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,3,"SELECT",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"duration: 0.421 ms  statement: SELECT pg_catalog.set_config('search_path', '',
false);",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,4,"idle",2022-03-17
18:28:53UTC,16/0,0,ERROR,42704,"replication slot ""pg_6169636_sync_5706390_7073527418723003846"" does not
exist",,,,,,"DROP_REPLICATION_SLOTpg_6169636_sync_5706390_7073527418723003846
WAIT",,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:53.926 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,5,"BEGIN",2022-03-17
18:28:53UTC,16/17578,0,LOG,00000,"duration: 0.039 ms  statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE
READ",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:53.958 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,6,"idle in
transaction",2022-03-1718:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found initial starting point at
8FB/83FCA7D8","Waitingfor transactions (approximately 1) older than 160417372 to end.",,,,,"CREATE_REPLICATION_SLOT
""pg_6169636_sync_5706390_7073527418723003846""LOGICAL pgoutput
USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.154 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,7,"idle in
transaction",2022-03-1718:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found consistent point at
8FB/84D20960","Thereare no running transactions.",,,,,"CREATE_REPLICATION_SLOT
""pg_6169636_sync_5706390_7073527418723003846""LOGICAL pgoutput
USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.169 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,8,"SELECT",2022-03-17
18:28:53UTC,16/17578,0,LOG,00000,"duration: 1.184 ms  statement: SELECT c.oid, c.relreplident, c.relkind  FROM
pg_catalog.pg_classc  INNER JOIN pg_catalog.pg_namespace n        ON (c.relnamespace = n.oid) WHERE n.nspname =
'shard_qwerty'  AND c.relname = 'sessions'",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846" 
> 2022-03-17 18:28:54.173 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,9,"SELECT",2022-03-17
18:28:53UTC,16/17578,0,LOG,00000,"duration: 2.354 ms  statement: SELECT a.attname,       a.atttypid,       a.attnum =
ANY(i.indkey) FROM pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_index i       ON (i.indexrelid =
pg_get_replica_identity_index(163887255))WHERE a.attnum > 0::pg_catalog.int2   AND NOT a.attisdropped AND
a.attgenerated= ''   AND a.attrelid = 163887255 ORDER BY
a.attnum",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.185 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,10,"COPY",2022-03-17
18:28:53UTC,16/17578,0,LOG,00000,"duration: 11.738 ms  statement: COPY shard_qwerty.sessions TO
STDOUT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,11,"COMMIT",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"duration: 0.039 ms  statement:
COMMIT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,12,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"starting logical decoding for slot
""pg_6169636_sync_5706390_7073527418723003846""","Streamingtransactions committing after 8FB/84D20998, reading WAL from
8FB/83EB4658.",,,,,"START_REPLICATIONSLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998
(proto_version'1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846" 
> 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions
(approximately1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT
""pg_6169636_sync_5706390_7073527418723003846""LOGICAL 8FB/84D20998 (proto_version '1', publication_names
'""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
> 2022-03-17 18:28:54.251 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,14,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found consistent point at 8FB/84D20960","There are no running
transactions.",,,,,"START_REPLICATIONSLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998
(proto_version'1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846" 
> #v-
>
> What can I do about it?

>2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could
notfind free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try
again.",,,,,,,"","logicalreplication worker",,0 

Each tablesync worker needs a replication slot to synchronize the data
to the subscriber, so this error means there were not enough replication
slots available. Try increasing the max_replication_slots configuration
if it is set to a low value, since you are using around 60k tables.
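On the subscriber, something along these lines (just a sketch; the new
value of 100 is only an example, pick one that fits your setup) can be
used to check and raise the limit. Note that max_replication_slots only
takes effect after a server restart:

show max_replication_slots;
-- example value, adjust as needed; requires a restart to take effect
alter system set max_replication_slots = 100;
-- after the restart, confirm:
show max_replication_slots;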

> 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions
(approximately1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT 

The snapshot builder is waiting for an open transaction to finish. Is it
possible that some transactions were started and left open while this
logical replication was running? Once those transactions complete,
snapshot building and the tablesync operation should be able to finish.
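For instance, a query roughly like this (only a sketch) should show any
sessions that are holding transactions open:

select pid, backend_xid, backend_xmin, xact_start, state, backend_type, query
from pg_stat_activity
where xact_start is not null
order by xact_start;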

Regards,
Vignesh



Re: Logical replication stops dropping used initial-sync replication slots

From
hubert depesz lubaczewski
Date:
On Mon, Mar 21, 2022 at 07:48:20PM +0530, vignesh C wrote:
> > What can I do about it?
> 
> >2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48
UTC,8/447946,23877409,ERROR,53400,"couldnot find free replication state slot for replication origin with OID
51",,"Increasemax_replication_slots and try again.",,,,,,,"","logical replication worker",,0
 
> Each Tablesync worker needs a replication slot to synchronize the data
> to the subscriber, this error means there are not enough replication
> slots. Try to increase max_replication_slots configuration if it is
> set to lower value as you are using around 60k tables.

yeah, but I don't need 60k replication slots?

Please note that replication (initial sync) *IS* working happily for
*some* time, and only then starts "eating" replication slots for
whatever reason.

Specifically - I have max_replication_slots set to 50 (per my original
mail), and sync workers shouldn't exceed 10.

> > 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions
(approximately1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT
 
> 
> Snapshot builder is waiting for the open transaction to finish. Is it
> possible that there are some open transactions that were started
> during this logical replication. I felt once the transaction
> completes, snapshot building and tablesync operation should finish.

There are no long open transactions there. This is a test box with *zero*
traffic (aside from running in a loop: create table, drop table), but
*NOT* in a transaction.

Best regards,

depesz




Re: Logical replication stops dropping used initial-sync replication slots

From
vignesh C
Date:
On Mon, Mar 21, 2022 at 8:13 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Mar 21, 2022 at 07:48:20PM +0530, vignesh C wrote:
> > > What can I do about it?
> >
> > >2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48
UTC,8/447946,23877409,ERROR,53400,"couldnot find free replication state slot for replication origin with OID
51",,"Increasemax_replication_slots and try again.",,,,,,,"","logical replication worker",,0 
> > Each Tablesync worker needs a replication slot to synchronize the data
> > to the subscriber, this error means there are not enough replication
> > slots. Try to increase max_replication_slots configuration if it is
> > set to lower value as you are using around 60k tables.
>
> yeah, but I don't need 60k replication slots?
>
> Please note that replication (initial sync) *IS* working happily for
> *some* time, and only then starts "eating" replication slots for
> whatever reason.
>
> Specifically - I have max_replication_slots set to 50 (per my original
> mail), and sync workers shouldn't exceed 10.
>
> > > 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17
18:28:53UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions
(approximately1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT 
> >
> > Snapshot builder is waiting for the open transaction to finish. Is it
> > possible that there are some open transactions that were started
> > during this logical replication. I felt once the transaction
> > completes, snapshot building and tablesync operation should finish.
>
> There are no long open transacitons there. This is test box with *zero*
> traffic (aside from running in a loop: create table, drop table), but
> *NOT* in transaction.

I could simulate a similar problem by starting a new transaction
while creating a subscription, and noticed the following log content:
2022-03-22 08:00:17.868 IST [2680] LOG:  logical decoding found
initial starting point at 0/427DC3D0
2022-03-22 08:00:17.868 IST [2680] DETAIL:  Waiting for transactions
(approximately 1) older than 748 to end.
2022-03-22 08:00:17.868 IST [2680] STATEMENT:  CREATE_REPLICATION_SLOT
"pg_16439_sync_16403_7077745941166836171" LOGICAL pgoutput (SNAPSHOT
'use')

I could see that the following transaction was open (older than 748):
 pid  | usename | application_name |          backend_start           |            xact_start            | wait_event_type | wait_event  | state  | backend_xid | backend_xmin |               query                |   backend_type
------+---------+------------------+----------------------------------+----------------------------------+-----------------+-------------+--------+-------------+--------------+------------------------------------+-------------------
 2653 | vignesh | psql             | 2022-03-22 07:59:46.147802+05:30 | 2022-03-22 08:00:17.394259+05:30 |                 |             | active |         747 |          747 | select * from pg_stat_activity ;   | client backend
 2665 |         |                  | 2022-03-22 08:00:03.005628+05:30 | 2022-03-22 08:00:21.653243+05:30 | Timeout         | VacuumDelay | active |             |          747 | autovacuum: VACUUM ANALYZE sch1.t4 | autovacuum worker

Similarly I could see it from pg_locks:
   locktype    | database | relation | transactionid | virtualtransaction | pid  |      mode       | granted | fastpath |            waitstart
---------------+----------+----------+---------------+--------------------+------+-----------------+---------+----------+----------------------------------
 transactionid |          |          |           747 | 6/11               | 2682 | ShareLock       | f       | f        | 2022-03-22 08:00:17.932021+05:30
 relation      |        0 |     1262 |               | 3/41               | 2653 | AccessShareLock | t       | f        |
 transactionid |          |          |           747 | 7/11               | 2680 | ShareLock       | f       | f        | 2022-03-22 08:00:17.868615+05:30
 transactionid |          |          |           747 | 3/41               | 2653 | ExclusiveLock   | t       | f        |

I'm not sure if this is the same scenario or not.
Could you also confirm from pg_stat_activity and pg_locks whether there
are any open transactions? If there are, could you check which relation
each transaction is on, and whether they come from the tablesync
workers?
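A query along these lines (only a sketch) could help correlate open
transactions with the relations they lock:

select a.pid, a.backend_type, a.xact_start, l.locktype,
       l.relation::regclass as relation, l.mode, l.granted
from pg_stat_activity a
join pg_locks l on l.pid = a.pid
where a.xact_start is not null;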

Also, as an alternative, if there are no concurrent transactions, you
might be able to use pg_dump to take a backup, restore it on the
destination, and then create the subscription with copy_data set to off.
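Roughly like this on the subscriber, after the dump has been restored
(a sketch only; the connection string is a placeholder):

create subscription focal14
    connection 'host=source_host port=5432 user=lrep_user dbname=dbname'
    publication focal14
    with (copy_data = false);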

Regards,
Vignesh



Re: Logical replication stops dropping used initial-sync replication slots

From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 08:37:07AM +0530, vignesh C wrote:
> I could simulate the similar problem by starting a new transaction
> while creating a subscription and notice the following log content:
...
> I'm not sure if this is the same scenario or not.
> Could you also confirm from pg_stat_activity and pg_locks if there are
> any open transactions and if there is any open transaction, could you
> see this transaction is on which relation and are these because of the
> table sync workers?

I don't think it is the same scenario, because this is a totally isolated
server, with no applications running, cron stopped, and no access from
outside whatsoever.
Also, even if I assume that some open transaction can pause the initial
sync, which is reasonable, why would it start to leave open replication
slots, and start new ones, above the number specified in
max_sync_workers_per_subscription?

> Also as an alternative you might be able to use pg_dump to take a
> backup and restore it to the destination if there are no transactions
> and create subscription with copy_data as off.

That wouldn't do me any good. The server I'm trying to upgrade now is
a test bench to make sure that I can upgrade using logical replication *at
all*.
Once I have verified that I can, I have a couple hundred servers to
upgrade, which will have concurrent access all the time.

If the problem was: upgrade server that doesn't have any traffic,
I could just pg_dump + pg_restore, and be done with it. Unfortunately
this is not the case.

Best regards,

depesz




Re: Logical replication stops dropping used initial-sync replication slots

From
vignesh C
Date:
On Tue, Mar 22, 2022 at 9:18 AM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Mar 22, 2022 at 08:37:07AM +0530, vignesh C wrote:
> > I could simulate the similar problem by starting a new transaction
> > while creating a subscription and notice the following log content:
> ...
> > I'm not sure if this is the same scenario or not.
> > Could you also confirm from pg_stat_activity and pg_locks if there are
> > any open transactions and if there is any open transaction, could you
> > see this transaction is on which relation and are these because of the
> > table sync workers?
>
> I don't think it is the same scenario because this is totally isolated
> server, with no applications running, cron stopped and no access from
> outside whatsoever.
> Also, even if I'd assume that some open transaction can pause initial
> sync, which is reasonable, why would it start to leave open replication
> slots, and start new ones, above number specified in
> max_sync_workers_per_subscription?
>
> > Also as an alternative you might be able to use pg_dump to take a
> > backup and restore it to the destination if there are no transactions
> > and create subscription with copy_data as off.
>
> That wouldn't do me any good. The server I'm trying to upgrade now is
> test bench to make sure that I can upgrade using logical replication *at
> all*.
> Once I will have verified that I can, I have couple hundred servers to
> upgrade, that will have concurrent access all the time.

Is it possible to get a reproducible test or a script for this problem?

Regards,
Vignesh



Re: Logical replication stops dropping used initial-sync replication slots

From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > That wouldn't do me any good. The server I'm trying to upgrade now is
> > test bench to make sure that I can upgrade using logical replication *at
> > all*.
> > Once I will have verified that I can, I have couple hundred servers to
> > upgrade, that will have concurrent access all the time.
> Is it possible to get  a reproducible test or a script for this problem?

Well, given that we're talking about a 38 GB database with 60k tables,
I find that unlikely.

But I'll see if I can figure out a way to replicate the problem with
something that is shareable.

depesz



Re: Logical replication stops dropping used initial-sync replication slots

From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > That wouldn't do me any good. The server I'm trying to upgrade now is
> > > test bench to make sure that I can upgrade using logical replication *at
> > > all*.
> > > Once I will have verified that I can, I have couple hundred servers to
> > > upgrade, that will have concurrent access all the time.
> > Is it possible to get  a reproducible test or a script for this problem?
> 
> Well, given that we're talking about 38GB database with 60k tables
> i find that unlikely.
> 
> But I'll try if I can figure out a way to replicate the problem with
> something that is shareable.

OK. I don't think it will be simple/possible to replicate.

Reason:

I set up 10 separate environments, all with the same database on Bionic
12.9.

Then I created a publication with all tables in specific schemas, and
subscribed to it from separate Focal Pg 14.2 servers.

In total, I used 20 servers to replicate the test 10 times.

Effect: in one case replication didn't finish the initial sync.

Currently it's 2022-03-25 08:32:14.10081+00, and on the one source whose
subscription had failed, I see:
                  slot_name                  |  plugin  | slot_type | datoid | database | temporary | active |
active_pid| xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn 
 

---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |
 |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58
 
 pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32385|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32378|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32386|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32387|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32379|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32380|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32381|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32382|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32383|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
 pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32384|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 
(11 rows)

pg_stat_replication shows 10 sync connections, and that's it. Their sent_lsn,
write_lsn, flush_lsn, and replay_lsn seem to be progressing nicely, but nothing
else is changing.
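That observation is from more or less this kind of query (nothing fancy):

select application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
from pg_stat_replication
order by application_name;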

pg_stat_activity contains:

=# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
          xact_start           │          query_start          │         backend_type         │
                                          query                                                                     
 

═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 [null]                        │ [null]                        │ autovacuum launcher          │ 
 [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_2993718_7077521990523695096"LOGICAL 8E9/40FF0170 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ [null]                        │ pglogical supervisor         │ 
 [null]                        │ [null]                        │ logical replication launcher │ 
 [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3343043_7077521990523695096"LOGICAL 8E9/40FF0100 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3079712_7077521990523695096"LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3145162_7077521990523695096"LOGICAL 8E9/40FF0138 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3355696_7077521990523695096"LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3099460_7077521990523695096"LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3162651_7077521990523695096"LOGICAL 8E9/40FF0020 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3157454_7077521990523695096"LOGICAL 8E9/40FF0058 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3084198_7077521990523695096"LOGICAL 8E9/40FF0090 (proto_version '1', publication_names '"focal14"')
 
 [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION SLOT
"pg_3548429_sync_3142573_7077521990523695096"LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names '"focal14"')
 
 2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start,
query_start,backend_type, query from pg_stat_activity where state is distinct from 'idle';
 
 [null]                        │ [null]                        │ background writer            │ 
 [null]                        │ [null]                        │ checkpointer                 │ 
 [null]                        │ [null]                        │ walwriter                    │ 
(17 rows)

So it looks to me like there is some random-ish condition that causes
the problem.

I tried to insulate the Pgs from the world as well as I could: no
monitoring, no cron, no external access. Just replication and my simple
load generator, running this loop:

CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
CREATE INDEX depesz_test_idx on depesz_test(i);
DROP TABLE depesz_test;

every minute.

I'm out of ideas. The good-ish news is that when I tried replication in
smaller sets (30k tables, not 60k), and/or with just
2 max_sync_workers_per_subscription instead of 10, it never failed. So
I guess I'll go that way.

Best regards,

depesz





Re: Logical replication stops dropping used initial-sync replication slots

From
Amit Kapila
Date:
On Fri, Mar 25, 2022 at 2:09 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> > On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > > That wouldn't do me any good. The server I'm trying to upgrade now is
> > > > test bench to make sure that I can upgrade using logical replication *at
> > > > all*.
> > > > Once I will have verified that I can, I have couple hundred servers to
> > > > upgrade, that will have concurrent access all the time.
> > > Is it possible to get  a reproducible test or a script for this problem?
> >
> > Well, given that we're talking about 38GB database with 60k tables
> > i find that unlikely.
> >
> > But I'll try if I can figure out a way to replicate the problem with
> > something that is shareable.
>
> OK. I don't think it will be simple/possible to replicate.
>
> Reason:
>
> Did setup 10 separate environments, all with the same database on bionic
> 12.9.
>
> Then I created publication with all tables in specific schemas, and
> subscribed to it from separate focal pg 14.2 servers.
>
> In total, used 20 servers to replicate the test 10 times.
>
> Effect: in one case replication didn't finish initial sync.
>
> Currently, it's 2022-03-25 08:32:14.10081+00, and on the one source that
> had failed with subscription i see:
>                   slot_name                  |  plugin  | slot_type | datoid | database | temporary | active |
active_pid| xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn 
>
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
>  focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |
   |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58 
>  pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32385|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32378|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32386|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32387|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32379|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32380|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32381|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32382|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32383|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
>  pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |
32384|      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0 
> (11 rows)
>
> pg_stat_replication shows 10 sync connections, and that's it. their sent_lsn,
> write_lsn, flush_lsn, replay_lsn seems to be progressing nicely, but nothing is
> changing.
>
> pg_stat_activity contains:
>
> =# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
>           xact_start           │          query_start          │         backend_type         │
                                            query 
>
═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
>  [null]                        │ [null]                        │ autovacuum launcher          │
>  [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_2993718_7077521990523695096" LOGICAL 8E9/40FF0170 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ [null]                        │ pglogical supervisor         │
>  [null]                        │ [null]                        │ logical replication launcher │
>  [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3343043_7077521990523695096" LOGICAL 8E9/40FF0100 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3079712_7077521990523695096" LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3145162_7077521990523695096" LOGICAL 8E9/40FF0138 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3355696_7077521990523695096" LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3099460_7077521990523695096" LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3162651_7077521990523695096" LOGICAL 8E9/40FF0020 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3157454_7077521990523695096" LOGICAL 8E9/40FF0058 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3084198_7077521990523695096" LOGICAL 8E9/40FF0090 (proto_version '1', publication_names
'"focal14"')
>  [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION
SLOT"pg_3548429_sync_3142573_7077521990523695096" LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names
'"focal14"')
>  2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start,
query_start,backend_type, query from pg_stat_activity where state is distinct from 'idle'; 
>  [null]                        │ [null]                        │ background writer            │
>  [null]                        │ [null]                        │ checkpointer                 │
>  [null]                        │ [null]                        │ walwriter                    │
> (17 rows)
>
> So, it looks to me that there is some random-ish condition that causes
> the problem.
>
> I tried to insulate pgs from the world as well as I could: no
> monitoring, no cron, no external access. Just replication and my simple
> load generator, running loop:
>
> CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
> CREATE INDEX depesz_test_idx on depesz_test(i);
> DROP TABLE depesz_test;
>
> every minute.
>
> I'm out of ideas. The good-ish news is that when I tried replication in
> smaller sets (30k tables, not 60k), and/or with just
> 2 max_sync_workers_per_subscription, and not 10, it never failed. so
> I guess I'll go that way.
>

What was your setting for max_logical_replication_workers? It should
be more than max_sync_workers_per_subscription as that is taken from
the pool of max_logical_replication_workers. One possible reason for
some inactive slots could be that some tablesync workers exited
without finishing the initial sync, and in the meantime new tablesync
workers were started while the old ones didn't get time to finish.
Eventually, you will get the error "Increase max_replication_slots
and try again" because the number of replication origins is also
restricted to max_replication_slots, and we create one origin for
each table copy.
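One way to check this on the subscriber (just a sketch) is to compare the
number of origins already created with the configured limit:

select count(*) as origins from pg_replication_origin;
show max_replication_slots;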

It is good if you are able to proceed but we might get some clue about
the problem you are seeing if you can show subscriber-side logs.

--
With Regards,
Amit Kapila.



Re: Logical replication stops dropping used initial-sync replication slots

From
hubert depesz lubaczewski
Date:
On Tue, Mar 29, 2022 at 04:57:42PM +0530, Amit Kapila wrote:
> What was your setting for max_logical_replication_workers? It should

4 on source, and 50 on replica. That could be related. Should I increase this on source?

> be more than max_sync_workers_per_subscription as that is taken from
> the pool of max_logical_replication_workers. One possible reason for

The thing is that, in next tests, it seemed to *usually* work. And break
just sometimes.

> It is good if you are able to proceed but we might get some clue about
> the problem you are seeing if you can show subscriber-side logs.

Will keep that in mind. The boxes I was testing on are gone, but if the
problem resurfaces, I'll make the logs available.

Best regards,

depesz




Re: Logical replication stops dropping used initial-sync replication slots

From
Amit Kapila
Date:
On Tue, Mar 29, 2022 at 5:31 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Mar 29, 2022 at 04:57:42PM +0530, Amit Kapila wrote:
> > What was your setting for max_logical_replication_workers? It should
>
> 4 on source, and 50 on replica. That could be related. Should I increase this on source?
>

No, that won't matter. Only the number on the replica matters, and that
seems to be okay in your case.

> > be more than max_sync_workers_per_subscription as that is taken from
> > the pool of max_logical_replication_workers. One possible reason for
>
> The thing is that, in next tests, it seemed to *usually* work. And break
> just sometimes.
>

It is possible that some random error is occurring during the initial
copy; it's difficult to say without seeing the actual error.

> > It is good if you are able to proceed but we might get some clue about
> > the problem you are seeing if you can show subscriber-side logs.
>
> Will keep that in mind. The boxes I was testing it are gone, but if the
> problem will resurface, I'll make the logs available.
>

Thanks, that might help investigate the problem you are facing.

-- 
With Regards,
Amit Kapila.