Thread: Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski
Date:
Hi,

I have the following case:

Primary: Ubuntu Bionic with Pg 12.9, and ~60k tables.
Replica: Ubuntu Focal with Pg 14.2.

Settings are:

#v+
max_wal_senders                   │ 50
wal_sender_timeout                │ 60000
max_sync_workers_per_subscription │ 10
max_worker_processes              │ 50
#v-

I made the publication using:

#v+
create publication focal14;
select format('ALTER PUBLICATION focal14 ADD TABLE ONLY %I.%I;', n.nspname, c.relname)
  from pg_class c
  join pg_namespace n on c.relnamespace = n.oid
 where n.nspname ~ 'shard'
   and c.relkind in ('p', 'r')
 order by n.nspname, c.relname \gexec
#v-

That worked. Then, on the replica, I did:

#v+
create subscription focal14
  connection 'host=${source_ip} port=5432 user=lrep_user dbname=${dbname}'
  publication focal14;
#v-

This started the initial sync. It was happily working with 10 concurrent workers, but after ~1 hour I started to get, in logs on the replica:

#v+
2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could not find free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try again.",,,,,,,"","logical replication worker",,0
#v-

And sure enough, there are MANY replication slots on the primary:

#v+
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------
focal14 | pgoutput | logical | 16606 | dbname | f | f | | | 160319873 | 8B7/89D938C0 | 8B7/93FD2380
pg_4962101_sync_4663369_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14139 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4794836_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160320058 | 8B8/27FE6668 | 8B8/27FE66A0
pg_4962101_sync_4613771_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14129 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4681288_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160320058 | 8B8/27FE6630 | 8B8/27FE6668
pg_4962101_sync_4596929_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160320058 | 8B8/27FE65F8 | 8B8/27FE6630
pg_4962101_sync_4394969_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14131 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4402927_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14130 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
pg_4962101_sync_4458658_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160320058 | 8B8/27FE66A0 | 8B8/27FE66D8
pg_4962101_sync_4374836_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14132 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
pg_4962101_sync_4562514_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14137 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4588090_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14138 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5EFFDE90
pg_4962101_sync_4640031_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4188 | 8B7/940D41C0
pg_4962101_sync_4602213_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D41C0 | 8B7/940D41F8
pg_4962101_sync_4530451_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D41F8 | 8B7/940D4230
pg_4962101_sync_4667494_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4230 | 8B7/940D4268
pg_4962101_sync_4452706_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4268 | 8B7/940D42A0
pg_4962101_sync_4553748_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D42A0 | 8B7/940D42D8
pg_4962101_sync_4592679_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D42D8 | 8B7/940D4310
pg_4962101_sync_4600185_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4310 | 8B7/940D4348
pg_4962101_sync_4659756_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4348 | 8B7/940D4380
pg_4962101_sync_4389357_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4380 | 8B7/940D43B8
pg_4962101_sync_4773549_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D43B8 | 8B7/940D43F0
pg_4962101_sync_4784350_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D43F0 | 8B7/940D4428
pg_4962101_sync_4794832_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4428 | 8B7/940D4460
pg_4962101_sync_4678346_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4460 | 8B7/940D4498
pg_4962101_sync_4703039_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4498 | 8B7/940D44D0
pg_4962101_sync_4496803_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D44D0 | 8B7/940D4508
pg_4962101_sync_4612692_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4508 | 8B7/940D4540
pg_4962101_sync_4779832_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4540 | 8B7/940D4578
pg_4962101_sync_4569154_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4578 | 8B7/940D45B0
pg_4962101_sync_4561772_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D45B0 | 8B7/940D45E8
pg_4962101_sync_4452829_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D45E8 | 8B7/940D4620
pg_4962101_sync_4490381_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4620 | 8B7/940D4658
pg_4962101_sync_4585733_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4658 | 8B7/940D4690
pg_4962101_sync_4459745_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14140 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4508354_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14141 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4451993_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 14142 | | 160322211 | 8BF/4D9EC3F0 | 8BF/5F79B708
pg_4962101_sync_4491715_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4770 | 8B7/940D47A8
pg_4962101_sync_4404697_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D47A8 | 8B7/940D47E0
pg_4962101_sync_4520525_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D47E0 | 8B7/940D4818
pg_4962101_sync_4508992_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4818 | 8B7/940D4850
pg_4962101_sync_4474811_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4850 | 8B7/940D4888
pg_4962101_sync_4514152_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4888 | 8B7/940D48C0
pg_4962101_sync_4379697_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D48C0 | 8B7/940D48F8
pg_4962101_sync_4677962_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D48F8 | 8B7/940D4930
pg_4962101_sync_4554683_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4930 | 8B7/940D4968
pg_4962101_sync_4450272_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D4968 | 8B7/940D49A0
pg_4962101_sync_4717161_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D49A0 | 8B7/940D49D8
pg_4962101_sync_4784928_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160319884 | 8B7/940D49D8 | 8B7/940D4A10
(50 rows)
#v-

For some reason there are MANY inactive slots. But why? I checked the logs for a slot that worked well, and found:

#v+
2022-03-17 12:59:58.069 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,2,"authentication",2022-03-17 12:59:58 UTC,13/587955,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4512642_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 12:59:58.071 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,3,"SELECT",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"duration: 0.381 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 12:59:58.072 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,4,"BEGIN",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 0.043 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 12:59:58.085 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,5,"idle in transaction",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"logical decoding found consistent point at 8AF/6FDC8110","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.126 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,6,"SELECT",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 1.402 ms statement: SELECT c.oid, c.relreplident, c.relkind FROM pg_catalog.pg_class c INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 'aaa_shard_bbb' AND c.relname = 'some_table'",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.130 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,7,"SELECT",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 3.495 ms statement: SELECT a.attname, a.atttypid, a.attnum = ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_index i ON (i.indexrelid = pg_get_replica_identity_index(163890290)) WHERE a.attnum > 0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = '' AND a.attrelid = 163890290 ORDER BY a.attnum",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.134 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,8,"COPY",2022-03-17 12:59:58 UTC,13/587957,0,LOG,00000,"duration: 3.525 ms statement: COPY aaa_shard_bbb.some_table TO STDOUT",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:00.135 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,9,"COMMIT",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"duration: 0.039 ms statement: COMMIT",,,,,,,,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:01.162 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,10,"idle",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"starting logical decoding for slot ""pg_4962101_sync_4512642_7073527418723003846""","Streaming transactions committing after 8AF/6FDC8148, reading WAL from 8AF/6FDC8110.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL 8AF/6FDC8148 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4512642_7073527418723003846"
2022-03-17 13:00:01.162 UTC,"lrep_user","dbname",13435,"SOME_IP:50816",623330ce.347b,11,"idle",2022-03-17 12:59:58 UTC,13/0,0,LOG,00000,"logical decoding found consistent point at 8AF/6FDC8110","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4512642_7073527418723003846"" LOGICAL 8AF/6FDC8148 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4512642_7073527418723003846"
#v-

Not sure when this slot got removed, but it's no longer there. By contrast, here is a slot that currently exists and is inactive:

#v+
2022-03-17 13:32:48.401 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,2,"authentication",2022-03-17 13:32:48 UTC,11/150184,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4474811_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:32:48.403 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,3,"SELECT",2022-03-17 13:32:48 UTC,11/0,0,LOG,00000,"duration: 0.408 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4474811_7073527418723003846"
2022-03-17 13:32:48.405 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,4,"BEGIN",2022-03-17 13:32:48 UTC,11/150186,0,LOG,00000,"duration: 0.061 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4474811_7073527418723003846"
2022-03-17 13:32:48.419 UTC,"lrep_user","dbname",28146,"SOME_IP:37178",62333880.6df2,5,"idle in transaction",2022-03-17 13:32:48 UTC,11/150186,0,LOG,00000,"logical decoding found consistent point at 8B7/940D4850","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4474811_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4474811_7073527418723003846"
#v-

And one that is currently active:

#v+
2022-03-17 13:32:48.349 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,2,"authentication",2022-03-17 13:32:48 UTC,11/150181,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4663369_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:32:48.354 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,3,"SELECT",2022-03-17 13:32:48 UTC,11/0,0,LOG,00000,"duration: 0.397 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:32:48.356 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,4,"BEGIN",2022-03-17 13:32:48 UTC,11/150183,0,LOG,00000,"duration: 0.049 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:32:48.381 UTC,"lrep_user","dbname",28138,"10.2.210.180:37160",62333880.6dea,5,"idle in transaction",2022-03-17 13:32:48 UTC,11/150183,0,LOG,00000,"logical decoding found consistent point at 8B7/940D4690","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.788 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,2,"authentication",2022-03-17 13:35:11 UTC,11/150814,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_4962101_sync_4663369_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 13:35:11.790 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,3,"SELECT",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"duration: 0.414 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.793 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,4,"BEGIN",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 0.058 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.805 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,5,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found initial starting point at 8B8/2D0B3DB0","Waiting for transactions (approximately 1) older than 160320063 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:11.915 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,6,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found initial consistent point at 8B8/2D1859D0","Waiting for transactions (approximately 1) older than 160320064 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.035 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,7,"idle in transaction",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"logical decoding found consistent point at 8B8/2D1874C0","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.050 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,8,"SELECT",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 1.056 ms statement: SELECT c.oid, c.relreplident, c.relkind FROM pg_catalog.pg_class c INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 'aaa_shard_ccc' AND c.relname = 'OTHER_TABLE'",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.055 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,9,"SELECT",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 3.441 ms statement: SELECT a.attname, a.atttypid, a.attnum = ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_index i ON (i.indexrelid = pg_get_replica_identity_index(308468275)) WHERE a.attnum > 0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = '' AND a.attrelid = 308468275 ORDER BY a.attnum",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.063 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,10,"COPY",2022-03-17 13:35:11 UTC,11/150816,0,LOG,00000,"duration: 7.618 ms statement: COPY aaa_shard_ccc.OTHER_TABLE TO STDOUT",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,11,"COMMIT",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"duration: 0.051 ms statement: COMMIT",,,,,,,,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,12,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"starting logical decoding for slot ""pg_4962101_sync_4663369_7073527418723003846""","Streaming transactions committing after 8B8/2D1874F8, reading WAL from 8B8/2D0AB818.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.069 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,13,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found initial starting point at 8B8/2D0B3DB0","Waiting for transactions (approximately 1) older than 160320063 to end.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.083 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,14,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found initial consistent point at 8B8/2D1859D0","Waiting for transactions (approximately 1) older than 160320064 to end.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
2022-03-17 13:35:12.090 UTC,"lrep_user","dbname",14139,"10.2.210.180:43592",6233390f.373b,15,"idle",2022-03-17 13:35:11 UTC,11/0,0,LOG,00000,"logical decoding found consistent point at 8B8/2D1874C0","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_4962101_sync_4663369_7073527418723003846"" LOGICAL 8B8/2D1874F8 (proto_version '1', publication_names '""focal14""')",,,"pg_4962101_sync_4663369_7073527418723003846"
#v-

It looks like, for some reason, the worker doesn't get told which table to copy next, and is kind of stuck in the process... What could it be? How can I fix it?

We need to use logical replication because we want to upgrade the OS, and there is a locale problem which prohibits the use of physical replication.

Best regards,

depesz
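For anyone hitting the same pile-up: the leftover initial-sync slots can be inspected (and, once the subscription has been dropped or recreated and no sync worker will come back for them, removed) on the primary. This is only a sketch using the standard pg_replication_slots view and pg_drop_replication_slot(); the LIKE pattern matches the pg_<suboid>_sync_<reloid>_<sysid> names seen above, and dropping a slot a worker still intends to reuse will make that worker fail, so treat it as cleanup, not a fix:

#v+
-- List initial-sync slots that are not currently streaming.
SELECT slot_name, active, catalog_xmin, restart_lsn
  FROM pg_replication_slots
 WHERE slot_name LIKE 'pg\_%\_sync\_%'
   AND NOT active;

-- Drop the inactive leftovers (safe only once the subscription
-- no longer needs them, e.g. after DROP SUBSCRIPTION).
SELECT pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
 WHERE slot_name LIKE 'pg\_%\_sync\_%'
   AND NOT active;
#v-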
Re: Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski
Date:
On Thu, Mar 17, 2022 at 03:24:26PM +0100, hubert depesz lubaczewski wrote: > Hi, > I have following case: > > Primary: Ubuntu Bionic with Pg 12.9, and ~ 60k tables. > Replica: Ubuntu Focal with Pg 14.2. Upgraded primary to 12.10, redid everything. Same problem. Interestingly, now when I look at pg_replication_slots, I see: #v+ slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid| xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn ---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+--------------------- focal14 | pgoutput | logical | 16606 | dbname | f | f | | | 160417199 | 8FA/F601EF68 | 8FA/F87F13E8 pg_6169636_sync_5600099_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17696| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5762839_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17697| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5953088_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17698| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5735455_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17699| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5873494_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3008 | 8FA/F88F3040 pg_6169636_sync_5913846_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17700| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5664472_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17701| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5944971_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17702| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5887578_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17707| | 160478361 | 
9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5597814_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17703| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5706390_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17706| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 pg_6169636_sync_5900789_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3158 | 8FA/F88F3190 pg_6169636_sync_5719614_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3190 | 8FA/F88F31C8 pg_6169636_sync_5754127_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F31C8 | 8FA/F88F3200 pg_6169636_sync_5667313_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3200 | 8FA/F88F3238 pg_6169636_sync_5904579_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3238 | 8FA/F88F3270 pg_6169636_sync_5922710_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3270 | 8FA/F88F32A8 pg_6169636_sync_5698143_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F32A8 | 8FA/F88F32E0 pg_6169636_sync_5579916_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F32E0 | 8FA/F88F3318 pg_6169636_sync_5938883_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3318 | 8FA/F88F3350 pg_6169636_sync_5654015_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3350 | 8FA/F88F3388 pg_6169636_sync_5762125_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3388 | 8FA/F88F33C0 pg_6169636_sync_5760797_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F33C0 | 8FA/F88F33F8 pg_6169636_sync_5851313_7073527418723003846 | pgoutput | 
logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F33F8 | 8FA/F88F3430 pg_6169636_sync_5916213_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3430 | 8FA/F88F3468 pg_6169636_sync_5999341_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3468 | 8FA/F88F34A0 pg_6169636_sync_5896729_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F34A0 | 8FA/F88F34D8 pg_6169636_sync_5748694_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F34A0 | 8FA/F88F34D8 pg_6169636_sync_5603063_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/782067B8 | 8FB/782067F0 pg_6169636_sync_5859189_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3510 | 8FA/F88F3548 pg_6169636_sync_5740988_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/782067F0 | 8FB/78206828 pg_6169636_sync_5780890_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78206828 | 8FB/78206860 pg_6169636_sync_5900300_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78206860 | 8FB/78206898 pg_6169636_sync_5983861_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78609D40 | 8FB/7A9D6028 pg_6169636_sync_5704266_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/7A5C90B0 | 8FB/7A9D6028 pg_6169636_sync_5950291_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD60C0 | 8FB/7AAD60F8 pg_6169636_sync_5687359_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD60F8 | 8FB/7AAD6130 pg_6169636_sync_5591454_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F37B8 | 8FA/F88F37F0 
pg_6169636_sync_5695446_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F37F0 | 8FA/F88F3828 pg_6169636_sync_5654295_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3828 | 8FA/F88F3860 pg_6169636_sync_5959324_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3828 | 8FA/F88F3860 pg_6169636_sync_5653706_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3898 | 8FA/F88F38D0 pg_6169636_sync_5625296_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F38D0 | 8FA/F88F3908 pg_6169636_sync_5599947_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD6130 | 8FB/7AAD6168 pg_6169636_sync_5900341_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD6168 | 8FB/7AAD61A0 pg_6169636_sync_5657835_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD61A0 | 8FB/7AAD61D8 pg_6169636_sync_5929106_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD61D8 | 8FB/7AAD6210 (48 rows) #v- These 10 "active" slots. 
When I checked their data in pg_stat_activity, it looks like they are stuck in "START_REPLICATION SLOT":

#v+
=# select pid, application_name, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state, backend_xid, backend_xmin, query, backend_type from pg_stat_activity where pid in (select active_pid from pg_replication_slots where active);
  pid  │ application_name │ backend_start │ xact_start │ query_start │ state_change │ wait_event_type │ wait_event │ state │ backend_xid │ backend_xmin │ query │ backend_type
 17696 │ pg_6169636_sync_5600099_7073527418723003846 │ 2022-03-17 18:28:53.83822+00 │ [null] │ 2022-03-17 18:28:54.178221+00 │ 2022-03-17 18:28:54.178229+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5600099_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17697 │ pg_6169636_sync_5762839_7073527418723003846 │ 2022-03-17 18:28:53.840978+00 │ [null] │ 2022-03-17 18:28:54.182595+00 │ 2022-03-17 18:28:54.182605+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5762839_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17698 │ pg_6169636_sync_5953088_7073527418723003846 │ 2022-03-17 18:28:53.851774+00 │ [null] │ 2022-03-17 18:28:54.182806+00 │ 2022-03-17 18:28:54.182813+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5953088_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17699 │ pg_6169636_sync_5735455_7073527418723003846 │ 2022-03-17 18:28:53.862408+00 │ [null] │ 2022-03-17 18:28:54.181037+00 │ 2022-03-17 18:28:54.181046+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5735455_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17700 │ pg_6169636_sync_5913846_7073527418723003846 │ 2022-03-17 18:28:53.872853+00 │ [null] │ 2022-03-17 18:28:54.179817+00 │ 2022-03-17 18:28:54.179825+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5913846_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17701 │ pg_6169636_sync_5664472_7073527418723003846 │ 2022-03-17 18:28:53.883557+00 │ [null] │ 2022-03-17 18:28:54.175043+00 │ 2022-03-17 18:28:54.175053+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5664472_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17707 │ pg_6169636_sync_5887578_7073527418723003846 │ 2022-03-17 18:28:53.948718+00 │ [null] │ 2022-03-17 18:28:54.176391+00 │ 2022-03-17 18:28:54.1764+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5887578_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17702 │ pg_6169636_sync_5944971_7073527418723003846 │ 2022-03-17 18:28:53.894247+00 │ [null] │ 2022-03-17 18:28:54.177971+00 │ 2022-03-17 18:28:54.177979+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5944971_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17703 │ pg_6169636_sync_5597814_7073527418723003846 │ 2022-03-17 18:28:53.90505+00 │ [null] │ 2022-03-17 18:28:54.183205+00 │ 2022-03-17 18:28:54.183213+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5597814_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
 17706 │ pg_6169636_sync_5706390_7073527418723003846 │ 2022-03-17 18:28:53.915736+00 │ [null] │ 2022-03-17 18:28:54.186649+00 │ 2022-03-17 18:28:54.186658+00 │ Client │ WalSenderWaitForWAL │ active │ [null] │ [null] │ START_REPLICATION SLOT "pg_6169636_sync_5706390_7073527418723003846" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '"focal14"') │ walsender
(10 rows)
#v-

Please note that now() when I run these queries is ~ 2022-03-18 12:19:36.902819+00, which means that these backends have been stuck for ~ 18 hours now.

The description of wait event "WalSenderWaitForWAL" ("Waiting for WAL to be flushed in WAL sender process.") doesn't seem right, because WAL writes are working OK:

#v+
=# select *, pg_current_wal_lsn() from pg_control_checkpoint() \gx
─[ RECORD 1 ]────────┬─────────────────────────
checkpoint_lsn       │ 9C7/3E1568B0
redo_lsn             │ 9C7/1F7988D0
redo_wal_file        │ 00000002000009C70000001F
timeline_id          │ 2
prev_timeline_id     │ 2
full_page_writes     │ t
next_xid             │ 0:160478730
next_oid             │ 310866725
next_multixact_id    │ 167477
next_multi_offset    │ 501565
oldest_xid           │ 479
oldest_xid_dbid      │ 13398
oldest_active_xid    │ 160478730
oldest_multi_xid     │ 1
oldest_multi_dbid    │ 16606
oldest_commit_ts_xid │ 602
newest_commit_ts_xid │ 160478729
checkpoint_time      │ 2022-03-18 12:17:03+00
pg_current_wal_lsn   │ 9C8/4EFC31E0
#v-

Archiving works; all seems to be well. But the replication slots froze, and then the replica started a bunch more.

If it would help, I can find all things related to these pids (of the active slots), but it might be too much.
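[Editorial note, not part of the original mail: the gap between the frozen slots and the current WAL position can be quantified directly from pg_replication_slots, for example:]

```sql
-- Sketch: measure how far each slot's restart_lsn lags behind the
-- current WAL insert position (run on the primary).
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS behind
FROM pg_replication_slots
ORDER BY restart_lsn;
```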
For now, one, random, pid:

#v+
2022-03-17 18:16:54.684 UTC,,,17706,"10.2.210.180:44612",62337b16.452a,1,"",2022-03-17 18:16:54 UTC,,0,LOG,00000,"connection received: host=10.2.210.180 port=44612",,,,,,,,,""
2022-03-17 18:16:54.691 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,2,"authentication",2022-03-17 18:16:54 UTC,11/57844,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_6169636_sync_5581717_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 18:16:54.693 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,3,"SELECT",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"duration: 0.395 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.694 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,4,"BEGIN",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 0.050 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.710 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,5,"idle in transaction",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"logical decoding found consistent point at 8F8/8CAB5958","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.715 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,6,"SELECT",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 1.130 ms statement: SELECT c.oid, c.relreplident, c.relkind FROM pg_catalog.pg_class c INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 'shard_abcde' AND c.relname = 'alerts_data'",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.720 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,7,"SELECT",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 4.753 ms statement: SELECT a.attname, a.atttypid, a.attnum = ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_index i ON (i.indexrelid = pg_get_replica_identity_index(164329186)) WHERE a.attnum > 0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = '' AND a.attrelid = 164329186 ORDER BY a.attnum",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,8,"COPY",2022-03-17 18:16:54 UTC,11/57846,0,LOG,00000,"duration: 1.878 ms statement: COPY shard_abcde.alerts_data TO STDOUT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.723 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,9,"COMMIT",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"duration: 0.037 ms statement: COMMIT",,,,,,,,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,10,"idle",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"starting logical decoding for slot ""pg_6169636_sync_5581717_7073527418723003846""","Streaming transactions committing after 8F8/8CAB5990, reading WAL from 8F8/8CAB5958.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:16:54.788 UTC,"lrep_user","dbname",17706,"10.2.210.180:44612",62337b16.452a,11,"idle",2022-03-17 18:16:54 UTC,11/0,0,LOG,00000,"logical decoding found consistent point at 8F8/8CAB5958","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5581717_7073527418723003846"" LOGICAL 8F8/8CAB5990 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5581717_7073527418723003846"
2022-03-17 18:28:53.915 UTC,,,17706,"10.2.210.180:35794",62337de5.452a,1,"",2022-03-17 18:28:53 UTC,,0,LOG,00000,"connection received: host=10.2.210.180 port=35794",,,,,,,,,""
2022-03-17 18:28:53.923 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,2,"authentication",2022-03-17 18:28:53 UTC,16/17576,0,LOG,00000,"replication connection authorized: user=lrep_user application_name=pg_6169636_sync_5706390_7073527418723003846 SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)",,,,,,,,,""
2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,3,"SELECT",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"duration: 0.421 ms statement: SELECT pg_catalog.set_config('search_path', '', false);",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.925 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,4,"idle",2022-03-17 18:28:53 UTC,16/0,0,ERROR,42704,"replication slot ""pg_6169636_sync_5706390_7073527418723003846"" does not exist",,,,,,"DROP_REPLICATION_SLOT pg_6169636_sync_5706390_7073527418723003846 WAIT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.926 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,5,"BEGIN",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 0.039 ms statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:53.958 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,6,"idle in transaction",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.154 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,7,"idle in transaction",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"logical decoding found consistent point at 8FB/84D20960","There are no running transactions.",,,,,"CREATE_REPLICATION_SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL pgoutput USE_SNAPSHOT",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.169 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,8,"SELECT",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 1.184 ms statement: SELECT c.oid, c.relreplident, c.relkind FROM pg_catalog.pg_class c INNER JOIN pg_catalog.pg_namespace n ON (c.relnamespace = n.oid) WHERE n.nspname = 'shard_qwerty' AND c.relname = 'sessions'",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.173 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,9,"SELECT",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 2.354 ms statement: SELECT a.attname, a.atttypid, a.attnum = ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_index i ON (i.indexrelid = pg_get_replica_identity_index(163887255)) WHERE a.attnum > 0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = '' AND a.attrelid = 163887255 ORDER BY a.attnum",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.185 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,10,"COPY",2022-03-17 18:28:53 UTC,16/17578,0,LOG,00000,"duration: 11.738 ms statement: COPY shard_qwerty.sessions TO STDOUT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,11,"COMMIT",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"duration: 0.039 ms statement: COMMIT",,,,,,,,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.186 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,12,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"starting logical decoding for slot ""pg_6169636_sync_5706390_7073527418723003846""","Streaming transactions committing after 8FB/84D20998, reading WAL from 8FB/83EB4658.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
2022-03-17 18:28:54.251 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,14,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found consistent point at 8FB/84D20960","There are no running transactions.",,,,,"START_REPLICATION SLOT ""pg_6169636_sync_5706390_7073527418723003846"" LOGICAL 8FB/84D20998 (proto_version '1', publication_names '""focal14""')",,,"pg_6169636_sync_5706390_7073527418723003846"
#v-

What can I do about it?

Best regards,

depesz
On Fri, Mar 18, 2022 at 5:58 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Thu, Mar 17, 2022 at 03:24:26PM +0100, hubert depesz lubaczewski wrote: > > Hi, > > I have following case: > > > > Primary: Ubuntu Bionic with Pg 12.9, and ~ 60k tables. > > Replica: Ubuntu Focal with Pg 14.2. > > Upgraded primary to 12.10, redid everything. Same problem. > > Interestingly, now when I look at pg_replication_slots, I see: > > #v+ > slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid| xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn > ---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+--------------------- > focal14 | pgoutput | logical | 16606 | dbname | f | f | | | 160417199 | 8FA/F601EF68 | 8FA/F87F13E8 > pg_6169636_sync_5600099_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17696| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5762839_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17697| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5953088_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17698| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5735455_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17699| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5873494_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3008 | 8FA/F88F3040 > pg_6169636_sync_5913846_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17700| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5664472_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17701| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5944971_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17702| | 160478361 | 9C5/E5C65BD0 | 
9C5/FF470D58 > pg_6169636_sync_5887578_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17707| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5597814_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17703| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5706390_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | t | 17706| | 160478361 | 9C5/E5C65BD0 | 9C5/FF470D58 > pg_6169636_sync_5900789_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3158 | 8FA/F88F3190 > pg_6169636_sync_5719614_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3190 | 8FA/F88F31C8 > pg_6169636_sync_5754127_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F31C8 | 8FA/F88F3200 > pg_6169636_sync_5667313_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3200 | 8FA/F88F3238 > pg_6169636_sync_5904579_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3238 | 8FA/F88F3270 > pg_6169636_sync_5922710_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3270 | 8FA/F88F32A8 > pg_6169636_sync_5698143_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F32A8 | 8FA/F88F32E0 > pg_6169636_sync_5579916_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F32E0 | 8FA/F88F3318 > pg_6169636_sync_5938883_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3318 | 8FA/F88F3350 > pg_6169636_sync_5654015_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3350 | 8FA/F88F3388 > pg_6169636_sync_5762125_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3388 | 8FA/F88F33C0 > 
pg_6169636_sync_5760797_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F33C0 | 8FA/F88F33F8 > pg_6169636_sync_5851313_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F33F8 | 8FA/F88F3430 > pg_6169636_sync_5916213_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3430 | 8FA/F88F3468 > pg_6169636_sync_5999341_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3468 | 8FA/F88F34A0 > pg_6169636_sync_5896729_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F34A0 | 8FA/F88F34D8 > pg_6169636_sync_5748694_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F34A0 | 8FA/F88F34D8 > pg_6169636_sync_5603063_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/782067B8 | 8FB/782067F0 > pg_6169636_sync_5859189_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3510 | 8FA/F88F3548 > pg_6169636_sync_5740988_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/782067F0 | 8FB/78206828 > pg_6169636_sync_5780890_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78206828 | 8FB/78206860 > pg_6169636_sync_5900300_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78206860 | 8FB/78206898 > pg_6169636_sync_5983861_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/78609D40 | 8FB/7A9D6028 > pg_6169636_sync_5704266_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417359 | 8FB/7A5C90B0 | 8FB/7A9D6028 > pg_6169636_sync_5950291_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD60C0 | 8FB/7AAD60F8 > pg_6169636_sync_5687359_7073527418723003846 | pgoutput | logical | 
16606 | dbname | f | f | | | 160417362 | 8FB/7AAD60F8 | 8FB/7AAD6130 > pg_6169636_sync_5591454_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F37B8 | 8FA/F88F37F0 > pg_6169636_sync_5695446_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F37F0 | 8FA/F88F3828 > pg_6169636_sync_5654295_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3828 | 8FA/F88F3860 > pg_6169636_sync_5959324_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3828 | 8FA/F88F3860 > pg_6169636_sync_5653706_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F3898 | 8FA/F88F38D0 > pg_6169636_sync_5625296_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417208 | 8FA/F88F38D0 | 8FA/F88F3908 > pg_6169636_sync_5599947_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD6130 | 8FB/7AAD6168 > pg_6169636_sync_5900341_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD6168 | 8FB/7AAD61A0 > pg_6169636_sync_5657835_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD61A0 | 8FB/7AAD61D8 > pg_6169636_sync_5929106_7073527418723003846 | pgoutput | logical | 16606 | dbname | f | f | | | 160417362 | 8FB/7AAD61D8 | 8FB/7AAD6210 > (48 rows) > #v- > > These 10 "active" slots. 
> [...]
>
> What can I do about it?

> 2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could not find free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try again.",,,,,,,"","logical replication worker",,0

Each tablesync worker needs a replication slot to synchronize the data to the subscriber, so this error means there are not enough replication slots. Since you are replicating around 60k tables, try increasing the max_replication_slots configuration if it is set to a low value.
> 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT

The snapshot builder is waiting for the open transaction to finish. Is it possible that some transactions were started and left open while this logical replication was running? I believe that once the transaction completes, snapshot building and the tablesync operation should finish.

Regards,
Vignesh
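As a quick check of how close the publisher is to that limit, something along these lines can be run on the primary (a sketch; it counts the initial-sync slots, which are the ones named pg_<suboid>_sync_<reloid>_<sysid> in the listing above, against the configured limit):

```sql
-- On the publisher: compare slots in use against the configured limit.
SELECT current_setting('max_replication_slots') AS max_slots,
       count(*) FILTER (WHERE slot_name LIKE 'pg\_%\_sync\_%') AS sync_slots,
       count(*) AS total_slots
FROM pg_replication_slots;
```

If sync_slots keeps growing while total_slots approaches max_slots, the tablesync slots are not being dropped after use.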
Re: Logical replication stops dropping used initial-sync replication slots
From
hubert depesz lubaczewski
Date:
On Mon, Mar 21, 2022 at 07:48:20PM +0530, vignesh C wrote:
> > What can I do about it?
> >
> > 2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could not find free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try again.",,,,,,,"","logical replication worker",,0
> Each tablesync worker needs a replication slot to synchronize the data
> to the subscriber, this error means there are not enough replication
> slots. Try to increase max_replication_slots configuration if it is
> set to lower value as you are using around 60k tables.

yeah, but I don't need 60k replication slots?

Please note that replication (initial sync) *IS* working happily for *some* time, and only then starts "eating" replication slots for whatever reason.

Specifically - I have max_replication_slots set to 50 (per my original mail), and sync workers shouldn't exceed 10.

> > 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT
>
> Snapshot builder is waiting for the open transaction to finish. Is it
> possible that there are some open transactions that were started
> during this logical replication. I felt once the transaction
> completes, snapshot building and tablesync operation should finish.

There are no long open transactions there. This is a test box with *zero* traffic (aside from running, in a loop: create table, drop table - but *NOT* in a transaction).

Best regards,

depesz
On Mon, Mar 21, 2022 at 8:13 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Mon, Mar 21, 2022 at 07:48:20PM +0530, vignesh C wrote:
> > > What can I do about it?
> > >
> > > 2022-03-17 13:32:48.119 UTC,,,2923553,,62333880.2c9c21,2,,2022-03-17 13:32:48 UTC,8/447946,23877409,ERROR,53400,"could not find free replication state slot for replication origin with OID 51",,"Increase max_replication_slots and try again.",,,,,,,"","logical replication worker",,0
> > Each tablesync worker needs a replication slot to synchronize the data
> > to the subscriber, this error means there are not enough replication
> > slots. Try to increase max_replication_slots configuration if it is
> > set to lower value as you are using around 60k tables.
>
> yeah, but I don't need 60k replication slots?
>
> Please note that replication (initial sync) *IS* working happily for
> *some* time, and only then starts "eating" replication slots for
> whatever reason.
>
> Specifically - I have max_replication_slots set to 50 (per my original
> mail), and sync workers shouldn't exceed 10.
>
> > > 2022-03-17 18:28:54.189 UTC,"lrep_user","dbname",17706,"10.2.210.180:35794",62337de5.452a,13,"idle",2022-03-17 18:28:53 UTC,16/0,0,LOG,00000,"logical decoding found initial starting point at 8FB/83FCA7D8","Waiting for transactions (approximately 1) older than 160417372 to end.",,,,,"START_REPLICATION SLOT
> >
> > Snapshot builder is waiting for the open transaction to finish. Is it
> > possible that there are some open transactions that were started
> > during this logical replication. I felt once the transaction
> > completes, snapshot building and tablesync operation should finish.
>
> There are no long open transactions there. This is a test box with *zero*
> traffic (aside from running in a loop: create table, drop table), but
> *NOT* in a transaction.
I could simulate a similar problem by starting a new transaction while creating a subscription, and noticed the following log content:

2022-03-22 08:00:17.868 IST [2680] LOG:  logical decoding found initial starting point at 0/427DC3D0
2022-03-22 08:00:17.868 IST [2680] DETAIL:  Waiting for transactions (approximately 1) older than 748 to end.
2022-03-22 08:00:17.868 IST [2680] STATEMENT:  CREATE_REPLICATION_SLOT "pg_16439_sync_16403_7077745941166836171" LOGICAL pgoutput (SNAPSHOT 'use')

I could see that the following transactions were open (older than 748) in pg_stat_activity:

 5 | postgres | 2653 | | 10 | vignesh | psql | | | -1 | 2022-03-22 07:59:46.147802+05:30 | 2022-03-22 08:00:17.394259+05:30 | 2022-03-22 08:00:23.024178+05:30 | 2022-03-22 08:00:23.024181+05:30 |         |             | active | 747 | 747 | | select * from pg_stat_activity ;   | client backend
 5 | postgres | 2665 | |    |         |      | | |    | 2022-03-22 08:00:03.005628+05:30 | 2022-03-22 08:00:21.653243+05:30 | 2022-03-22 08:00:21.653243+05:30 | 2022-03-22 08:00:21.653243+05:30 | Timeout | VacuumDelay | active |     | 747 | | autovacuum: VACUUM ANALYZE sch1.t4 | autovacuum worker

Similarly, I could see it from pg_locks:

 transactionid |   |      | | | | 747 | | | | 6/11 | 2682 | ShareLock       | f | f | 2022-03-22 08:00:17.932021+05:30
 relation      | 0 | 1262 | | | |     | | | | 3/41 | 2653 | AccessShareLock | t | f |
 transactionid |   |      | | | | 747 | | | | 7/11 | 2680 | ShareLock       | f | f | 2022-03-22 08:00:17.868615+05:30
 transactionid |   |      | | | | 747 | | | | 3/41 | 2653 | ExclusiveLock   | t | f |

I'm not sure if this is the same scenario or not. Could you also confirm from pg_stat_activity and pg_locks whether there are any open transactions? If there is an open transaction, could you see which relation it is on, and whether it comes from the tablesync workers?

Also, as an alternative, if there are no transactions you might be able to take a backup with pg_dump, restore it on the destination, and create the subscription with copy_data set to off.

Regards,
Vignesh
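The check asked for above can be boiled down to one query: any session holding an assigned transaction id (like the psql and autovacuum backends in the excerpt) will make the snapshot builder wait. A sketch:

```sql
-- Sessions that hold an assigned transaction id; the snapshot builder
-- waits for all of these that are older than its cutoff xid.
SELECT pid, backend_type, state, backend_xid, xact_start,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
ORDER BY xact_start NULLS LAST;
```

If this returns nothing while CREATE_REPLICATION_SLOT is stuck on "Waiting for transactions ... to end", the blocker is elsewhere.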
Re: Logical replication stops dropping used initial-sync replication slots
From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 08:37:07AM +0530, vignesh C wrote:
> I could simulate the similar problem by starting a new transaction
> while creating a subscription and notice the following log content:
...
> I'm not sure if this is the same scenario or not.
> Could you also confirm from pg_stat_activity and pg_locks if there are
> any open transactions and if there is any open transaction, could you
> see this transaction is on which relation and are these because of the
> table sync workers?

I don't think it is the same scenario, because this is a totally isolated server, with no applications running, cron stopped, and no access from outside whatsoever.

Also, even if I'd assume that some open transaction can pause the initial sync, which is reasonable, why would it start to leave replication slots open, and start new ones, beyond the number specified in max_sync_workers_per_subscription?

> Also as an alternative you might be able to use pg_dump to take a
> backup and restore it to the destination if there are no transactions
> and create subscription with copy_data as off.

That wouldn't do me any good. The server I'm trying to upgrade now is a test bench to make sure that I can upgrade using logical replication *at all*. Once I have verified that I can, I have a couple hundred servers to upgrade, which will have concurrent access all the time.

If the problem were "upgrade a server that doesn't have any traffic", I could just pg_dump + pg_restore and be done with it. Unfortunately this is not the case.

Best regards,

depesz
On Tue, Mar 22, 2022 at 9:18 AM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Tue, Mar 22, 2022 at 08:37:07AM +0530, vignesh C wrote:
> > I could simulate the similar problem by starting a new transaction
> > while creating a subscription and notice the following log content:
> ...
> > I'm not sure if this is the same scenario or not.
> > Could you also confirm from pg_stat_activity and pg_locks if there are
> > any open transactions and if there is any open transaction, could you
> > see this transaction is on which relation and are these because of the
> > table sync workers?
>
> I don't think it is the same scenario because this is totally isolated
> server, with no applications running, cron stopped and no access from
> outside whatsoever.
> Also, even if I'd assume that some open transaction can pause initial
> sync, which is reasonable, why would it start to leave open replication
> slots, and start new ones, above number specified in
> max_sync_workers_per_subscription?
>
> > Also as an alternative you might be able to use pg_dump to take a
> > backup and restore it to the destination if there are no transactions
> > and create subscription with copy_data as off.
>
> That wouldn't do me any good. The server I'm trying to upgrade now is
> test bench to make sure that I can upgrade using logical replication *at
> all*.
> Once I will have verified that I can, I have couple hundred servers to
> upgrade, that will have concurrent access all the time.

Is it possible to get a reproducible test or a script for this problem?

Regards,
Vignesh
Re: Logical replication stops dropping used initial-sync replication slots
From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > That wouldn't do me any good. The server I'm trying to upgrade now is
> > test bench to make sure that I can upgrade using logical replication *at
> > all*.
> > Once I will have verified that I can, I have couple hundred servers to
> > upgrade, that will have concurrent access all the time.
> Is it possible to get a reproducible test or a script for this problem?

Well, given that we're talking about a 38GB database with 60k tables, I find that unlikely.

But I'll see if I can figure out a way to replicate the problem with something that is shareable.

depesz
Re: Logical replication stops dropping used initial-sync replication slots
From
hubert depesz lubaczewski
Date:
On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > That wouldn't do me any good. The server I'm trying to upgrade now is
> > > test bench to make sure that I can upgrade using logical replication *at
> > > all*.
> > > Once I will have verified that I can, I have couple hundred servers to
> > > upgrade, that will have concurrent access all the time.
> > Is it possible to get a reproducible test or a script for this problem?
>
> Well, given that we're talking about 38GB database with 60k tables
> i find that unlikely.
>
> But I'll try if I can figure out a way to replicate the problem with
> something that is shareable.

OK. I don't think it will be simple/possible to replicate. Reason:

I set up 10 separate environments, all with the same database, on bionic 12.9. Then I created a publication with all tables in specific schemas, and subscribed to it from separate focal pg 14.2 servers. In total, I used 20 servers to run the test 10 times.

Effect: in one case replication didn't finish initial sync.
Currently, it's 2022-03-25 08:32:14.10081+00, and on the one source that had the failed subscription I see:

 slot_name                                   | plugin   | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58
 pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32385 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32378 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32386 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32387 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32379 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32380 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32381 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32382 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32383 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32384 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
(11 rows)

pg_stat_replication shows 10 sync connections, and that's it. Their sent_lsn, write_lsn, flush_lsn, replay_lsn seem to be progressing nicely, but nothing is changing.

pg_stat_activity contains:

=# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
          xact_start           │          query_start          │         backend_type         │ query
═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪═══════════════════════════════════════════════════════════════════
 [null]                        │ [null]                        │ autovacuum launcher          │ [null]
 [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_2993718_7077521990523695096" LOGICAL 8E9/40FF0170 (proto_version '1', publication_names '"focal14"')
 [null]                        │ [null]                        │ pglogical supervisor         │ [null]
 [null]                        │ [null]                        │ logical replication launcher │ [null]
 [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3343043_7077521990523695096" LOGICAL 8E9/40FF0100 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3079712_7077521990523695096" LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3145162_7077521990523695096" LOGICAL 8E9/40FF0138 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3355696_7077521990523695096" LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3099460_7077521990523695096" LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3162651_7077521990523695096" LOGICAL 8E9/40FF0020 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3157454_7077521990523695096" LOGICAL 8E9/40FF0058 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3084198_7077521990523695096" LOGICAL 8E9/40FF0090 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3142573_7077521990523695096" LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names '"focal14"')
 2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
 [null]                        │ [null]                        │ background writer            │ [null]
 [null]                        │ [null]                        │ checkpointer                 │ [null]
 [null]                        │ [null]                        │ walwriter                    │ [null]
(17 rows)

So, it looks to me like there is some random-ish condition that causes the problem.

I tried to isolate the Postgres servers from the world as well as I could: no monitoring, no cron, no external access. Just replication and my simple load generator, running this loop every minute:

    CREATE TABLE depesz_test AS SELECT i, repeat('payload', 50) AS payload FROM generate_series(1,100000) i;
    CREATE INDEX depesz_test_idx ON depesz_test (i);
    DROP TABLE depesz_test;

I'm out of ideas. The good-ish news is that when I tried replication in smaller sets (30k tables, not 60k), and/or with just 2 max_sync_workers_per_subscription instead of 10, it never failed. So I guess I'll go that way.

Best regards,

depesz
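When initial sync stalls like this, the subscriber's own view of progress can help narrow things down: pg_subscription_rel tracks a per-table sync state. A sketch, to be run on the replica:

```sql
-- On the subscriber: how many tables are in each sync state.
-- i = initialize, d = data is being copied, f = finished table copy,
-- s = synchronized, r = ready (normal replication)
SELECT srsubstate, count(*)
FROM pg_subscription_rel
GROUP BY srsubstate
ORDER BY srsubstate;
```

If the counts for states other than 'r' stop moving while the publisher-side walsenders stay "active", the sync workers are stuck rather than merely slow.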
On Fri, Mar 25, 2022 at 2:09 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote: > > On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote: > > > > That wouldn't do me any good. The server I'm trying to upgrade now is > > > > test bench to make sure that I can upgrade using logical replication *at > > > > all*. > > > > Once I will have verified that I can, I have couple hundred servers to > > > > upgrade, that will have concurrent access all the time. > > > Is it possible to get a reproducible test or a script for this problem? > > > > Well, given that we're talking about 38GB database with 60k tables > > i find that unlikely. > > > > But I'll try if I can figure out a way to replicate the problem with > > something that is shareable. > > OK. I don't think it will be simple/possible to replicate. > > Reason: > > Did setup 10 separate environments, all with the same database on bionic > 12.9. > > Then I created publication with all tables in specific schemas, and > subscribed to it from separate focal pg 14.2 servers. > > In total, used 20 servers to replicate the test 10 times. > > Effect: in one case replication didn't finish initial sync. 
>
> Currently, it's 2022-03-25 08:32:14.10081+00, and on the one source that
> had failed with subscription i see:
>
> [replication slot listing trimmed: the inactive focal14 slot plus 10 active pg_3548429_sync_* slots]
>
> pg_stat_replication shows 10 sync connections, and that's it. their sent_lsn,
> write_lsn, flush_lsn, replay_lsn seems to be progressing nicely, but nothing is
> changing.
>
> pg_stat_activity contains:
>
> [output trimmed: 10 walsenders running START_REPLICATION for the sync slots, plus background workers]
>
> So, it looks to me that there is some random-ish condition that causes
> the problem.
>
> I tried to insulate pgs from the world as well as I could: no
> monitoring, no cron, no external access. Just replication and my simple
> load generator, running loop:
>
> CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
> CREATE INDEX depesz_test_idx on depesz_test(i);
> DROP TABLE depesz_test;
>
> every minute.
>
> I'm out of ideas. The good-ish news is that when I tried replication in
> smaller sets (30k tables, not 60k), and/or with just
> 2 max_sync_workers_per_subscription, and not 10, it never failed. so
> I guess I'll go that way.
>

What was your setting for max_logical_replication_workers? It should be more than max_sync_workers_per_subscription, as that is taken from the pool of max_logical_replication_workers.
One possible reason for some inactive slots could be that some tablesync workers exited without finishing the initial sync, and in the meantime new tablesync workers started while the old ones didn't get a chance to clean up. Eventually, you will get the error "Increase max_replication_slots and try again", because we also restrict the number of replication origins to max_replication_slots, and we create one origin for each table copy.

It is good if you are able to proceed, but we might get some clue about the problem you are seeing if you can show the subscriber-side logs.

--
With Regards,
Amit Kapila.
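The origin limit described above can be watched directly on the subscriber, since replication origins are created per table copy and share the max_replication_slots limit there. A sketch:

```sql
-- On the subscriber: replication origins count against the
-- max_replication_slots limit, one per in-progress table copy
-- plus one for the subscription itself.
SELECT current_setting('max_replication_slots') AS max_slots,
       count(*) AS origins_in_use
FROM pg_replication_origin;
```

When origins_in_use reaches max_slots, the "could not find free replication state slot for replication origin" error from the start of this thread is the expected symptom.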
Re: Logical replication stops dropping used initial-sync replication slots
From
hubert depesz lubaczewski
Date:
On Tue, Mar 29, 2022 at 04:57:42PM +0530, Amit Kapila wrote:
> What was your setting for max_logical_replication_workers? It should

4 on source, and 50 on replica. That could be related. Should I increase this on the source?

> be more than max_sync_workers_per_subscription as that is taken from
> the pool of max_logical_replication_workers. One possible reason for

The thing is that, in the next tests, it seemed to *usually* work, and break just sometimes.

> It is good if you are able to proceed but we might get some clue about
> the problem you are seeing if you can show subscriber-side logs.

Will keep that in mind. The boxes I was testing on are gone, but if the problem resurfaces, I'll make the logs available.

Best regards,

depesz
On Tue, Mar 29, 2022 at 5:31 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Tue, Mar 29, 2022 at 04:57:42PM +0530, Amit Kapila wrote:
> > What was your setting for max_logical_replication_workers? It should
>
> 4 on source, and 50 on replica. That could be related. Should I increase this on source?
>

No, that won't matter. Only the number on the replica matters, and that seems to be okay in your case.

> > be more than max_sync_workers_per_subscription as that is taken from
> > the pool of max_logical_replication_workers. One possible reason for
>
> The thing is that, in next tests, it seemed to *usually* work. And break
> just sometimes.
>

It is possible that some random error is occurring during the initial copy; it is difficult to say without seeing the actual error.

> > It is good if you are able to proceed but we might get some clue about
> > the problem you are seeing if you can show subscriber-side logs.
>
> Will keep that in mind. The boxes I was testing it are gone, but if the
> problem will resurface, I'll make the logs available.
>

Thanks, that might help investigate the problem you are facing.

--
With Regards,
Amit Kapila.