DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs
From: Lakshmi Narayanan Sreethar
Subject: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Msg-id: CA+kvmZELXQ4ZD3U=XCXuG3KvFgkuPoN1QrEj8c-rMRodrLOnsg@mail.gmail.com
Responses: Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
List: pgsql-bugs
When setting up logical replication between two tables in different databases running in the same postgres instance, if a "DROP DATABASE" command is executed immediately after the "CREATE SUBSCRIPTION" command, the backend executing DROP DATABASE ends up in a deadlock with the logical replication table synchronization worker. I have attached a .sql file with this report which can be run through the psql frontend to reproduce the deadlock.
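For reference, here is a minimal sketch of the reproducer (the database, publication and subscription names match the outputs further down; the table definition and connection string are placeholders, and the attached .sql file remains the authoritative version). It assumes wal_level = logical and is meant to be fed to psql:

CREATE DATABASE db_publisher;
CREATE DATABASE db_subscriber;
CREATE DATABASE dummy;

\c db_publisher
CREATE TABLE test_table (id int PRIMARY KEY);   -- hypothetical table name
CREATE PUBLICATION test_publication FOR TABLE test_table;

\c db_subscriber
CREATE TABLE test_table (id int PRIMARY KEY);
CREATE SUBSCRIPTION test_subscription
    CONNECTION 'dbname=db_publisher'            -- same instance, different database
    PUBLICATION test_publication;

-- Issued immediately after CREATE SUBSCRIPTION, before the table
-- synchronization worker has finished creating its replication slot:
DROP DATABASE dummy;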
Once the deadlock occurs, the postmaster log gets spammed repeatedly with the following text:
2023-01-13 20:32:28.560 IST [66199] LOG: still waiting for backend with PID 66203 to accept ProcSignalBarrier
2023-01-13 20:32:28.560 IST [66199] STATEMENT: DROP DATABASE dummy;
2023-01-13 20:32:33.565 IST [66199] LOG: still waiting for backend with PID 66203 to accept ProcSignalBarrier
2023-01-13 20:32:33.565 IST [66199] STATEMENT: DROP DATABASE dummy;
....
where the process with PID 66203 is the logical replication worker.
The worker itself is waiting for a "CREATE_REPLICATION_SLOT" command to complete.
gdb backtrace from PID 66203:
#0 0x00007fa636c39057 in epoll_wait () from /usr/lib/libc.so.6
#1 0x0000559cccc7da17 in WaitEventSetWaitBlock (set=0x559cce247da0, cur_timeout=-1, occurred_events=0x7fffb26f7430, nevents=1)
at /pg15.1/src/backend/storage/ipc/latch.c:1489
#2 0x0000559cccc7d7c7 in WaitEventSetWait (set=0x559cce247da0, timeout=-1, occurred_events=0x7fffb26f7430, nevents=1, wait_event_info=100663300)
at /pg15.1/src/backend/storage/ipc/latch.c:1435
#3 0x0000559cccc7c343 in WaitLatchOrSocket (latch=0x7fa6364074d4, wakeEvents=35, sock=5, timeout=-1, wait_event_info=100663300)
at /pg15.1/src/backend/storage/ipc/latch.c:562
#4 0x00007fa6373761df in libpqrcv_PQgetResult (streamConn=0x559cce2496f0)
at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:708
#5 0x00007fa63737602b in libpqrcv_PQexec (streamConn=0x559cce2496f0,
query=0x559cce2480e8 "CREATE_REPLICATION_SLOT \"pg_16403_sync_16396_7188141898123190919\" LOGICAL pgoutput (SNAPSHOT 'use')")
at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:668
#6 0x00007fa6373770da in libpqrcv_create_slot (conn=0x559cce13bc58, slotname=0x559cce1a6ac8 "pg_16403_sync_16396_7188141898123190919", temporary=false, two_phase=false,
snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7fffb26f7728) at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:944
#7 0x0000559cccbe1e71 in LogicalRepSyncTableStart (origin_startpos=0x7fffb26f7728) at /pg15.1/src/backend/replication/logical/tablesync.c:1353
#8 0x0000559cccbedcb0 in start_table_sync (origin_startpos=0x7fffb26f7728, myslotname=0x7fffb26f7730)
at /pg15.1/src/backend/replication/logical/worker.c:3525
#9 0x0000559cccbee712 in ApplyWorkerMain (main_arg=1) at /pg15.1/src/backend/replication/logical/worker.c:3688
#10 0x0000559cccb69cb1 in StartBackgroundWorker () at /pg15.1/src/backend/postmaster/bgworker.c:858
#11 0x0000559cccb7ed6f in do_start_bgworker (rw=0x559cce1a2190) at /pg15.1/src/backend/postmaster/postmaster.c:5823
#12 0x0000559cccb7f50b in maybe_start_bgworkers () at /pg15.1/src/backend/postmaster/postmaster.c:6047
#13 0x0000559cccb7d2ae in sigusr1_handler (postgres_signal_arg=10) at /pg15.1/src/backend/postmaster/postmaster.c:5204
#14 <signal handler called>
#15 0x00007fa636c2e804 in select () from /usr/lib/libc.so.6
#16 0x0000559cccb740f4 in ServerLoop () at /pg15.1/src/backend/postmaster/postmaster.c:1770
#17 0x0000559cccb73570 in PostmasterMain (argc=7, argv=0x559cce139b00) at /pg15.1/src/backend/postmaster/postmaster.c:1478
#18 0x0000559ccc9bf1bc in main (argc=7, argv=0x559cce139b00) at /pg15.1/src/backend/main/main.c:202
Relevant pg_stat_activity output:
db_publisher=# select pid, datname, query from pg_stat_activity order by pid;
pid | datname | query
-------+---------------+--------------------------------------------------------------------------------------------------------------------
66190 | |
66191 | |
66193 | |
66194 | |
66195 | |
66199 | db_subscriber | DROP DATABASE dummy;
66201 | db_subscriber |
66202 | db_publisher | START_REPLICATION SLOT "test_subscription" LOGICAL 0/0 (proto_version '3', publication_names '"test_publication"')
66203 | db_subscriber |
66204 | db_publisher | CREATE_REPLICATION_SLOT "pg_16403_sync_16396_7188141898123190919" LOGICAL pgoutput (SNAPSHOT 'use')
66224 | db_publisher | select pid, datname, query from pg_stat_activity order by pid;
(11 rows)
PID 66204 (executing CREATE_REPLICATION_SLOT) is waiting for a transaction lock, which in turn is held by the backend executing DROP DATABASE (PID 66199), thus completing the cycle and causing the deadlock.
Relevant output from pg_locks:
db_publisher=# select locktype, transactionid, pid, granted from pg_locks where transactionid = 732;
locktype | transactionid | pid | granted
---------------+---------------+-------+---------
transactionid | 732 | 66199 | t
transactionid | 732 | 66204 | f
(2 rows)
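For completeness, the same blocking relationship can be checked on the publisher with pg_blocking_pids(); a query along these lines (using the PIDs from the sessions above) should report 66199 as the blocker of 66204:

db_publisher=# select pid, pg_blocking_pids(pid) as blocked_by
               from pg_stat_activity
               where pid = 66204;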
Regards
Lakshmi Narayanan