Thread: DROP DATABASE deadlocks with logical replication worker in PG 15.1
DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
Lakshmi Narayanan Sreethar
Date:
When setting up logical replication between two tables in different databases running in the same postgres instance, if a "DROP DATABASE" command is executed immediately after the "CREATE SUBSCRIPTION" command, the backend executing DROP DATABASE ends up in a deadlock with the logical replication table synchronization worker. I have attached a .sql file with this report which can be run through the psql frontend to reproduce the deadlock.
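The attached script is not reproduced here, but based on the description and the outputs further below, the sequence is roughly of the following shape (a minimal sketch only; the table definition and connection string are illustrative, while the database, publication and subscription names are taken from the pg_stat_activity output):

CREATE DATABASE db_publisher;
CREATE DATABASE db_subscriber;
CREATE DATABASE dummy;

\c db_publisher
CREATE TABLE tbl (a int PRIMARY KEY);
CREATE PUBLICATION test_publication FOR TABLE tbl;

\c db_subscriber
CREATE TABLE tbl (a int PRIMARY KEY);
CREATE SUBSCRIPTION test_subscription
    CONNECTION 'dbname=db_publisher'  -- publisher database on the same instance
    PUBLICATION test_publication;

-- Issued immediately after CREATE SUBSCRIPTION, while the tablesync worker
-- is still creating its replication slot on the publisher:
DROP DATABASE dummy;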
Once the deadlock occurs, the postmaster log gets spammed with the following text repeatedly :
2023-01-13 20:32:28.560 IST [66199] LOG: still waiting for backend with PID 66203 to accept ProcSignalBarrier
2023-01-13 20:32:28.560 IST [66199] STATEMENT: DROP DATABASE dummy;
2023-01-13 20:32:33.565 IST [66199] LOG: still waiting for backend with PID 66203 to accept ProcSignalBarrier
2023-01-13 20:32:33.565 IST [66199] STATEMENT: DROP DATABASE dummy;
....
where the process with PID 66203 is the logical replication worker.
The worker itself is waiting for a "CREATE_REPLICATION_SLOT" command to complete.
gdb backtrace from PID 66203 :
#0 0x00007fa636c39057 in epoll_wait () from /usr/lib/libc.so.6
#1 0x0000559cccc7da17 in WaitEventSetWaitBlock (set=0x559cce247da0, cur_timeout=-1, occurred_events=0x7fffb26f7430, nevents=1)
at /pg15.1/src/backend/storage/ipc/latch.c:1489
#2 0x0000559cccc7d7c7 in WaitEventSetWait (set=0x559cce247da0, timeout=-1, occurred_events=0x7fffb26f7430, nevents=1, wait_event_info=100663300)
at /pg15.1/src/backend/storage/ipc/latch.c:1435
#3 0x0000559cccc7c343 in WaitLatchOrSocket (latch=0x7fa6364074d4, wakeEvents=35, sock=5, timeout=-1, wait_event_info=100663300)
at /pg15.1/src/backend/storage/ipc/latch.c:562
#4 0x00007fa6373761df in libpqrcv_PQgetResult (streamConn=0x559cce2496f0)
at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:708
#5 0x00007fa63737602b in libpqrcv_PQexec (streamConn=0x559cce2496f0,
query=0x559cce2480e8 "CREATE_REPLICATION_SLOT \"pg_16403_sync_16396_7188141898123190919\" LOGICAL pgoutput (SNAPSHOT 'use')")
at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:668
#6 0x00007fa6373770da in libpqrcv_create_slot (conn=0x559cce13bc58, slotname=0x559cce1a6ac8 "pg_16403_sync_16396_7188141898123190919", temporary=false, two_phase=false,
snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7fffb26f7728) at /pg15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:944
#7 0x0000559cccbe1e71 in LogicalRepSyncTableStart (origin_startpos=0x7fffb26f7728) at /pg15.1/src/backend/replication/logical/tablesync.c:1353
#8 0x0000559cccbedcb0 in start_table_sync (origin_startpos=0x7fffb26f7728, myslotname=0x7fffb26f7730)
at /pg15.1/src/backend/replication/logical/worker.c:3525
#9 0x0000559cccbee712 in ApplyWorkerMain (main_arg=1) at /pg15.1/src/backend/replication/logical/worker.c:3688
#10 0x0000559cccb69cb1 in StartBackgroundWorker () at /pg15.1/src/backend/postmaster/bgworker.c:858
#11 0x0000559cccb7ed6f in do_start_bgworker (rw=0x559cce1a2190) at /pg15.1/src/backend/postmaster/postmaster.c:5823
#12 0x0000559cccb7f50b in maybe_start_bgworkers () at /pg15.1/src/backend/postmaster/postmaster.c:6047
#13 0x0000559cccb7d2ae in sigusr1_handler (postgres_signal_arg=10) at /pg15.1/src/backend/postmaster/postmaster.c:5204
#14 <signal handler called>
#15 0x00007fa636c2e804 in select () from /usr/lib/libc.so.6
#16 0x0000559cccb740f4 in ServerLoop () at /pg15.1/src/backend/postmaster/postmaster.c:1770
#17 0x0000559cccb73570 in PostmasterMain (argc=7, argv=0x559cce139b00) at /pg15.1/src/backend/postmaster/postmaster.c:1478
#18 0x0000559ccc9bf1bc in main (argc=7, argv=0x559cce139b00) at /pg15.1/src/backend/main/main.c:202
Relevant pg_stat_activity output :
db_publisher=# select pid, datname, query from pg_stat_activity order by pid;
pid | datname | query
-------+---------------+--------------------------------------------------------------------------------------------------------------------
66190 | |
66191 | |
66193 | |
66194 | |
66195 | |
66199 | db_subscriber | DROP DATABASE dummy;
66201 | db_subscriber |
66202 | db_publisher | START_REPLICATION SLOT "test_subscription" LOGICAL 0/0 (proto_version '3', publication_names '"test_publication"')
66203 | db_subscriber |
66204 | db_publisher | CREATE_REPLICATION_SLOT "pg_16403_sync_16396_7188141898123190919" LOGICAL pgoutput (SNAPSHOT 'use')
66224 | db_publisher | select pid, datname, query from pg_stat_activity order by pid;
(11 rows)
PID 66204 (running CREATE_REPLICATION_SLOT) is waiting for a transaction lock that is held by the backend executing DROP DATABASE (PID 66199), thus causing the deadlock.
Relevant output from pg_locks :
db_publisher=# select locktype, transactionid, pid, granted from pg_locks where transactionid = 732;
locktype | transactionid | pid | granted
---------------+---------------+-------+---------
transactionid | 732 | 66199 | t
transactionid | 732 | 66204 | f
(2 rows)
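The same wait dependency can also be confirmed directly with pg_blocking_pids(), for example (using the PIDs from this report):

SELECT pid, pg_blocking_pids(pid)
FROM pg_stat_activity
WHERE pid = 66204;

which should report {66199}, the backend running DROP DATABASE.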
Regards
Lakshmi Narayanan
Attachment
Hi,

Good catch. The problem is here:

On 2023-01-13 20:53:49 +0530, Lakshmi Narayanan Sreethar wrote:
> #7 0x0000559cccbe1e71 in LogicalRepSyncTableStart (origin_startpos=0x7fffb26f7728) at
> /pg15.1/src/backend/replication/logical/tablesync.c:1353

Because the logical rep code explicitly prevents interrupts:

	/*
	 * Create a new permanent logical decoding slot. This slot will be used
	 * for the catchup phase after COPY is done, so tell it to use the
	 * snapshot to make the final data consistent.
	 *
	 * Prevent cancel/die interrupts while creating slot here because it is
	 * possible that before the server finishes this command, a concurrent
	 * drop subscription happens which would complete without removing this
	 * slot leading to a dangling slot on the server.
	 */
	HOLD_INTERRUPTS();
	walrcv_create_slot(LogRepWorkerWalRcvConn,
					   slotname, false /* permanent */ , false /* two_phase */ ,
					   CRS_USE_SNAPSHOT, origin_startpos);
	RESUME_INTERRUPTS();

Which is just completely entirely wrong, independent of this issue even. Not allowing termination for the duration of a command executed over the network?

This is from:

commit 6b67d72b604cb913e39324b81b61ab194d94cba0
Author: Amit Kapila <akapila@postgresql.org>
Date:   2021-03-17 08:15:12 +0530

    Fix race condition in drop subscription's handling of tablesync slots.

    Commit ce0fdbfe97 made tablesync slots permanent and allow Drop
    Subscription to drop such slots. However, it is possible that before
    tablesync worker could get the acknowledgment of slot creation, drop
    subscription stops it and that can lead to a dangling slot on the
    publisher. Prevent cancel/die interrupts while creating a slot in the
    tablesync worker.

    Reported-by: Thomas Munro as per buildfarm
    Author: Amit Kapila
    Reviewed-by: Vignesh C, Takamichi Osumi
    Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com

But this can't be the right fix.

Greetings,

Andres Freund
Hi, On 2023-01-14 08:02:01 -0800, Andres Freund wrote: > Because the logical rep code explicitly prevents interrupts: > > /* > * Create a new permanent logical decoding slot. This slot will be used > * for the catchup phase after COPY is done, so tell it to use the > * snapshot to make the final data consistent. > * > * Prevent cancel/die interrupts while creating slot here because it is > * possible that before the server finishes this command, a concurrent > * drop subscription happens which would complete without removing this > * slot leading to a dangling slot on the server. > */ > HOLD_INTERRUPTS(); > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > CRS_USE_SNAPSHOT, origin_startpos); > RESUME_INTERRUPTS(); > > Which is just completely entirely wrong. Independent of this issue even. Not > allowing termination for the duration of command executed over network? > > This is from: > > commit 6b67d72b604cb913e39324b81b61ab194d94cba0 > Author: Amit Kapila <akapila@postgresql.org> > Date: 2021-03-17 08:15:12 +0530 > > Fix race condition in drop subscription's handling of tablesync slots. > > Commit ce0fdbfe97 made tablesync slots permanent and allow Drop > Subscription to drop such slots. However, it is possible that before > tablesync worker could get the acknowledgment of slot creation, drop > subscription stops it and that can lead to a dangling slot on the > publisher. Prevent cancel/die interrupts while creating a slot in the > tablesync worker. > > Reported-by: Thomas Munro as per buildfarm > Author: Amit Kapila > Reviewed-by: Vignesh C, Takamichi Osumi > Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com > > > But this can't be the right fix. I wonder if we ought to put a Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0) in some of the routines in libpqwalreceiver to protect against issues like this. It'd be easy enough to introduce one accidentally, due to holding an lwlock. Greetings, Andres Freund
On Sat, Jan 14, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote: > > The problem is here: > > On 2023-01-13 20:53:49 +0530, Lakshmi Narayanan Sreethar wrote: > > #7 0x0000559cccbe1e71 in LogicalRepSyncTableStart > > (origin_startpos=0x7fffb26f7728) at > > /pg15.1/src/backend/replication/logical/tablesync.c:1353 > > Because the logical rep code explicitly prevents interrupts: > > /* > * Create a new permanent logical decoding slot. This slot will be used > * for the catchup phase after COPY is done, so tell it to use the > * snapshot to make the final data consistent. > * > * Prevent cancel/die interrupts while creating slot here because it is > * possible that before the server finishes this command, a concurrent > * drop subscription happens which would complete without removing this > * slot leading to a dangling slot on the server. > */ > HOLD_INTERRUPTS(); > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > CRS_USE_SNAPSHOT, origin_startpos); > RESUME_INTERRUPTS(); > > Which is just completely entirely wrong. Independent of this issue even. Not > allowing termination for the duration of command executed over network? > > This is from: > > commit 6b67d72b604cb913e39324b81b61ab194d94cba0 > Author: Amit Kapila <akapila@postgresql.org> > Date: 2021-03-17 08:15:12 +0530 > > Fix race condition in drop subscription's handling of tablesync slots. > > Commit ce0fdbfe97 made tablesync slots permanent and allow Drop > Subscription to drop such slots. However, it is possible that before > tablesync worker could get the acknowledgment of slot creation, drop > subscription stops it and that can lead to a dangling slot on the > publisher. Prevent cancel/die interrupts while creating a slot in the > tablesync worker. > > Reported-by: Thomas Munro as per buildfarm > Author: Amit Kapila > Reviewed-by: Vignesh C, Takamichi Osumi > Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com > > > But this can't be the right fix. > I will look into this and your suggestion in a later email. -- With Regards, Amit Kapila.
On Mon, Jan 16, 2023 at 3:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jan 14, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote: > > > > The problem is here: > > > > On 2023-01-13 20:53:49 +0530, Lakshmi Narayanan Sreethar wrote: > > > #7 0x0000559cccbe1e71 in LogicalRepSyncTableStart > > > (origin_startpos=0x7fffb26f7728) at > > > /pg15.1/src/backend/replication/logical/tablesync.c:1353 > > > > Because the logical rep code explicitly prevents interrupts: > > > > /* > > * Create a new permanent logical decoding slot. This slot will be used > > * for the catchup phase after COPY is done, so tell it to use the > > * snapshot to make the final data consistent. > > * > > * Prevent cancel/die interrupts while creating slot here because it is > > * possible that before the server finishes this command, a concurrent > > * drop subscription happens which would complete without removing this > > * slot leading to a dangling slot on the server. > > */ > > HOLD_INTERRUPTS(); > > walrcv_create_slot(LogRepWorkerWalRcvConn, > > slotname, false /* permanent */ , false /* two_phase */ , > > CRS_USE_SNAPSHOT, origin_startpos); > > RESUME_INTERRUPTS(); > > > > Which is just completely entirely wrong. Independent of this issue even. Not > > allowing termination for the duration of command executed over network? > > > > This is from: > > > > commit 6b67d72b604cb913e39324b81b61ab194d94cba0 > > Author: Amit Kapila <akapila@postgresql.org> > > Date: 2021-03-17 08:15:12 +0530 > > > > Fix race condition in drop subscription's handling of tablesync slots. > > ... ... > > > > > > But this can't be the right fix. > > > > I will look into this > As per my initial analysis, I have added this code to hold/resume interrupts during slot creation due to the test failure (in buildfarm) reported in the email [1]. It is clearly a wrong fix as per the report and discussion in this thread. There is an analysis of the test failure in the email [2] which explains the race condition that leads to test failure. Thinking again about the failure, I feel we can instead change the failed test (t/004_sync.pl) to either ensure that both the walsenders (corresponding to sync worker and apply worker) exits after dropping the subscription and before checking the remaining slots on publisher or wait for slots to become zero in the test. As mentioned in email [2], such a slot created on the server will be dropped before we persist them, so there doesn't appear to be a problem in removing the change to hold/resume interrupts and instead change the test but will study this more and share the same in the next update. [1] - https://www.postgresql.org/message-id/CA%2BhUKGJG9dWpw1cOQ2nzWU8PHjm%3DPTraB%2BKgE5648K9nTfwvxg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1JnDQF_B9vDj7WQz9uufbMfE5wo%3DmdG5fbHq9S7Unfhtg%40mail.gmail.com -- With Regards, Amit Kapila.
On Sat, Jan 14, 2023 at 10:50 PM Andres Freund <andres@anarazel.de> wrote: > On 2023-01-14 08:02:01 -0800, Andres Freund wrote: > > I wonder if we ought to put a > Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0) > > in some of the routines in libpqwalreceiver to protect against issues like > this. It'd be easy enough to introduce one accidentally, due to holding an > lwlock. > This makes sense to me. How about adding this assert in all routines in libpqwalreceiver that communicate over the network? That means to add assert except for routines libpqrcv_get_senderinfo, libpqrcv_get_conninfo, libpqrcv_check_conninfo, libpqrcv_server_version, libpqrcv_get_backend_pid where we don't communicate over the network. There would be one exception to the above which is that we can't add this assert in libpqrcv_disconnect as that will be called at the time of proc exit and we set InterruptHoldoffCount as 1 in proc_exit_prepare. I think that should be fine. -- With Regards, Amit Kapila.
Hi, On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > As per my initial analysis, I have added this code to hold/resume > interrupts during slot creation due to the test failure (in buildfarm) > reported in the email [1]. It is clearly a wrong fix as per the report > and discussion in this thread. Yea. You really can never hold interrupts across some thing that could indefinitely be blocked. A HOLD_INTERRUPTS() while doing error recovery (as in DisableSubscriptionAndExit()) is fine, that's basically a finite amount of work. But doing so while issuing SQL commands to another node, or anything else that could just block indefinitely, isn't. > There is an analysis of the test > failure in the email [2] which explains the race condition that leads > to test failure. Thinking again about the failure, I feel we can > instead change the failed test (t/004_sync.pl) to either ensure that > both the walsenders (corresponding to sync worker and apply worker) > exits after dropping the subscription and before checking the > remaining slots on publisher or wait for slots to become zero in the > test. How about waiting for the table to start to be synced (and thus the slot to be created) before issuing the drop subscription? If the slot hadn't yet been created, the test doesn't prove that we successfully clean up... Greetings, Andres Freund
On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > There is an analysis of the test > > failure in the email [2] which explains the race condition that leads > > to test failure. Thinking again about the failure, I feel we can > > instead change the failed test (t/004_sync.pl) to either ensure that > > both the walsenders (corresponding to sync worker and apply worker) > > exits after dropping the subscription and before checking the > > remaining slots on publisher or wait for slots to become zero in the > > test. > > How about waiting for the table to start to be synced (and thus the slot to be > created) before issuing the drop subscription? > In this test [1], the initial sync fails due to a unique constraint violation, so checking that the sync has started is a bit tricky. We can probably check sync_error_count in pg_stat_subscription_stats to ensure that sync has started to fail which will ideally ensure that the sync has started. I am not sure this would be completely safe. The other possible ways are (a) after creating a subscription, wait for two slots to get created in the publisher, and then after dropping subscription wait for slots to become zero on the publisher; (b) after dropping the subscription, wait for slots to become zero. I think one of (a) or (b) will work. [1] # Table tap_rep already has the same records on both publisher and subscriber # at this time. Recreate the subscription which will do the initial copy of # the table again and fails due to unique constraint violation. $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub" ); ... ... -- With Regards, Amit Kapila.
RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, January 18, 2023 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > There is an analysis of the test > > > failure in the email [2] which explains the race condition that > > > leads to test failure. Thinking again about the failure, I feel we > > > can instead change the failed test (t/004_sync.pl) to either ensure > > > that both the walsenders (corresponding to sync worker and apply > > > worker) exits after dropping the subscription and before checking > > > the remaining slots on publisher or wait for slots to become zero in > > > the test. > > > > How about waiting for the table to start to be synced (and thus the > > slot to be > > created) before issuing the drop subscription? > > > > In this test [1], the initial sync fails due to a unique constraint violation, so > checking that the sync has started is a bit tricky. We can probably check > sync_error_count in pg_stat_subscription_stats to ensure that sync has started to > fail which will ideally ensure that the sync has started. I am not sure this would be > completely safe. The other possible ways are (a) after creating a subscription, > wait for two slots to get created in the publisher, and then after dropping > subscription wait for slots to become zero on the publisher; (b) after dropping > the subscription, wait for slots to become zero. > > I think one of (a) or (b) will work. I think in the mentioned testcase, the tablesync worker will keep restarting which means the table sync slot is also being dropped and re-created ... . So, (a) waiting for two slots to get created might not work as the slot will get dropped soon. I think (b) waiting for slot to become zero would be a simpler way to make the test stable. And here are the patches that tries to do it for all affected branches. Best regards, Hou zj
Attachment
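For reference, option (b) boils down to having the test poll the publisher, after the DROP SUBSCRIPTION, until no replication slots remain before checking for leftovers. The condition being waited on is essentially the following (a sketch only; the exact test change is in the attached patches, which are not shown here):

SELECT count(*) = 0 FROM pg_replication_slots;

In a TAP test this condition would typically be passed to the publisher node's poll_query_until(), so that the check only proceeds once both the apply worker's slot and the tablesync slot have been dropped.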
RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
"houzj.fnst@fujitsu.com"
Date:
On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > On Wednesday, January 18, 2023 12:32 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > > > There is an analysis of the test > > > > failure in the email [2] which explains the race condition that > > > > leads to test failure. Thinking again about the failure, I feel we > > > > can instead change the failed test (t/004_sync.pl) to either > > > > ensure that both the walsenders (corresponding to sync worker and > > > > apply > > > > worker) exits after dropping the subscription and before checking > > > > the remaining slots on publisher or wait for slots to become zero > > > > in the test. > > > > > > How about waiting for the table to start to be synced (and thus the > > > slot to be > > > created) before issuing the drop subscription? > > > > > > > In this test [1], the initial sync fails due to a unique constraint > > violation, so checking that the sync has started is a bit tricky. We > > can probably check sync_error_count in pg_stat_subscription_stats to > > ensure that sync has started to fail which will ideally ensure that > > the sync has started. I am not sure this would be completely safe. The > > other possible ways are (a) after creating a subscription, wait for > > two slots to get created in the publisher, and then after dropping > > subscription wait for slots to become zero on the publisher; (b) after dropping > the subscription, wait for slots to become zero. > > > > I think one of (a) or (b) will work. > > I think in the mentioned testcase, the tablesync worker will keep restarting which > means the table sync slot is also being dropped and re-created ... . So, (a) waiting > for two slots to get created might not work as the slot will get dropped soon. I > think (b) waiting for slot to become zero would be a simpler way to make the test > stable. And here are the patches that tries to do it for all affected branches. When testing the patches on back-branches, I find that the reported deadlock problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176 which introduces the WaitForProcSignalBarrier logic in dropdb(). After this commit, when executing the bug-reproduction sql script, the DROP DATABASE will wait for the table sync worker to accept ProcSignalBarrier which can cause the reported deadlock problem. But I think it's still worth fixing this on PG14 as well, as it would be better to allow termination when executing command over network. Best regards, Hou zj
On Thu, Jan 19, 2023 at 3:46 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, January 18, 2023 12:32 PM Amit Kapila > > <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > > > > > There is an analysis of the test > > > > > failure in the email [2] which explains the race condition that > > > > > leads to test failure. Thinking again about the failure, I feel we > > > > > can instead change the failed test (t/004_sync.pl) to either > > > > > ensure that both the walsenders (corresponding to sync worker and > > > > > apply > > > > > worker) exits after dropping the subscription and before checking > > > > > the remaining slots on publisher or wait for slots to become zero > > > > > in the test. > > > > > > > > How about waiting for the table to start to be synced (and thus the > > > > slot to be > > > > created) before issuing the drop subscription? > > > > > > > > > > In this test [1], the initial sync fails due to a unique constraint > > > violation, so checking that the sync has started is a bit tricky. We > > > can probably check sync_error_count in pg_stat_subscription_stats to > > > ensure that sync has started to fail which will ideally ensure that > > > the sync has started. I am not sure this would be completely safe. The > > > other possible ways are (a) after creating a subscription, wait for > > > two slots to get created in the publisher, and then after dropping > > > subscription wait for slots to become zero on the publisher; (b) after dropping > > the subscription, wait for slots to become zero. > > > > > > I think one of (a) or (b) will work. > > > > I think in the mentioned testcase, the tablesync worker will keep restarting which > > means the table sync slot is also being dropped and re-created ... . So, (a) waiting > > for two slots to get created might not work as the slot will get dropped soon. I > > think (b) waiting for slot to become zero would be a simpler way to make the test > > stable. And here are the patches that tries to do it for all affected branches. > > When testing the patches on back-branches, I find that the reported deadlock > problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176 > which introduces the WaitForProcSignalBarrier logic in dropdb(). After this > commit, when executing the bug-reproduction sql script, the DROP DATABASE will > wait for the table sync worker to accept ProcSignalBarrier which can cause the > reported deadlock problem. > > But I think it's still worth fixing this on PG14 as well, as it would be better > to allow termination when executing command over network. > I also think that this should be backpatched in PG14 as well. -- With Regards, Amit Kapila.
On Thu, 19 Jan 2023 at 12:44, houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, January 18, 2023 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > > > There is an analysis of the test > > > > failure in the email [2] which explains the race condition that > > > > leads to test failure. Thinking again about the failure, I feel we > > > > can instead change the failed test (t/004_sync.pl) to either ensure > > > > that both the walsenders (corresponding to sync worker and apply > > > > worker) exits after dropping the subscription and before checking > > > > the remaining slots on publisher or wait for slots to become zero in > > > > the test. > > > > > > How about waiting for the table to start to be synced (and thus the > > > slot to be > > > created) before issuing the drop subscription? > > > > > > > In this test [1], the initial sync fails due to a unique constraint violation, so > > checking that the sync has started is a bit tricky. We can probably check > > sync_error_count in pg_stat_subscription_stats to ensure that sync has started to > > fail which will ideally ensure that the sync has started. I am not sure this would be > > completely safe. The other possible ways are (a) after creating a subscription, > > wait for two slots to get created in the publisher, and then after dropping > > subscription wait for slots to become zero on the publisher; (b) after dropping > > the subscription, wait for slots to become zero. > > > > I think one of (a) or (b) will work. > > I think in the mentioned testcase, the tablesync worker will keep restarting which > means the table sync slot is also being dropped and re-created ... . So, (a) waiting for > two slots to get created might not work as the slot will get dropped soon. I > think (b) waiting for slot to become zero would be a simpler way to make the test > stable. And here are the patches that tries to do it for all affected branches. Thanks for the patch, the deadlock issue gets resolved with the shared patch. I had tested in HEAD, PG15 and PG14 and found no issues and make check-world works fine. Regards, Vignesh
On Thu, Jan 19, 2023 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 3:46 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > On Wednesday, January 18, 2023 12:32 PM Amit Kapila > > > <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > > > > > > > There is an analysis of the test > > > > > > failure in the email [2] which explains the race condition that > > > > > > leads to test failure. Thinking again about the failure, I feel we > > > > > > can instead change the failed test (t/004_sync.pl) to either > > > > > > ensure that both the walsenders (corresponding to sync worker and > > > > > > apply > > > > > > worker) exits after dropping the subscription and before checking > > > > > > the remaining slots on publisher or wait for slots to become zero > > > > > > in the test. > > > > > > > > > > How about waiting for the table to start to be synced (and thus the > > > > > slot to be > > > > > created) before issuing the drop subscription? > > > > > > > > > > > > > In this test [1], the initial sync fails due to a unique constraint > > > > violation, so checking that the sync has started is a bit tricky. We > > > > can probably check sync_error_count in pg_stat_subscription_stats to > > > > ensure that sync has started to fail which will ideally ensure that > > > > the sync has started. I am not sure this would be completely safe. The > > > > other possible ways are (a) after creating a subscription, wait for > > > > two slots to get created in the publisher, and then after dropping > > > > subscription wait for slots to become zero on the publisher; (b) after dropping > > > the subscription, wait for slots to become zero. > > > > > > > > I think one of (a) or (b) will work. > > > > > > I think in the mentioned testcase, the tablesync worker will keep restarting which > > > means the table sync slot is also being dropped and re-created ... . So, (a) waiting > > > for two slots to get created might not work as the slot will get dropped soon. I > > > think (b) waiting for slot to become zero would be a simpler way to make the test > > > stable. And here are the patches that tries to do it for all affected branches. > > > > When testing the patches on back-branches, I find that the reported deadlock > > problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176 > > which introduces the WaitForProcSignalBarrier logic in dropdb(). After this > > commit, when executing the bug-reproduction sql script, the DROP DATABASE will > > wait for the table sync worker to accept ProcSignalBarrier which can cause the > > reported deadlock problem. > > > > But I think it's still worth fixing this on PG14 as well, as it would be better > > to allow termination when executing command over network. > > > > I also think that this should be backpatched in PG14 as well. > I have verified it and it looks good to me. I'll push and backpatch this till PG14 on Tuesday unless there are more comments/suggestions. -- With Regards, Amit Kapila.
On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I have verified it and it looks good to me. I'll push and backpatch > this till PG14 on Tuesday unless there are more comments/suggestions. > Pushed. -- With Regards, Amit Kapila.
Hi, On January 23, 2023 9:35:07 PM PST, Amit Kapila <amit.kapila16@gmail.com> wrote: >On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> >> I have verified it and it looks good to me. I'll push and backpatch >> this till PG14 on Tuesday unless there are more comments/suggestions. >> > >Pushed. Thanks!
Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
"Andrey M. Borodin"
Date:
> On 24 Jan 2023, at 10:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I have verified it and it looks good to me. I'll push and backpatch
>> this till PG14 on Tuesday unless there are more comments/suggestions.
>
> Pushed.

Hi!

I've observed a similar problem on one of our production clusters. Unfortunately, I barely managed to capture some debug information. The cluster was deleted right after I started investigating the problem.

PG 15.3.

"Drop database" was hanging in

(gdb) bt
#0 0x00007f215baa7907 in epoll_wait (epfd=13, events=0x563a71d25278, maxevents=1, timeout=timeout@entry=5000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffe77d40b0, cur_timeout=5000, set=0x563a71d25218) at ./build/../src/backend/storage/ipc/latch.c:1489
#2 WaitEventSetWait (set=0x563a71d25218, timeout=timeout@entry=5000, occurred_events=occurred_events@entry=0x7fffe77d4100, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) at ./build/../src/backend/storage/ipc/latch.c:1435
#3 0x0000563a71388c88 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=5000, wait_event_info=wait_event_info@entry=134217765) at ./build/../src/backend/storage/ipc/latch.c:497
#4 0x0000563a713951f6 in ConditionVariableTimedSleep (cv=cv@entry=0x7f2150627f1c, timeout=timeout@entry=5000, wait_event_info=wait_event_info@entry=134217765) at ./build/../src/backend/storage/lmgr/condition_variable.c:163
#5 0x0000563a7138e93b in WaitForProcSignalBarrier (generation=2) at ./build/../src/backend/storage/ipc/procsignal.c:415
#6 0x0000563a711b2339 in dropdb (dbname=<optimized out>, missing_ok=<optimized out>, force=<optimized out>) at ./build/../src/backend/commands/dbcommands.c:1690
#7 0x0000563a713b3ef3 in standard_ProcessUtility (pstmt=pstmt@entry=0x563a71d2ef18, queryString=queryString@entry=0x563a71d2e218 "DROP DATABASE \"db1\"", readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/utility.c:797
#8 0x00007f2154894443 in pgss_ProcessUtility (pstmt=0x563a71d2ef18, queryString=0x563a71d2e218 "DROP DATABASE \"db1\"", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1206
#9 0x0000563a713b2015 in PortalRunUtility (portal=portal@entry=0x563a71e1af58, pstmt=pstmt@entry=0x563a71d2ef18, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1158
#10 0x0000563a713b2140 in PortalRunMulti (portal=portal@entry=0x563a71e1af58, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x563a71d2eff8, altdest=altdest@entry=0x563a71d2eff8, qc=qc@entry=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1322
#11 0x0000563a713b25cb in PortalRun (portal=0x563a71e1af58, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x563a71d2eff8, altdest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:791
#12 0x0000563a71054f5e in exec_simple_query (query_string=0x563a71d2e218 "DROP DATABASE \"db1\"") at ./build/../src/backend/tcop/postgres.c:1250
#13 0x0000563a713b0c64 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4593
#14 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518
#15 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
#17 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478
#18 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202
(gdb) select-frame 5
(gdb) p ProcSignal->psh_slot[i]
$1 = {pss_pid = 9131, pss_signalFlags = {0 <repeats 12 times>}, pss_barrierGeneration = {value = 0}, pss_barrierCheckMask = {value = 1}, pss_barrierCV = {mutex = 0 '\000', wakeup = {head = 1565, tail = 1565}}}
(gdb) p generation
$2 = 2

There were 3 inactive logical replication slots with disabled interrupts and a backtrace like this:

#0 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, maxevents=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x1, cur_timeout=-1, set=0x563a71d260d8) at ./build/../src/backend/storage/ipc/latch.c:1489
#2 WaitEventSetWait (set=0x563a71d260d8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffe77d4850, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at ./build/../src/backend/storage/ipc/latch.c:1435
#3 0x0000563a71287747 in secure_read (port=0x563a71d58780, ptr=0x563a719c7140 <PqRecvBuffer>, len=8192) at ./build/../src/backend/libpq/be-secure.c:186
#4 0x0000563a7128e463 in pq_recvbuf () at ./build/../src/backend/libpq/pqcomm.c:957
#5 0x0000563a7128f495 in pq_getbyte () at ./build/../src/backend/libpq/pqcomm.c:1000
#6 0x0000563a713aed02 in SocketBackend (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:351
#7 ReadCommand (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:474
#8 PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4525
#9 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518
#10 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246
#11 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
#12 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478
#13 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202

In pg_stat_activity the query was like "create table...", but the backend type was walsender and it was idle.

Unfortunately, that's all the information that I could collect. I'd appreciate any hints on what information to collect if I see this happen again.

Best regards, Andrey Borodin.
On Tue, Aug 1, 2023 at 6:45 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > On 24 Jan 2023, at 10:35, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> > >> I have verified it and it looks good to me. I'll push and backpatch > >> this till PG14 on Tuesday unless there are more comments/suggestions. > >> > > > > Pushed. > > Hi! > > I've observed similar problem on one of our production clusters. Unfortunately, I barely managed to capture some debuginformation. Cluster was deleted right after I started investigating the problem. > > PG 15.3. > > "Drop database" was hanging in > (gdb) bt > #0 0x00007f215baa7907 in epoll_wait (epfd=13, events=0x563a71d25278, maxevents=1, timeout=timeout@entry=5000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffe77d40b0, cur_timeout=5000, set=0x563a71d25218)at ./build/../src/backend/storage/ipc/latch.c:1489 > #2 WaitEventSetWait (set=0x563a71d25218, timeout=timeout@entry=5000, occurred_events=occurred_events@entry=0x7fffe77d4100,nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/ipc/latch.c:1435 > #3 0x0000563a71388c88 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=5000,wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/ipc/latch.c:497 > #4 0x0000563a713951f6 in ConditionVariableTimedSleep (cv=cv@entry=0x7f2150627f1c, timeout=timeout@entry=5000, wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/lmgr/condition_variable.c:163 > #5 0x0000563a7138e93b in WaitForProcSignalBarrier (generation=2) at ./build/../src/backend/storage/ipc/procsignal.c:415 > #6 0x0000563a711b2339 in dropdb (dbname=<optimized out>, missing_ok=<optimized out>, force=<optimized out>) at ./build/../src/backend/commands/dbcommands.c:1690 > #7 0x0000563a713b3ef3 in standard_ProcessUtility (pstmt=pstmt@entry=0x563a71d2ef18, queryString=queryString@entry=0x563a71d2e218"DROP DATABASE \"db1\"", readOnlyTree=<optimized out>, > context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=0x563a71d2eff8,qc=0x7fffe77d4860) at ./build/../src/backend/tcop/utility.c:797 > #8 0x00007f2154894443 in pgss_ProcessUtility (pstmt=0x563a71d2ef18, queryString=0x563a71d2e218 "DROP DATABASE \"db1\"",readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, > queryEnv=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1206 > #9 0x0000563a713b2015 in PortalRunUtility (portal=portal@entry=0x563a71e1af58, pstmt=pstmt@entry=0x563a71d2ef18, isTopLevel=isTopLevel@entry=true,setHoldSnapshot=setHoldSnapshot@entry=false, > dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1158 > #10 0x0000563a713b2140 in PortalRunMulti (portal=portal@entry=0x563a71e1af58, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x563a71d2eff8, > altdest=altdest@entry=0x563a71d2eff8, qc=qc@entry=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1322 > #11 0x0000563a713b25cb in PortalRun (portal=0x563a71e1af58, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimizedout>, dest=0x563a71d2eff8, altdest=0x563a71d2eff8, qc=0x7fffe77d4860) > at ./build/../src/backend/tcop/pquery.c:791 > #12 
0x0000563a71054f5e in exec_simple_query (query_string=0x563a71d2e218 "DROP DATABASE \"db1\"") at ./build/../src/backend/tcop/postgres.c:1250 > #13 0x0000563a713b0c64 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4593 > #14 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518 > #15 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246 > #16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806 > #17 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478 > #18 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202 > (gdb) select-frame 5 > (gdb) p ProcSignal->psh_slot[i] > $1 = {pss_pid = 9131, pss_signalFlags = {0 <repeats 12 times>}, pss_barrierGeneration = {value = 0}, pss_barrierCheckMask= {value = 1}, pss_barrierCV = {mutex = 0 '\000', wakeup = {head = 1565, tail = 1565}}} > (gdb) p generation > $2 = 2 > > > There were 3 inactive logical replication slots with disabled interrupts and bt like this > 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, maxevents=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #0 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, maxevents=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x1, cur_timeout=-1, set=0x563a71d260d8) at./build/../src/backend/storage/ipc/latch.c:1489 > #2 WaitEventSetWait (set=0x563a71d260d8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7fffe77d4850,nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296)at ./build/../src/backend/storage/ipc/latch.c:1435 > #3 0x0000563a71287747 in secure_read (port=0x563a71d58780, ptr=0x563a719c7140 <PqRecvBuffer>, len=8192) at ./build/../src/backend/libpq/be-secure.c:186 > #4 0x0000563a7128e463 in pq_recvbuf () at ./build/../src/backend/libpq/pqcomm.c:957 > #5 0x0000563a7128f495 in pq_getbyte () at ./build/../src/backend/libpq/pqcomm.c:1000 > #6 0x0000563a713aed02 in SocketBackend (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:351 > #7 ReadCommand (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:474 > #8 PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4525 > #9 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518 > #10 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246 > #11 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806 > #12 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478 > #13 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202 > > in pg_stat_activity query was like "create table...", but backend type was walsender and it was idle. > Does this mean that Drop Database is waiting for walsender instead of tablesync worker? Can you show the pg_stat_activity row of backend for which the Drop Database command is waiting? -- With Regards, Amit Kapila.
RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 1, 2023 9:15 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: Hi, > > On 24 Jan 2023, at 10:35, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > >> > >> > >> I have verified it and it looks good to me. I'll push and backpatch > >> this till PG14 on Tuesday unless there are more comments/suggestions. > >> > > > > Pushed. > > Hi! > > I've observed similar problem on one of our production clusters. Unfortunately, > I barely managed to capture some debug information. Cluster was deleted > right after I started investigating the problem. > > PG 15.3. > > "Drop database" was hanging in > (gdb) bt > #0 0x00007f215baa7907 in epoll_wait (epfd=13, events=0x563a71d25278, > maxevents=1, timeout=timeout@entry=5000) > at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, > occurred_events=0x7fffe77d40b0, cur_timeout=5000, set=0x563a71d25218) > at ./build/../src/backend/storage/ipc/latch.c:1489 > #2 WaitEventSetWait (set=0x563a71d25218, timeout=timeout@entry=5000, > occurred_events=occurred_events@entry=0x7fffe77d4100, > nevents=nevents@entry=1, > wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/ipc/latch.c:1435 > #3 0x0000563a71388c88 in WaitLatch (latch=<optimized out>, > wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=5000, > wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/ipc/latch.c:497 > #4 0x0000563a713951f6 in ConditionVariableTimedSleep > (cv=cv@entry=0x7f2150627f1c, timeout=timeout@entry=5000, > wait_event_info=wait_event_info@entry=134217765) > at ./build/../src/backend/storage/lmgr/condition_variable.c:163 > #5 0x0000563a7138e93b in WaitForProcSignalBarrier (generation=2) > at ./build/../src/backend/storage/ipc/procsignal.c:415 > #6 0x0000563a711b2339 in dropdb (dbname=<optimized out>, > missing_ok=<optimized out>, force=<optimized out>) > at ./build/../src/backend/commands/dbcommands.c:1690 > #7 0x0000563a713b3ef3 in standard_ProcessUtility > (pstmt=pstmt@entry=0x563a71d2ef18, > queryString=queryString@entry=0x563a71d2e218 "DROP DATABASE \"db1\"", > readOnlyTree=<optimized out>, > context=context@entry=PROCESS_UTILITY_TOPLEVEL, > params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, > dest=0x563a71d2eff8, qc=0x7fffe77d4860) > at ./build/../src/backend/tcop/utility.c:797 > #8 0x00007f2154894443 in pgss_ProcessUtility (pstmt=0x563a71d2ef18, > queryString=0x563a71d2e218 "DROP DATABASE \"db1\"", > readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, > params=0x0, > queryEnv=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) > at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1206 > #9 0x0000563a713b2015 in PortalRunUtility > (portal=portal@entry=0x563a71e1af58, pstmt=pstmt@entry=0x563a71d2ef18, > isTopLevel=isTopLevel@entry=true, > setHoldSnapshot=setHoldSnapshot@entry=false, > dest=0x563a71d2eff8, qc=0x7fffe77d4860) > at ./build/../src/backend/tcop/pquery.c:1158 > #10 0x0000563a713b2140 in PortalRunMulti > (portal=portal@entry=0x563a71e1af58, isTopLevel=isTopLevel@entry=true, > setHoldSnapshot=setHoldSnapshot@entry=false, > dest=dest@entry=0x563a71d2eff8, > altdest=altdest@entry=0x563a71d2eff8, qc=qc@entry=0x7fffe77d4860) > at ./build/../src/backend/tcop/pquery.c:1322 > #11 0x0000563a713b25cb in PortalRun (portal=0x563a71e1af58, > count=9223372036854775807, isTopLevel=<optimized out>, > run_once=<optimized out>, dest=0x563a71d2eff8, 
altdest=0x563a71d2eff8, > qc=0x7fffe77d4860) > at ./build/../src/backend/tcop/pquery.c:791 > #12 0x0000563a71054f5e in exec_simple_query > (query_string=0x563a71d2e218 "DROP DATABASE \"db1\"") > at ./build/../src/backend/tcop/postgres.c:1250 > #13 0x0000563a713b0c64 in PostgresMain (dbname=<optimized out>, > username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4593 > #14 0x0000563a7132385a in BackendRun (port=<optimized out>, > port=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:4518 > #15 BackendStartup (port=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:4246 > #16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806 > #17 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:1478 > #18 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) > at ./build/../src/backend/main/main.c:202 > (gdb) select-frame 5 > (gdb) p ProcSignal->psh_slot[i] > $1 = {pss_pid = 9131, pss_signalFlags = {0 <repeats 12 times>}, > pss_barrierGeneration = {value = 0}, pss_barrierCheckMask = {value = 1}, > pss_barrierCV = {mutex = 0 '\000', wakeup = {head = 1565, tail = 1565}}} > (gdb) p generation > $2 = 2 > > > There were 3 inactive logical replication slots with disabled interrupts and bt > like this > 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, > maxevents=1, timeout=timeout@entry=-1) > at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #0 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, > maxevents=1, timeout=timeout@entry=-1) > at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, > occurred_events=0x1, cur_timeout=-1, set=0x563a71d260d8) > at ./build/../src/backend/storage/ipc/latch.c:1489 > #2 WaitEventSetWait (set=0x563a71d260d8, timeout=timeout@entry=-1, > occurred_events=occurred_events@entry=0x7fffe77d4850, > nevents=nevents@entry=1, > wait_event_info=wait_event_info@entry=100663296) > at ./build/../src/backend/storage/ipc/latch.c:1435 > #3 0x0000563a71287747 in secure_read (port=0x563a71d58780, > ptr=0x563a719c7140 <PqRecvBuffer>, len=8192) > at ./build/../src/backend/libpq/be-secure.c:186 > #4 0x0000563a7128e463 in pq_recvbuf () > at ./build/../src/backend/libpq/pqcomm.c:957 > #5 0x0000563a7128f495 in pq_getbyte () > at ./build/../src/backend/libpq/pqcomm.c:1000 > #6 0x0000563a713aed02 in SocketBackend (inBuf=0x7fffe77d4a30) > at ./build/../src/backend/tcop/postgres.c:351 > #7 ReadCommand (inBuf=0x7fffe77d4a30) > at ./build/../src/backend/tcop/postgres.c:474 > #8 PostgresMain (dbname=<optimized out>, username=<optimized out>) > at ./build/../src/backend/tcop/postgres.c:4525 > #9 0x0000563a7132385a in BackendRun (port=<optimized out>, > port=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:4518 > #10 BackendStartup (port=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:4246 > #11 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806 > #12 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) > at ./build/../src/backend/postmaster/postmaster.c:1478 > #13 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) > at ./build/../src/backend/main/main.c:202 > > in pg_stat_activity query was like "create table...", but backend type was > walsender and it was idle. > > Unfortunately, that's all information that I could collect. I'd appreciate any hints > on what information to collect if I'll see this happen again. 
I think it would be better to get the whole info of pg_stat_activity when DROP database hangs. And the log is also needed as it may include log like "still waiting for backend with PID xxx to ...". Besides, you mentioned there were inactive slots, but just to confirm, is there the logical replication running(active) in your environment ? If yes, it would be better to collect the bt of the logical replication workers if exists. This is to check if the hang is related to the logical replication or not. Best Regards, Hou zj
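Concretely, the kind of snapshot being requested above could be captured with queries along these lines the next time the hang is observed (a sketch; the column names are those of the PG 15 catalogs):

SELECT pid, datname, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
ORDER BY pid;

SELECT slot_name, slot_type, database, active, active_pid
FROM pg_replication_slots;

Together with the server log and, if possible, a backtrace of any logical replication worker or walsender that appears stuck, this should show whether DROP DATABASE is again waiting on a backend that is holding off interrupts.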
Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
From
"Andrey M. Borodin"
Date:
Amit, Zhijie, thank you for your replies!

> On 2 Aug 2023, at 10:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Does this mean that Drop Database is waiting for walsender instead of
> tablesync worker? Can you show the pg_stat_activity row of backend for
> which the Drop Database command is waiting?

Yes, there were no active table syncs. Unfortunately, I do not have pg_stat_activity: the cluster was deleted amidst the investigation session.

> On 2 Aug 2023, at 11:12, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> I think it would be better to get the whole info of pg_stat_activity when DROP
> database hangs. And the log is also needed as it may include log like "still
> waiting for backend with PID xxx to ...".

There are no logs: the cluster was operating in a managed service. The "DROP DATABASE" backend was operating with log level ERROR.

> Besides, you mentioned there were inactive slots, but just to confirm, is there
> the logical replication running(active) in your environment ? If yes, it would be better
> to collect the bt of the logical replication workers if exists. This is to
> check if the hang is related to the logical replication or not.

The slots for logical replication were not from Postgres consumers. And the receiving system was not running at all, so there are no relevant logs. BTW there were 2 active slots of physical replication.

Thank you!

Best regards, Andrey Borodin.