"ERROR: deadlock detected" when replicating TRUNCATE - Mailing list pgsql-hackers
| From | Peter Smith |
|---|---|
| Subject | "ERROR: deadlock detected" when replicating TRUNCATE |
| Date | |
| Msg-id | CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com |
| Responses | Re: "ERROR: deadlock detected" when replicating TRUNCATE<br>Re: "ERROR: deadlock detected" when replicating TRUNCATE |
| List | pgsql-hackers |
While doing logical replication testing we encountered a problem which causes a deadlock error to be logged when replicating a TRUNCATE simultaneously to 2 Subscriptions. e.g.

----------
2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for ShareLock on transaction 597; blocked by process 11582.
        Process 11582 waits for ShareLock on relation 16384 of database 14896; blocked by process 11393.
----------

At this time, both the subscriber (apply worker) processes are executing inside the ExecuteTruncateGuts function simultaneously, and they become co-dependent. e.g.

----------
Process 11582

(gdb) bt
#0  0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8, cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at latch.c:1450
#2  0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648) at latch.c:1396
#3  0x000000000093d8cd in WaitLatch (latch=0x7fa191042654, wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4  0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8, lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
#5  0x0000000000954fd5 in WaitOnLock (locallock=0x2fd06d8, owner=0x2fd9a48) at lock.c:1858
#6  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe0370, lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x7ffed5fe0368) at lock.c:1100
#7  0x00000000009511f1 in LockRelationOid (relid=16384, lockmode=5) at lmgr.c:117
#8  0x000000000049e779 in relation_open (relationId=16384, lockmode=5) at relation.c:56
#9  0x00000000005652ff in table_open (relationId=16384, lockmode=5) at table.c:43
#10 0x00000000005c8b5a in reindex_relation (relid=16384, flags=1, params=0x7ffed5fe05f0) at index.c:3990
#11 0x00000000006d2c85 in ExecuteTruncateGuts (explicit_rels=0x3068aa8, relids=0x3068b00, relids_extra=0x3068b58, relids_logged=0x3068bb0, behavior=DROP_RESTRICT, restart_seqs=false) at tablecmds.c:2036
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
#14 0x00000000008ec424 in LogicalRepApplyLoop (last_received=24192624) at worker.c:2564
----------

----------
Process 11393

(gdb) bt
#0  0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fa197917e44 in sleep () from /lib64/libc.so.6
#2  0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
#3  0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0, owner=0x2fd9a48) at lock.c:1873
#4  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe01d0, lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0) at lock.c:1100
#5  0x00000000009531bc in LockAcquire (locktag=0x7ffed5fe01d0, lockmode=5, sessionLock=false, dontWait=false) at lock.c:751
#6  0x0000000000951d86 in XactLockTableWait (xid=597, rel=0x7fa1986e9e08, ctid=0x7ffed5fe0284, oper=XLTW_Update) at lmgr.c:674
#7  0x00000000004f84d8 in heap_update (relation=0x7fa1986e9e08, otid=0x3067dc4, newtup=0x3067dc0, cid=0, crosscheck=0x0, wait=true, tmfd=0x7ffed5fe03b0, lockmode=0x7ffed5fe03ac) at heapam.c:3549
#8  0x00000000004fa5dd in simple_heap_update (relation=0x7fa1986e9e08, otid=0x3067dc4, tup=0x3067dc0) at heapam.c:4222
#9  0x00000000005c9932 in CatalogTupleUpdate (heapRel=0x7fa1986e9e08, otid=0x3067dc4, tup=0x3067dc0) at indexing.c:312
#10 0x0000000000af5774 in RelationSetNewRelfilenode (relation=0x7fa1986fdc80, persistence=112 'p') at relcache.c:3707
#11 0x00000000006d2b4d in ExecuteTruncateGuts (explicit_rels=0x30688b8, relids=0x3068910, relids_extra=0x3068968, relids_logged=0x30689c0, behavior=DROP_RESTRICT, restart_seqs=false) at tablecmds.c:2012
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
----------

The essence of the trouble seems to be that the apply_handle_truncate function never anticipated that it might end up truncating the same table from 2 separate workers (subscriptions), which is what this test case does. This is probably quite an old problem, because the apply_handle_truncate code has not changed much for a long time.

The apply_handle_truncate function (worker.c) follows a very similar pattern to the ExecuteTruncate function (tablecmds.c), but ExecuteTruncate uses a more powerful AccessExclusiveLock than apply_handle_truncate was using.

~~

PSA a patch to make apply_handle_truncate use AccessExclusiveLock, the same as the ExecuteTruncate function does (see the sketch after this message).

PSA a patch adding a test for this scenario.

--------
Kind Regards,
Peter Smith.
Fujitsu Australia
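To illustrate the idea, here is a minimal sketch of the lock-mode change being described; it is not the attached patch. It assumes the PG14-era ExecuteTruncateGuts() signature visible in the backtraces above; the function name apply_handle_truncate_sketch is made up for illustration, and the partition handling, relids_logged bookkeeping, and error reporting of the real apply_handle_truncate() (worker.c) are omitted.

```c
/*
 * Sketch only: open the relations named in the remote TRUNCATE message
 * with AccessExclusiveLock -- the lock level ExecuteTruncate() takes --
 * instead of a weaker lock, so two apply workers truncating the same
 * table serialize up front rather than deadlocking later inside
 * ExecuteTruncateGuts().
 */
#include "postgres.h"

#include "commands/tablecmds.h"
#include "lib/stringinfo.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "replication/logicalproto.h"
#include "replication/logicalrelation.h"
#include "storage/lockdefs.h"

void
apply_handle_truncate_sketch(StringInfo s)
{
	bool		cascade = false;
	bool		restart_seqs = false;
	List	   *remote_relids;
	List	   *rels = NIL;
	List	   *relids = NIL;
	List	   *relids_extra = NIL;
	List	   *relids_logged = NIL;	/* kept empty in this sketch */
	ListCell   *lc;

	/* Use the same lock level as ExecuteTruncate() in tablecmds.c. */
	LOCKMODE	lockmode = AccessExclusiveLock;

	remote_relids = logicalrep_read_truncate(s, &cascade, &restart_seqs);
	(void) cascade;				/* not used in this simplified sketch */

	foreach(lc, remote_relids)
	{
		LogicalRepRelId relid = lfirst_oid(lc);
		LogicalRepRelMapEntry *rel;

		/*
		 * With the stronger lock, a second apply worker blocks here
		 * instead of interleaving and deadlocking further down.
		 */
		rel = logicalrep_rel_open(relid, lockmode);

		rels = lappend(rels, rel->localrel);
		relids = lappend_oid(relids, rel->localreloid);
		relids_extra = lappend_int(relids_extra, 0);
	}

	/* The call path both backtraces above end up in. */
	ExecuteTruncateGuts(rels, relids, relids_extra, relids_logged,
						DROP_RESTRICT, restart_seqs);
}
```

The point of the change is ordering: by taking AccessExclusiveLock when the relation is first opened, the two workers contend on the relation lock before either of them starts rewriting catalogs or reindexing, so the ShareLock-on-transaction versus ShareLock-on-relation cycle seen in the deadlock report cannot form.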