"ERROR: deadlock detected" when replicating TRUNCATE - Mailing list pgsql-hackers

From Peter Smith
Subject "ERROR: deadlock detected" when replicating TRUNCATE
Date
Msg-id CAHut+PsNm43p0jM+idTvWwiGZPcP0hGrHMPK9TOAkc+a4UpUqw@mail.gmail.com
List pgsql-hackers
While doing logical replication testing, we encountered a problem that
causes a deadlock error to be logged when a TRUNCATE is replicated
simultaneously to two subscriptions.
e.g.
----------
2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
ShareLock on transaction 597; blocked by process 11582.
Process 11582 waits for ShareLock on relation 16384 of database 14896;
blocked by process 11393.
----------

At this time, both subscriber (apply worker) processes are executing
inside the ExecuteTruncateGuts function simultaneously and become
co-dependent: process 11582 has evidently already set the new relfilenode
(so its transaction 597 holds the updated pg_class row) and is now waiting
in reindex_relation for a ShareLock on relation 16384, while process 11393
already holds a conflicting lock on that relation and is waiting in
RelationSetNewRelfilenode for transaction 597 to complete.

e.g.
----------
Process 11582
(gdb) bt
#0  0x00007fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
latch.c:1450
#2  0x000000000093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
at latch.c:1396
#3  0x000000000093d8cd in WaitLatch (latch=0x7fa191042654,
wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4  0x00000000009660f0 in ProcSleep (locallock=0x2fd06d8,
lockMethodTable=0xcd90a0 <default_lockmethod>) at proc.c:1361
#5  0x0000000000954fd5 in WaitOnLock (locallock=0x2fd06d8,
owner=0x2fd9a48) at lock.c:1858
#6  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe0370,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffed5fe0368) at lock.c:1100
#7  0x00000000009511f1 in LockRelationOid (relid=16384, lockmode=5) at
lmgr.c:117
#8  0x000000000049e779 in relation_open (relationId=16384, lockmode=5)
at relation.c:56
#9  0x00000000005652ff in table_open (relationId=16384, lockmode=5) at
table.c:43
#10 0x00000000005c8b5a in reindex_relation (relid=16384, flags=1,
params=0x7ffed5fe05f0) at index.c:3990
#11 0x00000000006d2c85 in ExecuteTruncateGuts
(explicit_rels=0x3068aa8, relids=0x3068b00, relids_extra=0x3068b58,
relids_logged=0x3068bb0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2036
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
#14 0x00000000008ec424 in LogicalRepApplyLoop (last_received=24192624)
at worker.c:2564
----------
Process 11393
(gdb) bt
#0  0x00007fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fa197917e44 in sleep () from /lib64/libc.so.6
#2  0x0000000000950f84 in DeadLockReport () at deadlock.c:1151
#3  0x0000000000955013 in WaitOnLock (locallock=0x2fd05d0,
owner=0x2fd9a48) at lock.c:1873
#4  0x0000000000953c14 in LockAcquireExtended (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x0) at lock.c:1100
#5  0x00000000009531bc in LockAcquire (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false) at lock.c:751
#6  0x0000000000951d86 in XactLockTableWait (xid=597,
rel=0x7fa1986e9e08, ctid=0x7ffed5fe0284, oper=XLTW_Update) at
lmgr.c:674
#7  0x00000000004f84d8 in heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, newtup=0x3067dc0, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffed5fe03b0, lockmode=0x7ffed5fe03ac) at heapam.c:3549
#8  0x00000000004fa5dd in simple_heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at heapam.c:4222
#9  0x00000000005c9932 in CatalogTupleUpdate (heapRel=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at indexing.c:312
#10 0x0000000000af5774 in RelationSetNewRelfilenode
(relation=0x7fa1986fdc80, persistence=112 'p') at relcache.c:3707
#11 0x00000000006d2b4d in ExecuteTruncateGuts
(explicit_rels=0x30688b8, relids=0x3068910, relids_extra=0x3068968,
relids_logged=0x30689c0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2012
#12 0x00000000008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x00000000008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
----------

The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated that it might end up truncating the same table
from two separate workers (subscriptions), which is exactly what this test
case does. This is probably quite an old problem, because the
apply_handle_truncate code has not changed much for a long time. The
apply_handle_truncate function (worker.c) follows a very similar pattern
to the ExecuteTruncate function (tablecmds.c), but ExecuteTruncate takes
the stronger AccessExclusiveLock, whereas apply_handle_truncate was using
a weaker lock.

~~

PSA a patch to make apply_handle_truncate use AccessExclusiveLock, the
same as the ExecuteTruncate function does.
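
For illustration only (the attached patch is the authoritative change),
the idea amounts to something like the fragment below from
apply_handle_truncate; I am paraphrasing the surrounding loop and helper
calls from memory of worker.c rather than copying from the patch, so
treat those details as assumptions.
----------
	/*
	 * Sketch of the proposed locking change in apply_handle_truncate()
	 * (src/backend/replication/logical/worker.c).  Open each replicated
	 * relation with AccessExclusiveLock, matching ExecuteTruncate(), so
	 * that a second apply worker blocks here and cannot interleave with
	 * the first worker inside ExecuteTruncateGuts().
	 */
	foreach(lc, remote_relids)
	{
		LogicalRepRelId relid = lfirst_oid(lc);
		LogicalRepRelMapEntry *rel;

		/* Previously a weaker lock level was requested here. */
		rel = logicalrep_rel_open(relid, AccessExclusiveLock);
		if (!should_apply_changes_for_rel(rel))
		{
			/* Not replicated by this subscription; release and skip. */
			logicalrep_rel_close(rel, AccessExclusiveLock);
			continue;
		}

		remote_rels = lappend(remote_rels, rel);
		/* ... remainder of apply_handle_truncate() unchanged ... */
	}
----------
Taking AccessExclusiveLock up front means the second worker waits before
it touches pg_class at all, so one worker's truncate (including the
reindex) completes before the other begins, removing the lock-ordering
cycle shown in the backtraces above.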

PSA a patch adding a test for this scenario.

--------
Kind Regards,
Peter Smith.
Fujitsu Australia

