Thread: LWLock deadlock and gdb advice
I have a 9.5alpha1 cluster which is locked up. All the user backends seem to be waiting on semop, ultimately from WALInsertLockAcquire.
Is there a way to use gdb to figure out who holds the lock they are waiting for?
It is compiled with both debug and cassert.
I am hoping someone can give me a recipe similar to the one for examining backend memory use in https://wiki.postgresql.org/wiki/Developer_FAQ
example backtrace:
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x000000000067190f in PGSemaphoreLock (sema=0x7f28a98b9468) at pg_sema.c:387
#2 0x00000000006d4b0c in LWLockAcquireCommon (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:1042
#3 LWLockAcquireWithVar (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:916
#4 0x00000000004f3c4f in WALInsertLockAcquire (rdata=0xc5c130, fpw_lsn=0) at xlog.c:1411
#5 XLogInsertRecord (rdata=0xc5c130, fpw_lsn=0) at xlog.c:948
#6 0x00000000004f7aac in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453
#7 0x000000000047e0b0 in ginPlaceToPage (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=<value optimized out>, updateblkno=<value optimized out>, childbuf=0, buildStats=0x0)
at ginbtree.c:418
#8 0x000000000047f3ad in ginInsertValue (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=0x7fffca926460, buildStats=0x0) at ginbtree.c:748
#9 0x0000000000475c8b in ginEntryInsert (ginstate=0x7fffca9267e0, attnum=29784, key=1, category=<value optimized out>, items=0x7f28a0c7b458, nitem=47, buildStats=0x0)
at gininsert.c:234
#10 0x0000000000485ecc in ginInsertCleanup (ginstate=0x7fffca9267e0, vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
#11 0x0000000000487059 in ginHeapTupleFastInsert (ginstate=0x7fffca9267e0, collector=<value optimized out>) at ginfast.c:436
#12 0x00000000004760fa in gininsert (fcinfo=<value optimized out>) at gininsert.c:531
Cheers,
Jeff
On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Is there a way to use gdb to figure out who holds the lock they are waiting
> for?

Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
defined? That might do it.

Otherwise, I suggest dereferencing the "l" argument to LWLockAcquireWithVar()
or something -- set the frame to "3" in your example backtrace, using "f 3".
Then, "p *l". You'll get the tranche id there, and so on.

--
Peter Geoghegan
On Tue, Jun 30, 2015 at 6:25 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Is there a way to use gdb to figure out who holds the lock they are waiting
> > for?
>
> Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
> defined? That might do it.
>
If you define LOCK_DEBUG, then you can check the owner of the
lock [1], which will tell you the exclusive owner of that lock
and can help you debug the problem.
[1]
#ifdef LOCK_DEBUG
struct PGPROC *owner; /* last exclusive owner of the lock */
#endif
} LWLock;
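For what it's worth, the reason that field is enough to identify the culprit is
roughly the sketch below (reconstructed from memory of lwlock.c, so treat the
exact location as an assumption): the backend that wins an exclusive
acquisition records its own PGPROC in the lock.

    /* Sketch, not verbatim source: with LOCK_DEBUG defined, the winner of an
     * exclusive acquisition records itself in the lock somewhere in the
     * LWLockAttemptLock()/LWLockAcquire() path. */
    #ifdef LOCK_DEBUG
        if (mode == LW_EXCLUSIVE)
            lock->owner = MyProc;
    #endif

In gdb that makes the holder visible as lock->owner->pid, which is how the
sessions further down map a stuck WAL insertion lock back to a backend.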
On Mon, Jun 29, 2015 at 5:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Is there a way to use gdb to figure out who holds the lock they are waiting
>> for?
>
> Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
> defined? That might do it.
I hadn't planned on running into this problem, so had not compiled accordingly.
I thought LOCK_DEBUG would produce too much output, but now I see it doesn't print anything unless Trace_lwlocks is on (but it still makes more info available to gdb, as Amit mentioned), so I think that should be OK.
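For reference, the gating looks roughly like the sketch below (reconstructed
from memory of lwlock.c; the exact message text is an assumption): the
LOCK_DEBUG bookkeeping is always maintained, but log output is only emitted
when the trace_lwlocks GUC is turned on.

    #ifdef LOCK_DEBUG
    bool        Trace_lwlocks = false;      /* set via the trace_lwlocks GUC */

    static inline void
    PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
    {
        /* Nothing is printed unless trace_lwlocks is on; the owner field and
         * waiter counts are still kept up to date for gdb either way. */
        if (Trace_lwlocks)
            elog(LOG, "%s: lock %s %d, mode %d",
                 where, T_NAME(lock), T_ID(lock), (int) mode);
    }
    #endif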
Since I messed up the gdb session causing the postmaster to SIGKILL all the children and destroy the evidence, I'll repeat the run compiled with LOCK_DEBUG and see what it looks like in the morning.
Thanks,
Jeff
On Mon, Jun 29, 2015 at 11:28 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 5:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Is there a way to use gdb to figure out who holds the lock they are waiting
>>> for?
>>
>> Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
>> defined? That might do it.
>
> I hadn't planned on running into this problem, so had not compiled accordingly.
>
> I thought LOCK_DEBUG would produce too much output, but now I see it doesn't
> print anything unless Trace_lwlocks is on (but it still makes more info
> available to gdb, as Amit mentioned), so I think that should be OK.
>
> Since I messed up the gdb session, causing the postmaster to SIGKILL all the
> children and destroy the evidence, I'll repeat the run compiled with
> LOCK_DEBUG and see what it looks like in the morning.
I've gotten the LWLock deadlock again. User backend 24841 holds WALInsertLock 7 and is blocked attempting to acquire 6. So it seems to be violating the lock ordering rules (although I don't see that rule spelled out in xlog.c).
The Checkpoint process, meanwhile, has acquired WALInsertLocks 0 through 6 and is blocked on 7.
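That pattern matches how the checkpointer takes the insertion locks;
paraphrased from memory of xlog.c (details may differ slightly),
WALInsertLockAcquireExclusive() walks the slots in order and marks all but the
last one as already flushed past any possible LSN:

    /* Paraphrased sketch of WALInsertLockAcquireExclusive(): the checkpointer
     * grabs every insertion lock in slot order, setting insertingAt of all but
     * the last slot to PG_UINT64_MAX ("already fully up to date"), which is
     * why it can be seen holding slots 0..6 while sleeping on slot 7. */
    static void
    WALInsertLockAcquireExclusive(void)
    {
        int         i;

        for (i = 0; i < NUM_XLOGINSERT_LOCKS - 1; i++)
            LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
                                 &WALInsertLocks[i].l.insertingAt,
                                 PG_UINT64_MAX);
        LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
                             &WALInsertLocks[i].l.insertingAt,
                             0);

        holdingAllLocks = true;
    }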
I'm not sure where to go from here.
24841:
(gdb) f 2
(gdb) p held_lwlocks
$1 = {{lock = 0x7f70e11fae40, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13df080, mode = LW_EXCLUSIVE}, {lock = 0x7f70e11d4280, mode = LW_EXCLUSIVE}, {lock = 0x7f70e11d4280,
mode = LW_EXCLUSIVE}, {lock = 0x0, mode = LW_EXCLUSIVE} <repeats 196 times>}
(gdb) p T_NAME((LWLock *) held_lwlocks[1])
$2 = 0x84cee1 "WALInsertLocks"
(gdb) p T_ID((LWLock *) held_lwlocks[1])
(gdb) p lock
$3 = (LWLock *) 0x7f70e13df000
(gdb) p T_NAME((LWLock *) lock)
$7 = 0x84cee1 "WALInsertLocks"
(gdb) p T_ID((LWLock *) lock)
$8 = 6
(gdb) p *(lock)
$1 = {mutex = 0 '\000', tranche = 1, state = {value = 1627389952}, nwaiters = {value = 2}, waiters = {head = {prev = 0x7f70e9e29e58, next = 0x7f70e9e2a140}},
owner = 0x7f70e9e2d260}
(gdb) p *(lock->owner)
$5 = {links = {prev = 0x0, next = 0x0}, sem = {semId = 539754537, semNum = 0}, waitStatus = 0, procLatch = {is_set = 0, is_shared = 1 '\001', owner_pid = 24820}, lxid = 0,
pid = 24820, pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0, recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0 '\000', lwWaitLink = {
prev = 0x7f70e13df090, next = 0x7f70e13df090}, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0,
next = 0x0}, myProcLocks = {{prev = 0x7f70e9e2d2f0, next = 0x7f70e9e2d2f0}, {prev = 0x7f70e9e2d300, next = 0x7f70e9e2d300}, {prev = 0x7f70e9e2d310, next = 0x7f70e9e2d310}, {
prev = 0x7f70e9e2d320, next = 0x7f70e9e2d320}, {prev = 0x7f70e9e2d330, next = 0x7f70e9e2d330}, {prev = 0x7f70e9e2d340, next = 0x7f70e9e2d340}, {prev = 0x7f70e9e2d350,
next = 0x7f70e9e2d350}, {prev = 0x7f70e9e2d360, next = 0x7f70e9e2d360}, {prev = 0x7f70e9e2d370, next = 0x7f70e9e2d370}, {prev = 0x7f70e9e2d380, next = 0x7f70e9e2d380}, {
prev = 0x7f70e9e2d390, next = 0x7f70e9e2d390}, {prev = 0x7f70e9e2d3a0, next = 0x7f70e9e2d3a0}, {prev = 0x7f70e9e2d3b0, next = 0x7f70e9e2d3b0}, {prev = 0x7f70e9e2d3c0,
next = 0x7f70e9e2d3c0}, {prev = 0x7f70e9e2d3d0, next = 0x7f70e9e2d3d0}, {prev = 0x7f70e9e2d3e0, next = 0x7f70e9e2d3e0}}, subxids = {xids = {0 <repeats 64 times>}},
backendLock = 0x7f70e13dad00, fpLockBits = 0, fpRelId = {0 <repeats 16 times>}, fpVXIDLock = 0 '\000', fpLocalTransactionId = 0}
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x0000000000671aef in PGSemaphoreLock (sema=0x7f70e9e2a108) at pg_sema.c:387
#2 0x00000000006d79ac in LWLockWaitForVar (lock=0x7f70e13df000, valptr=0x7f70e13df028, oldval=0, newval=0x7fffb33cbd48) at lwlock.c:1406
#3 0x00000000004ee006 in WaitXLogInsertionsToFinish (upto=41590734848) at xlog.c:1576
#4 0x00000000004ee94b in AdvanceXLInsertBuffer (upto=41594920960, opportunistic=<value optimized out>) at xlog.c:1888
#5 0x00000000004f3c42 in GetXLogBuffer (ptr=41594920960) at xlog.c:1669
#6 0x00000000004f40e7 in CopyXLogRecordToWAL (rdata=0x19bc920, fpw_lsn=<value optimized out>) at xlog.c:1313
#7 XLogInsertRecord (rdata=0x19bc920, fpw_lsn=<value optimized out>) at xlog.c:1008
#8 0x00000000004f7c6c in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453
#9 0x000000000047e210 in ginPlaceToPage (btree=0x7fffb33cc070, stack=0x1a5bfe0, insertdata=<value optimized out>, updateblkno=<value optimized out>, childbuf=0, buildStats=0x0)
at ginbtree.c:418
#10 0x000000000047f50d in ginInsertValue (btree=0x7fffb33cc070, stack=0x1a5bfe0, insertdata=0x7fffb33cc0f0, buildStats=0x0) at ginbtree.c:748
#11 0x0000000000475deb in ginEntryInsert (ginstate=0x7fffb33cc470, attnum=5544, key=1, category=<value optimized out>, items=0x7f70e10d9140, nitem=51, buildStats=0x0)
at gininsert.c:234
#12 0x000000000048602c in ginInsertCleanup (ginstate=0x7fffb33cc470, vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
#13 0x00000000004871b9 in ginHeapTupleFastInsert (ginstate=0x7fffb33cc470, collector=<value optimized out>) at ginfast.c:436
#14 0x000000000047625a in gininsert (fcinfo=<value optimized out>) at gininsert.c:531
#15 0x00000000007dc483 in FunctionCall6Coll (flinfo=<value optimized out>, collation=<value optimized out>, arg1=<value optimized out>, arg2=<value optimized out>,
arg3=<value optimized out>, arg4=<value optimized out>, arg5=140122789534360, arg6=0) at fmgr.c:1436
#16 0x00000000004b2247 in index_insert (indexRelation=0x7f70e1190e60, values=0x7fffb33cef50, isnull=0x7fffb33cf050 "", heap_t_ctid=0x1a4a794, heapRelation=0x7f70e1185a98,
checkUnique=UNIQUE_CHECK_NO) at indexam.c:226
#17 0x00000000005d2e67 in ExecInsertIndexTuples (slot=0x1a497f0, tupleid=0x1a4a794, estate=0x1a40768, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0)
at execIndexing.c:384
#18 0x00000000005f0161 in ExecUpdate (tupleid=0x7fffb33cf250, oldtuple=0x0, slot=0x1a497f0, planSlot=0x1a42498, epqstate=0x1a40a70, estate=0x1a40768, canSetTag=1 '\001')
at nodeModifyTable.c:978
#19 0x00000000005f0b2a in ExecModifyTable (node=0x1a409d0) at nodeModifyTable.c:1436
#20 0x00000000005d6cc8 in ExecProcNode (node=0x1a409d0) at execProcnode.c:389
#21 0x00000000005d5402 in ExecutePlan (queryDesc=0x1a2b220, direction=<value optimized out>, count=0) at execMain.c:1549
#22 standard_ExecutorRun (queryDesc=0x1a2b220, direction=<value optimized out>, count=0) at execMain.c:337
#23 0x00007f70ea39af3b in pgss_ExecutorRun (queryDesc=0x1a2b220, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:881
#24 0x00000000006ea38f in ProcessQuery (plan=0x1a2b3d8, sourceText=0x1a2af68 "update foo set count=count+1 where text_array @> $1", params=0x1a2afe8, dest=<value optimized out>,
completionTag=0x7fffb33cf660 "") at pquery.c:185
#25 0x00000000006ea5ec in PortalRunMulti (portal=0x19a6128, isTopLevel=1 '\001', dest=0xc55020, altdest=0xc55020, completionTag=0x7fffb33cf660 "") at pquery.c:1279
#26 0x00000000006eadb3 in PortalRun (portal=0x19a6128, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1990630, altdest=0x1990630, completionTag=0x7fffb33cf660 "")
at pquery.c:816
#27 0x00000000006e6ffb in exec_execute_message (portal_name=0x1990218 "", max_rows=9223372036854775807) at postgres.c:1988
#28 0x00000000006e8c15 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x19a3210 "jjanes", username=<value optimized out>) at postgres.c:4088
#29 0x00000000006855dd in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4159
#30 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3835
#31 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1609
#32 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1254
#33 0x0000000000610ab0 in main (argc=4, argv=0x1976cf0) at main.c:221
24820:
(gdb) p held_lwlocks
$1 = {{lock = 0x7f70e11d4340, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13ded00, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13ded80, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13dee00,
mode = LW_EXCLUSIVE}, {lock = 0x7f70e13dee80, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13def00, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13def80, mode = LW_EXCLUSIVE}, {
lock = 0x7f70e13df000, mode = LW_EXCLUSIVE}, {lock = 0x0, mode = LW_EXCLUSIVE} <repeats 192 times>}
(gdb) p lock
$3 = (LWLock *) 0x7f70e13df080
(gdb) p *lock
$4 = {mutex = 0 '\000', tranche = 1, state = {value = 1627389952}, nwaiters = {value = 1}, waiters = {head = {prev = 0x7f70e9e2d2a8, next = 0x7f70e9e2d2a8}},
owner = 0x7f70e9e2a0f8}
(gdb) p *(lock->owner)
$5 = {links = {prev = 0x0, next = 0x0}, sem = {semId = 539688999, semNum = 15}, waitStatus = 0, procLatch = {is_set = 1, is_shared = 1 '\001', owner_pid = 24841}, lxid = 5523,
pid = 24841, pgprocno = 95, backendId = 9, databaseId = 16384, roleId = 10, recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 2 '\002', lwWaitLink = {
prev = 0x7f70e13df010, next = 0x7f70e9e29e58}, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0,
next = 0x0}, myProcLocks = {{prev = 0x7f70e9e2a188, next = 0x7f70e9e2a188}, {prev = 0x7f70e9e2a198, next = 0x7f70e9e2a198}, {prev = 0x7f70e9e2a1a8, next = 0x7f70e9e2a1a8}, {
prev = 0x7f70e9bfe060, next = 0x7f70e9bfe060}, {prev = 0x7f70e9e2a1c8, next = 0x7f70e9e2a1c8}, {prev = 0x7f70e9e2a1d8, next = 0x7f70e9e2a1d8}, {prev = 0x7f70e9e2a1e8,
next = 0x7f70e9e2a1e8}, {prev = 0x7f70e9e2a1f8, next = 0x7f70e9e2a1f8}, {prev = 0x7f70e9e2a208, next = 0x7f70e9e2a208}, {prev = 0x7f70e9e2a218, next = 0x7f70e9e2a218}, {
prev = 0x7f70e9e2a228, next = 0x7f70e9e2a228}, {prev = 0x7f70e9e2a238, next = 0x7f70e9e2a238}, {prev = 0x7f70e9e2a248, next = 0x7f70e9e2a248}, {prev = 0x7f70e9e2a258,
next = 0x7f70e9e2a258}, {prev = 0x7f70e9e2a268, next = 0x7f70e9e2a268}, {prev = 0x7f70e9e2a278, next = 0x7f70e9e2a278}}, subxids = {xids = {0 <repeats 64 times>}},
backendLock = 0x7f70e13da8c0, fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, 0, 0, 0, 0, 0, 2679, 2610, 16414, 16412, 16416, 16415, 16409}, fpVXIDLock = 1 '\001',
fpLocalTransactionId = 5523}
(gdb) bt
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x0000000000671aef in PGSemaphoreLock (sema=0x7f70e9e2d270) at pg_sema.c:387
#2 0x00000000006d70ac in LWLockAcquireCommon (lock=0x7f70e13df080, mode=LW_EXCLUSIVE, valptr=0x7f70e13df0a8, val=0) at lwlock.c:1042
#3 0x00000000004e6dab in WALInsertLockAcquireExclusive () at xlog.c:1449
#4 0x00000000004ef025 in CreateCheckPoint (flags=64) at xlog.c:8177
#5 0x00000000006789d3 in CheckpointerMain () at checkpointer.c:493
#6 0x00000000004fecec in AuxiliaryProcessMain (argc=2, argv=0x7fffb33cf1d0) at bootstrap.c:428
#7 0x0000000000680697 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5143
#8 0x0000000000682b7a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2656
#9 <signal handler called>
#10 0x0000003dcb6e1353 in __select_nocancel () from /lib64/libc.so.6
#11 0x00000000006849bf in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1573
#12 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1254
#13 0x0000000000610ab0 in main (argc=4, argv=0x1976cf0) at main.c:221
Thanks,
Jeff
Jeff Janes wrote:

> I've gotten the LWLock deadlock again.  User backend 24841 holds
> WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
> be violating the lock ordering rules (although I don't see that rule
> spelled out in xlog.c).

Hmm, interesting -- pg_stat_statement is trying to re-do an operation
that involves updating a GIN index, while WAL-logging for the original
update is still ongoing, it seems.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/30/2015 07:05 PM, Jeff Janes wrote:
> I've gotten the LWLock deadlock again.  User backend 24841 holds
> WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
> be violating the lock ordering rules (although I don't see that rule
> spelled out in xlog.c).
>
> The Checkpoint process, meanwhile, has acquired WALInsertLocks 0 through 6
> and is blocked on 7.
>
> I'm not sure where to go from here.

The user backend 24841 is waiting in a LWLockWaitForVar() call, on
WALInsertLock 6, and oldval==0. The checkpointer is holding WALInsertLock 6,
but it has set the insertingat value to PG_UINT64_MAX. The LWLockWaitForVar()
call should not be blocked on that, because 0 != PG_UINT64_MAX.

Looks like the LWLock-scalability patch that made LWLock acquisition use
atomic ops instead of a spinlock broke this. LWLockWaitForVar() is supposed to
guarantee that it always wakes up from sleep when the variable's value
changes, but there is now a race condition in LWLockWaitForVar() that wasn't
there in 9.4:

>         if (mustwait)
>         {
>             /*
>              * Perform comparison using spinlock as we can't rely on atomic 64
>              * bit reads/stores.
>              */
> #ifdef LWLOCK_STATS
>             lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
> #else
>             SpinLockAcquire(&lock->mutex);
> #endif
>
>             /*
>              * XXX: We can significantly optimize this on platforms with 64bit
>              * atomics.
>              */
>             value = *valptr;
>             if (value != oldval)
>             {
>                 result = false;
>                 mustwait = false;
>                 *newval = value;
>             }
>             else
>                 mustwait = true;
>             SpinLockRelease(&lock->mutex);
>         }
>         else
>             mustwait = false;
>
>         if (!mustwait)
>             break;              /* the lock was free or value didn't match */
>
>         /*
>          * Add myself to wait queue. Note that this is racy, somebody else
>          * could wakeup before we're finished queuing. NB: We're using nearly
>          * the same twice-in-a-row lock acquisition protocol as
>          * LWLockAcquire(). Check its comments for details.
>          */
>         LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);

After the spinlock is released above, but before the LWLockQueueSelf() call,
it's possible that another backend comes in, acquires the lock, changes the
variable's value, and releases the lock again. In 9.4, the spinlock was not
released until the process was queued.

Looking at LWLockAcquireWithVar(), it's also no longer holding the spinlock
while it updates the Var. That's not cool either :-(.

- Heikki
On 06/30/2015 07:37 PM, Alvaro Herrera wrote:
> Jeff Janes wrote:
>> I've gotten the LWLock deadlock again.  User backend 24841 holds
>> WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
>> be violating the lock ordering rules (although I don't see that rule
>> spelled out in xlog.c).
>
> Hmm, interesting -- pg_stat_statement is trying to re-do an operation
> that involves updating a GIN index, while WAL-logging for the original
> update is still ongoing, it seems.

I don't think pg_stat_statement has anything to do with this. You can see
from the backtrace that pg_stat_statement is enabled, as the call went
through the pgss_ExecutorRun executor hook, but that's all.

- Heikki
On 2015-06-30 21:08:53 +0300, Heikki Linnakangas wrote:
> After the spinlock is released above, but before the LWLockQueueSelf() call,
> it's possible that another backend comes in, acquires the lock, changes the
> variable's value, and releases the lock again. In 9.4, the spinlock was not
> released until the process was queued.

Hm. Right. A recheck of the value after the queuing should be sufficient to
fix? That's how we deal with the exact same scenarios for the normal lock
state, so that shouldn't be very hard to add.

> Looking at LWLockAcquireWithVar(), it's also no longer holding the spinlock
> while it updates the Var. That's not cool either :-(.

Hm. I'd somehow assumed holding the lwlock ought to be sufficient, but it
really isn't. This var stuff isn't fitting all that well into the lwlock
code.

Greetings,

Andres Freund
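A minimal sketch of that recheck-after-queueing idea, using the names from the
code quoted above (the LWLockConflictsWithVar() helper and the
LWLockDequeueSelf() cleanup are assumptions here, not the committed fix):

    /* Sketch of the wait loop in LWLockWaitForVar(): assume a helper, here
     * called LWLockConflictsWithVar(), that does the spinlock-protected
     * comparison of *valptr against oldval and returns whether we must wait. */
    for (;;)
    {
        bool        mustwait;

        mustwait = LWLockConflictsWithVar(lock, valptr, oldval, &value, &result);
        if (!mustwait)
            break;              /* lock free, or the value already changed */

        LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);

        /* Re-check now that we are on the wait queue: an update made after
         * the first check is either visible here or will wake us up later. */
        mustwait = LWLockConflictsWithVar(lock, valptr, oldval, &value, &result);
        if (!mustwait)
        {
            LWLockDequeueSelf(lock);    /* nothing to wait for after all */
            break;
        }

        /* ... sleep on the process semaphore until woken ... */
    }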
On 06/30/2015 10:09 PM, Andres Freund wrote:
> On 2015-06-30 21:08:53 +0300, Heikki Linnakangas wrote:
>> After the spinlock is released above, but before the LWLockQueueSelf() call,
>> it's possible that another backend comes in, acquires the lock, changes the
>> variable's value, and releases the lock again. In 9.4, the spinlock was not
>> released until the process was queued.
>
> Hm. Right. A recheck of the value after the queuing should be sufficient
> to fix? That's how we deal with the exact same scenarios for the normal
> lock state, so that shouldn't be very hard to add.

Yeah. It's probably more efficient to not release the spinlock between
checking the value and LWLockQueueSelf() though.

>> Looking at LWLockAcquireWithVar(), it's also no longer holding the spinlock
>> while it updates the Var. That's not cool either :-(.
>
> Hm. I'd somehow assumed holding the lwlock ought to be sufficient, but
> it really isn't. This var stuff isn't fitting all that well into the
> lwlock code.

I'll take a stab at fixing this tomorrow. I take the blame for not documenting
the semantics of LWLockAcquireWithVar() and friends well enough...

- Heikki
On 2015-06-30 22:19:02 +0300, Heikki Linnakangas wrote:
>> Hm. Right. A recheck of the value after the queuing should be sufficient
>> to fix? That's how we deal with the exact same scenarios for the normal
>> lock state, so that shouldn't be very hard to add.
>
> Yeah. It's probably more efficient to not release the spinlock between
> checking the value and LWLockQueueSelf() though.

Right now LWLockQueueSelf() takes spinlocks etc itself, and is used that way
in a bunch of callsites... So that'd be harder. Additionally I'm planning to
get rid of the spinlocks around queuing (they show up as bottlenecks in
contended workloads), so it seems more future proof not to assume that either
way. On top of that I think we should, when available (or using the same type
of fallback as for 32bit?), use 64 bit atomics for the var anyway.

> I'll take a stab at fixing this tomorrow.

Thanks! Do you mean both or "just" the second issue?

Greetings,

Andres Freund
On 06/30/2015 11:24 PM, Andres Freund wrote:
>> I'll take a stab at fixing this tomorrow.
>
> Thanks! Do you mean both or "just" the second issue?

Both. Here's the patch.

Previously, LWLockAcquireWithVar set the variable associated with the lock
atomically with acquiring it. Before the lwlock-scalability changes, that was
straightforward because you held the spinlock anyway, but it's a lot
harder/expensive now. So I changed the way acquiring a lock with a variable
works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that the
current lock holder has updated the variable. The LWLockAcquireWithVar
function is gone - you now just use LWLockAcquire(), which always clears the
LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you
want to set the variable immediately. LWLockWaitForVar() always waits if the
flag is not set, i.e. it will not return regardless of the variable's value,
if the current lock-holder has not updated it yet.

This passes make check, but I haven't done any testing beyond that. Does this
look sane to you?

- Heikki
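To make those semantics concrete, here is a sketch of what the waiter-side
test could look like under that scheme; only the flag name comes from the
description above, while the bit position and the helper function are
assumptions:

    #define LW_FLAG_VAR_SET     ((uint32) 1 << 23)  /* assumed bit position */

    /* Hypothetical helper: must the LWLockWaitForVar() caller keep waiting? */
    static bool
    lwlock_var_must_wait(LWLock *lock, uint64 *valptr, uint64 oldval)
    {
        uint32      state = pg_atomic_read_u32(&lock->state);

        if ((state & LW_VAL_EXCLUSIVE) == 0)
            return false;           /* lock is free: stop waiting */
        if ((state & LW_FLAG_VAR_SET) == 0)
            return true;            /* holder hasn't published a value yet */
        return (*valptr == oldval); /* unchanged value: keep waiting */
    }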
On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Both. Here's the patch.
>
> Previously, LWLockAcquireWithVar set the variable associated with the lock
> atomically with acquiring it. Before the lwlock-scalability changes, that was
> straightforward because you held the spinlock anyway, but it's a lot
> harder/expensive now. So I changed the way acquiring a lock with a variable
> works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that
> the current lock holder has updated the variable. The LWLockAcquireWithVar
> function is gone - you now just use LWLockAcquire(), which always clears the
> LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you
> want to set the variable immediately. LWLockWaitForVar() always waits if the
> flag is not set, i.e. it will not return regardless of the variable's value,
> if the current lock-holder has not updated it yet.
I ran this for a while without casserts and it seems to work. But with casserts, I get failures in the autovac process on the GIN index.
I don't see how this is related to the LWLock issue, but I didn't see it without your patch. Perhaps the system just didn't survive long enough to uncover it without the patch (although it shows up pretty quickly). It could just be an overzealous Assert, since the run with casserts off didn't show problems.
bt and bt full are shown below.
Cheers,
Jeff
#0 0x0000003dcb632625 in raise () from /lib64/libc.so.6
#1 0x0000003dcb633e05 in abort () from /lib64/libc.so.6
#2 0x0000000000930b7a in ExceptionalCondition (
conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc "FailedAssertion",
fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
#3 0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
#4 0x000000000093b53a in FunctionCall2Coll (flinfo=0x7fffee073e60, collation=0, arg1=140737186840448, arg2=0) at fmgr.c:1323
#5 0x00000000004d5c7a in index_vacuum_cleanup (info=0x7fffee073f80, stats=0x0) at indexam.c:717
#6 0x0000000000664f69 in lazy_cleanup_index (indrel=0x7faafbcace20, stats=0x0, vacrelstats=0x28809c8) at vacuumlazy.c:1400
#7 0x0000000000664142 in lazy_scan_heap (onerel=0x7faafbcab6d0, vacrelstats=0x28809c8, Irel=0x2881090, nindexes=2, scan_all=1 '\001') at vacuumlazy.c:1111
#8 0x0000000000662905 in lazy_vacuum_rel (onerel=0x7faafbcab6d0, options=65, params=0x286ea00, bstrategy=0x286ea90) at vacuumlazy.c:247
#9 0x00000000006623e4 in vacuum_rel (relid=16409, relation=0x7fffee074550, options=65, params=0x286ea00) at vacuum.c:1377
#10 0x0000000000660bea in vacuum (options=65, relation=0x7fffee074550, relid=16409, params=0x286ea00, va_cols=0x0, bstrategy=0x286ea90, isTopLevel=1 '\001')
at vacuum.c:295
#11 0x000000000075caa9 in autovacuum_do_vac_analyze (tab=0x286e9f8, bstrategy=0x286ea90) at autovacuum.c:2811
#12 0x000000000075be67 in do_autovacuum () at autovacuum.c:2331
#13 0x000000000075ac1c in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1648
#14 0x000000000075a84d in StartAutoVacWorker () at autovacuum.c:1455
#15 0x000000000076f745 in StartAutovacuumWorker () at postmaster.c:5261
#16 0x000000000076effd in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:4918
#17 <signal handler called>
#18 0x0000003dcb6e1353 in __select_nocancel () from /lib64/libc.so.6
#19 0x000000000076a8f0 in ServerLoop () at postmaster.c:1582
#20 0x000000000076a141 in PostmasterMain (argc=4, argv=0x27b2cf0) at postmaster.c:1263
#21 0x00000000006c1e6e in main (argc=4, argv=0x27b2cf0) at main.c:223
#0 0x0000003dcb632625 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000003dcb633e05 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000000930b7a in ExceptionalCondition (
conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc "FailedAssertion",
fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
No locals.
#3 0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
buffer = 186
page = 0x7faafc565b20 ""
info = 0x7fffee073f80
stats = 0x2880858
index = 0x7faafbcace20
needLock = 1 '\001'
npages = 22569
blkno = 12025
totFreePages = 11502
ginstate = {index = 0x7faafbcace20, oneCol = 1 '\001', origTupdesc = 0x7faafbcad150, tupdesc = {0x7faafbcad150, 0x0 <repeats 31 times>},
compareFn = {{fn_addr = 0x90224d <bttextcmp>, fn_oid = 360, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, extractValueFn = {{
fn_addr = 0x494adc <ginarrayextract>, fn_oid = 2743, fn_nargs = 3, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, extractQueryFn = {{
fn_addr = 0x494beb <ginqueryarrayextract>, fn_oid = 2774, fn_nargs = 7, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, consistentFn = {{
fn_addr = 0x494d67 <ginarrayconsistent>, fn_oid = 2744, fn_nargs = 8, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, triConsistentFn = {{
fn_addr = 0x494efd <ginarraytriconsistent>, fn_oid = 3920, fn_nargs = 7, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x285e3c0, fn_expr = 0x0}, {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0} <repeats 31 times>}, comparePartialFn = {{fn_addr = 0, fn_oid = 0,
fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0,
fn_expr = 0x0} <repeats 32 times>}, canPartialMatch = '\000' <repeats 31 times>, supportCollation = {100, 0 <repeats 31 times>}}
idxStat = {nPendingPages = 0, nTotalPages = 0, nEntryPages = 522, nDataPages = 0, nEntries = 11000, ginVersion = 0}
On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 06/30/2015 11:24 PM, Andres Freund wrote:
>> Thanks! Do you mean both or "just" the second issue?
>
> Both. Here's the patch.
>
> Previously, LWLockAcquireWithVar set the variable associated with the lock
> atomically with acquiring it. Before the lwlock-scalability changes, that was
> straightforward because you held the spinlock anyway, but it's a lot
> harder/expensive now. So I changed the way acquiring a lock with a variable
> works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that
> the current lock holder has updated the variable. The LWLockAcquireWithVar
> function is gone - you now just use LWLockAcquire(), which always clears the
> LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you
> want to set the variable immediately. LWLockWaitForVar() always waits if the
> flag is not set, i.e. it will not return regardless of the variable's value,
> if the current lock-holder has not updated it yet.
>
> This passes make check, but I haven't done any testing beyond that. Does
> this look sane to you?
After applying this patch to commit fdf28853ae6a397497b79f, it has survived testing long enough to convince me that this fixes the problem.
Cheers,
Jeff
On Thu, Jul 16, 2015 at 12:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Both. Here's the patch.
>
> I ran this for a while without casserts and it seems to work.  But with
> casserts, I get failures in the autovac process on the GIN index.
>
> I don't see how this is related to the LWLock issue, but I didn't see it
> without your patch.  Perhaps the system just didn't survive long enough to
> uncover it without the patch (although it shows up pretty quickly).  It
> could just be an overzealous Assert, since the run with casserts off didn't
> show problems.
>
> bt and bt full are shown below.
>
> #0  0x0000003dcb632625 in raise () from /lib64/libc.so.6
> #1  0x0000003dcb633e05 in abort () from /lib64/libc.so.6
> #2  0x0000000000930b7a in ExceptionalCondition (
>     conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc "FailedAssertion",
>     fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
> #3  0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at ginvacuum.c:713
It now looks like this *is* unrelated to the LWLock issue. The assert that it is tripping over was added just recently (302ac7f27197855afa8c), so I had not been testing with it present until now. It looks like it is finding all-zero pages (index extended but then a crash before initializing the page?) and it doesn't like them.
(gdb) f 3
(gdb) p *(char[8192]*)(page)
$11 = '\000' <repeats 8191 times>
Presumably before this assert, such pages would just be permanently orphaned.
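If that's what is happening, the fix would presumably be to handle such pages
explicitly instead of letting the special-space assertion fire; a minimal
sketch (the placement inside ginvacuumcleanup()'s block loop and the exact
handling are assumptions):

    /* Sketch: skip pages that were allocated by a relation extension but
     * never initialized (e.g. a crash before the init hit disk), rather than
     * looking at their special space. */
    page = BufferGetPage(buffer);
    if (PageIsNew(page))
    {
        UnlockReleaseBuffer(buffer);
        continue;               /* nothing to count; could also be recycled */
    }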
Cheers,
Jeff
On 2015-07-16 PM 04:03, Jeff Janes wrote:
> On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Both. Here's the patch.
>
> I ran this for a while without casserts and it seems to work.  But with
> casserts, I get failures in the autovac process on the GIN index.
>
> I don't see how this is related to the LWLock issue, but I didn't see it
> without your patch.  Perhaps the system just didn't survive long enough to
> uncover it without the patch (although it shows up pretty quickly).  It
> could just be an overzealous Assert, since the run with casserts off didn't
> show problems.
>
> bt and bt full are shown below.

I got a similar assert failure but with a btree index
(pg_attribute_relid_attnum_index). The backtrace looks like Jeff's:

(gdb) bt
#0  0x0000003969632625 in raise () from /lib64/libc.so.6
#1  0x0000003969633e05 in abort () from /lib64/libc.so.6
#2  0x000000000092eb9e in ExceptionalCondition (conditionName=0x9c2220 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
    errorType=0x9c0c41 "FailedAssertion", fileName=0x9c0c10 "nbtree.c", lineNumber=903) at assert.c:54
#3  0x00000000004e02d8 in btvacuumpage (vstate=0x7fff2c7655f0, blkno=9, orig_blkno=9) at nbtree.c:903
#4  0x00000000004e0067 in btvacuumscan (info=0x7fff2c765cd0, stats=0x279f7d0, callback=0x668f6d <lazy_tid_reaped>, callback_state=0x279e338, cycleid=49190) at nbtree.c:821
#5  0x00000000004dfdde in btbulkdelete (fcinfo=0x7fff2c7657d0) at nbtree.c:676
#6  0x0000000000939769 in FunctionCall4Coll (flinfo=0x7fff2c765bb0, collation=0, arg1=140733939342544, arg2=0, arg3=6721389, arg4=41542456) at fmgr.c:1375
#7  0x00000000004d2a01 in index_bulk_delete (info=0x7fff2c765cd0, stats=0x0, callback=0x668f6d <lazy_tid_reaped>, callback_state=0x279e338) at indexam.c:690
#8  0x000000000066861d in lazy_vacuum_index (indrel=0x7fd40ab846f0, stats=0x279e770, vacrelstats=0x279e338) at vacuumlazy.c:1367
#9  0x00000000006678a8 in lazy_scan_heap (onerel=0x274ec90, vacrelstats=0x279e338, Irel=0x279e790, nindexes=2, scan_all=0 '\000') at vacuumlazy.c:1098
#10 0x00000000006660f7 in lazy_vacuum_rel (onerel=0x274ec90, options=99, params=0x27bdc88, bstrategy=0x27bdd18) at vacuumlazy.c:244
#11 0x0000000000665c1a in vacuum_rel (relid=1249, relation=0x7fff2c7662a0, options=99, params=0x27bdc88) at vacuum.c:1388
#12 0x00000000006643ce in vacuum (options=99, relation=0x7fff2c7662a0, relid=1249, params=0x27bdc88, va_cols=0x0, bstrategy=0x27bdd18, isTopLevel=1 '\001') at vacuum.c:293
#13 0x000000000075d23c in autovacuum_do_vac_analyze (tab=0x27bdc80, bstrategy=0x27bdd18) at autovacuum.c:2807
#14 0x000000000075c632 in do_autovacuum () at autovacuum.c:2328
#15 0x000000000075b457 in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1647
#16 0x000000000075b0a5 in StartAutoVacWorker () at autovacuum.c:1454
#17 0x000000000076f9cc in StartAutovacuumWorker () at postmaster.c:5261
#18 0x000000000076f28a in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:4918
#19 <signal handler called>
#20 0x00000039696e1353 in __select_nocancel () from /lib64/libc.so.6
#21 0x000000000076ace7 in ServerLoop () at postmaster.c:1582
#22 0x000000000076a538 in PostmasterMain (argc=3, argv=0x26f9330) at postmaster.c:1263
#23 0x00000000006c1c2e in main (argc=3, argv=0x26f9330) at main.c:223

Thanks,
Amit
Hi,

On 2015-07-19 11:49:14 -0700, Jeff Janes wrote:
> After applying this patch to commit fdf28853ae6a397497b79f, it has survived
> testing long enough to convince me that this fixes the problem.

What was the actual workload breaking with the bug? I ran a small variety and
I couldn't reproduce it yet. I'm not saying there's no bug, I just would like
to be able to test my version of the fixes...

Andres
On Tue, Jul 28, 2015 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2015-07-19 11:49:14 -0700, Jeff Janes wrote:
>> After applying this patch to commit fdf28853ae6a397497b79f, it has survived
>> testing long enough to convince me that this fixes the problem.
>
> What was the actual workload breaking with the bug? I ran a small variety
> and I couldn't reproduce it yet. I'm not saying there's no bug, I just would
> like to be able to test my version of the fixes...
It was the torn-page fault-injection code here:
It is not a minimal set; I don't know if all parts of this are necessary to reproduce it. The whole crash-recovery cycling might not even be important.
Compiled with:
./configure --enable-debug --with-libxml --with-perl --with-python --with-ldap --with-openssl --with-gssapi --prefix=/home/jjanes/pgsql/torn_bisect/
(Also with or without --enable-cassert).
I just ran "sh do.sh >& do.out" and eventually it stopped producing output, and I find everything hung up.
Cheers,
Jeff
Hi,

Finally getting to this.

On 2015-07-15 18:44:03 +0300, Heikki Linnakangas wrote:
> Previously, LWLockAcquireWithVar set the variable associated with the lock
> atomically with acquiring it. Before the lwlock-scalability changes, that
> was straightforward because you held the spinlock anyway, but it's a lot
> harder/expensive now. So I changed the way acquiring a lock with a variable
> works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that
> the current lock holder has updated the variable. The LWLockAcquireWithVar
> function is gone - you now just use LWLockAcquire(), which always clears the
> LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if you
> want to set the variable immediately.
>
> This passes make check, but I haven't done any testing beyond that. Does
> this look sane to you?

The prime thing I dislike about the patch is how long it now holds the
spinlock in WaitForVar. I don't understand why that's necessary? There's no
need to hold a spinlock until after the

    mustwait = (pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE) != 0;

unless I miss something?

In an earlier email you say:

> After the spinlock is released above, but before the LWLockQueueSelf() call,
> it's possible that another backend comes in, acquires the lock, changes the
> variable's value, and releases the lock again. In 9.4, the spinlock was not
> released until the process was queued.

But that's not a problem. The updater in that case will have queued a wakeup
for all waiters, including WaitForVar()?

I'll try to reproduce the problem now. But I do wonder if it's possibly just
the missing spinlock during the update.

Andres
On 07/29/2015 02:39 PM, Andres Freund wrote:
> The prime thing I dislike about the patch is how long it now holds the
> spinlock in WaitForVar. I don't understand why that's necessary? There's
> no need to hold a spinlock until after the
>     mustwait = (pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE) != 0;
> unless I miss something?
>
> In an earlier email you say:
>> After the spinlock is released above, but before the LWLockQueueSelf() call,
>> it's possible that another backend comes in, acquires the lock, changes the
>> variable's value, and releases the lock again. In 9.4, the spinlock was not
>> released until the process was queued.
>
> But that's not a problem. The updater in that case will have queued a
> wakeup for all waiters, including WaitForVar()?

If you release the spinlock before LWLockQueueSelf(), then no. It's possible
that the backend you wanted to wait for updates the variable's value before
you've queued up. Or even releases the lock, and it gets re-acquired by
another backend, before you've queued up.

Or are you thinking of just moving the SpinLockRelease to after
LWLockQueueSelf(), i.e. to the other side of the "mustwait = ..." line?
That'd probably be ok.

- Heikki
On 2015-07-29 14:55:54 +0300, Heikki Linnakangas wrote:
> On 07/29/2015 02:39 PM, Andres Freund wrote:
>> In an earlier email you say:
>>> After the spinlock is released above, but before the LWLockQueueSelf() call,
>>> it's possible that another backend comes in, acquires the lock, changes the
>>> variable's value, and releases the lock again. In 9.4, the spinlock was not
>>> released until the process was queued.
>>
>> But that's not a problem. The updater in that case will have queued a
>> wakeup for all waiters, including WaitForVar()?
>
> If you release the spinlock before LWLockQueueSelf(), then no. It's possible
> that the backend you wanted to wait for updates the variable's value before
> you've queued up. Or even releases the lock, and it gets re-acquired by
> another backend, before you've queued up.

For normal locks the equivalent problem is solved by re-checking whether a
conflicting lock is still held after enqueuing. Why don't we do the same here?
Doing it that way has the big advantage that we can just remove the spinlocks
entirely on platforms with atomic 64bit reads/writes.

Greetings,

Andres Freund
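A sketch of what that could look like on such platforms, assuming the per-lock
variable were redeclared as a pg_atomic_uint64 (in 9.5 it is a plain uint64
guarded by the lock's spinlock):

    /* Sketch only: with the variable stored as a pg_atomic_uint64, the
     * waiter's comparison needs no spinlock at all. */
    uint64      value = pg_atomic_read_u64(valptr); /* valptr: pg_atomic_uint64 * */
    bool        mustwait = (value == oldval);       /* unchanged -> keep waiting */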
On 07/29/2015 03:08 PM, Andres Freund wrote:
> For normal locks the equivalent problem is solved by re-checking whether
> a conflicting lock is still held after enqueuing. Why don't we do the
> same here? Doing it that way has the big advantage that we can just
> remove the spinlocks entirely on platforms with atomic 64bit
> reads/writes.

Ah, ok, that should work, as long as you also re-check the variable's value
after queueing. Want to write the patch, or should I?

- Heikki
On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote:
> Ah, ok, that should work, as long as you also re-check the variable's value
> after queueing. Want to write the patch, or should I?

I'll try. Shouldn't be too hard.

Andres
On 2015-07-29 14:22:23 +0200, Andres Freund wrote:
> On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote:
>> Ah, ok, that should work, as long as you also re-check the variable's value
>> after queueing. Want to write the patch, or should I?
>
> I'll try. Shouldn't be too hard.

What do you think about something roughly like the attached?

- Andres
On Tue, Jul 28, 2015 at 9:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 28, 2015 at 7:06 AM, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2015-07-19 11:49:14 -0700, Jeff Janes wrote:
>>> After applying this patch to commit fdf28853ae6a397497b79f, it has survived
>>> testing long enough to convince me that this fixes the problem.
>>
>> What was the actual workload breaking with the bug? I ran a small variety
>> and I couldn't reproduce it yet. I'm not saying there's no bug, I just
>> would like to be able to test my version of the fixes...
>
> It was the torn-page fault-injection code here:
>
> It is not a minimal set; I don't know if all parts of this are necessary to
> reproduce it. The whole crash-recovery cycling might not even be important.
I've reproduced it again against commit b2ed8edeecd715c8a23ae462.
It took 5 hours on an 8-core "Intel(R) Xeon(R) CPU E5-2650".
I also reproduced it in 3 hours on the same machine with both JJ_torn_page and JJ_xid set to zero (i.e. turned off, no induced crashes), so the fault-injection patch should not be necessary to get the issue.
Cheers,
Jeff
On 2015-07-29 09:23:32 -0700, Jeff Janes wrote:
> On Tue, Jul 28, 2015 at 9:06 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I've reproduced it again against commit b2ed8edeecd715c8a23ae462.
>
> It took 5 hours on an 8-core "Intel(R) Xeon(R) CPU E5-2650".
>
> I also reproduced it in 3 hours on the same machine with both JJ_torn_page
> and JJ_xid set to zero (i.e. turned off, no induced crashes), so the
> fault-injection patch should not be necessary to get the issue.

Hm, that's a single socket or dual socket E5-2650 system? That CPU itself has
8 cores, and can work in a dual socket system, that's why I ask.
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Jul 29, 2015 at 9:26 AM, Andres Freund <span dir="ltr"><<ahref="mailto:andres@anarazel.de" target="_blank">andres@anarazel.de</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2015-07-29 09:23:32-0700, Jeff Janes wrote:<br /><span class="">> On Tue, Jul 28, 2015 at 9:06 AM, Jeff Janes <<a href="mailto:jeff.janes@gmail.com">jeff.janes@gmail.com</a>>wrote:<br /> > I've reproduced it again against commitb2ed8edeecd715c8a23ae462.<br /> ><br /> > It took 5 hours on a 8 core "Intel(R) Xeon(R) CPU E5-2650".<br /> ><br/> > I also reproduced it in 3 hours on the same machine with both JJ_torn_page<br /> > and JJ_xid set to zero(i.e. turned off, no induced crashes), so the<br /> > fault-injection patch should not be necessary to get the issue..<br/><br /></span>Hm, that's a single socket or dual socket E5-2650 system? That CPU<br /> itself has 8 cores, andcan work in a dual socket system, that's why I<br /> ask.<br /></blockquote></div><br /></div><div class="gmail_extra">Itis a virtual machine, and I think a property of the VM software is that any given virtual machine canonly run on one socket of the underlying hardware. The real machine is dual socket, though.</div><div class="gmail_extra"><br/></div><div class="gmail_extra">Cheers,</div><div class="gmail_extra"><br /></div><div class="gmail_extra">Jeff</div></div>
On 07/29/2015 04:10 PM, Andres Freund wrote:
> On 2015-07-29 14:22:23 +0200, Andres Freund wrote:
>> On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote:
>>> Ah, ok, that should work, as long as you also re-check the variable's value
>>> after queueing. Want to write the patch, or should I?
>>
>> I'll try. Shouldn't be too hard.
>
> What do you think about something roughly like the attached?

Hmm. Imagine this:

Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on
it. The lock holder releases the lock, and wakes up A. But before A wakes up
and sees that the lock is free, another backend acquires the lock again. It
runs LWLockAcquireWithVar to the point just before setting the variable's
value. Now A wakes up, sees that the lock is still (or again) held, and that
the variable's value still matches the old one, and goes back to sleep. The
new lock holder won't wake it up until it updates the value again or
releases the lock.

I think that's OK given the current usage of this in WALInsertLocks, but
it's scary. At the very least you need comments to explain that race
condition.

You didn't like the new LW_FLAG_VAR_SET flag and the API changes I
proposed? That eliminates the race condition. Either way, there is a race
condition that if the new lock holder sets the variable to the *same* value
as before, the old waiter won't necessarily wake up even though the lock was
free or had a different value in between. But that's easier to explain and
understand than the fact that the value set by LWLockAcquireWithVar() might
not be seen by a waiter until you release the lock or update it again.

Another idea is to have LWLockAcquire check the wait queue after setting
the variable, and wake up anyone who's already queued up. You could just
call LWLockUpdateVar() from LWLockAcquireCommon to set the variable. I would
prefer the approach from my previous patch more, though. That would avoid
having to acquire the spinlock in LWLockAcquire altogether, and I actually
like the API change from that independently of any correctness issues.

The changes in LWLockWaitForVar() and LWLockConflictsWithVar() seem OK in
principle. You'll want to change LWLockConflictsWithVar() so that the
spinlock is held only over the "value = *valptr" line, though; the other
stuff just modifies local variables and doesn't need to be protected by the
spinlock. Also, when you enter LWLockWaitForVar(), you're checking if the
lock is held twice in a row; first at the top of the function, and again
inside LWLockConflictsWithVar. You could just remove the "quick test" at
the top.

- Heikki
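(To make that last suggestion concrete: narrowing the spinlock to just the read of the protected variable could look roughly like the sketch below. This is an illustration only — the signature and surrounding structure are simplified guesses at the lwlock.c code, not the committed change.)

static bool
LWLockConflictsWithVar(LWLock *lock, uint64 *valptr,
					   uint64 oldval, uint64 *newval)
{
	uint64		value;

	/* If the lock is not held at all, there is nothing to wait for. */
	if (!(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE))
		return false;

	/*
	 * Hold the spinlock only around the read of the protected variable;
	 * everything else below touches only local state.
	 */
	SpinLockAcquire(&lock->mutex);
	value = *valptr;
	SpinLockRelease(&lock->mutex);

	if (value != oldval)
	{
		/* The variable has already changed; report it and don't wait. */
		*newval = value;
		return false;
	}

	return true;				/* still the old value: caller must wait */
}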
On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote:
> Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on
> it. The lock holder releases the lock, and wakes up A. But before A wakes up
> and sees that the lock is free, another backend acquires the lock again. It
> runs LWLockAcquireWithVar to the point just before setting the variable's
> value. Now A wakes up, sees that the lock is still (or again) held, and that
> the variable's value still matches the old one, and goes back to sleep. The
> new lock holder won't wake it up until it updates the value again or
> releases the lock.

I'm not sure whether this observation is about my patch or the general
lwlock variable mechanism. In my opinion that behaviour exists today both
in 9.4 and 9.5. But I think that's fine because that "race" seems pretty
fundamental. After all, you could have called LWLockWaitForVar() just after
the second locker had set the variable to the same value.

> You didn't like the new LW_FLAG_VAR_SET flag and the API changes I
> proposed?

I do slightly dislike the additional bit of math in AttemptLock. But what I
really disliked about the patch is the reliance on holding the spinlock over
longer times and conditionally acquiring spinlocks. I didn't see a need for
the flags change to fix the issue at hand, that's why I didn't incorporate
it. I'm not fundamentally against it.

> Either way, there is a race condition that if the new lock holder sets the
> variable to the *same* value as before, the old waiter won't necessarily
> wake up even though the lock was free or had a different value in between.
> But that's easier to explain and understand than the fact that the value set
> by LWLockAcquireWithVar() might not be seen by a waiter until you release
> the lock or update it again.

I can't really see a use for the API that'd allow that and care about the
waits. Because quite fundamentally you could just have started waiting
after the variable was set to the value at the same time.

> Another idea is to have LWLockAcquire check the wait queue after setting the
> variable, and wake up anyone who's already queued up.

Ugh, not a fan of that.

> The changes in LWLockWaitForVar() and LWLockConflictsWithVar() seem OK in
> principle. You'll want to change LWLockConflictsWithVar() so that the
> spinlock is held only over the "value = *valptr" line, though; the other
> stuff just modifies local variables and doesn't need to be protected by the
> spinlock.

Good point.

> Also, when you enter LWLockWaitForVar(), you're checking if the
> lock is held twice in a row; first at the top of the function, and again
> inside LWLockConflictsWithVar. You could just remove the "quick test" at
> the top.

Yea, I was thinking about removing it. The first check was there
previously, so I left it in place. We do execute a bit more code once we've
disabled interrupts (extraWaits, re-enabling interrupts). I don't think
it'll matter much, but it seemed like an independent thing.

Greetings,

Andres Freund
On 07/29/2015 09:35 PM, Andres Freund wrote:
> On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote:
>> Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on
>> it. The lock holder releases the lock, and wakes up A. But before A wakes up
>> and sees that the lock is free, another backend acquires the lock again. It
>> runs LWLockAcquireWithVar to the point just before setting the variable's
>> value. Now A wakes up, sees that the lock is still (or again) held, and that
>> the variable's value still matches the old one, and goes back to sleep. The
>> new lock holder won't wake it up until it updates the value again or
>> releases the lock.
>
> I'm not sure whether this observation is about my patch or the general
> lwlock variable mechanism. In my opinion that behaviour exists today
> both in 9.4 and 9.5.

In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held,
until it has updated the variable. And LWLockWaitForVar() holds the
spinlock when it checks that the lock is held and that the variable's value
matches. So it cannot happen on 9.4.

To reiterate, with 9.5, it's possible that a backend is sleeping in
LWLockWaitForVar(oldvar=123), even though the lock is currently held by
another backend with value 124. That seems wrong, or surprising at the very
least.

> But I think that's fine because that "race" seems pretty
> fundamental. After all, you could have called LWLockWaitForVar() just
> after the second locker had set the variable to the same value.

I'm not talking about setting it to the same value. I'm talking about
setting it to a different value. (I talked about setting it to the same
value later in the email, and I agree that's a pretty fundamental problem
and exists with 9.4 as well.)

- Heikki
On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-29 14:22:23 +0200, Andres Freund wrote:
> On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote:
> > Ah, ok, that should work, as long as you also re-check the variable's value
> > after queueing. Want to write the patch, or should I?
>
> I'll try. Shouldn't be too hard.
What do you think about something roughly like the attached?
I've not evaluated the code, but applying it does solve the problem I was seeing.
Cheers,
Jeff
On 2015-07-30 17:36:52 +0300, Heikki Linnakangas wrote:
> In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held,
> until it has updated the variable. And LWLockWaitForVar() holds the
> spinlock when it checks that the lock is held and that the variable's value
> matches. So it cannot happen on 9.4.

The first paragraph talked about "the same value", but that was just
referring to it not yet having been cleared, I think...

> To reiterate, with 9.5, it's possible that a backend is sleeping in
> LWLockWaitForVar(oldvar=123), even though the lock is currently held by
> another backend with value 124. That seems wrong, or surprising at the very
> least.

With my patch that can't really happen that way though? The value is
re-checked after queueing. If it has changed by then we're done. And if it
hasn't yet changed we're guaranteed to be woken up once it's being changed?

I generally don't mind adding some sort of flag clearing or such, but I'd
rather not have it in the retry loop in the general LWLockAttemptLock path
- I found that very small amounts of instructions in there have a
measurable impact.

Greetings,

Andres Freund
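(In code terms, the re-check-after-queueing pattern described above amounts to a loop in LWLockWaitForVar() along these lines. This is heavily condensed — the extraWaits bookkeeping, interrupt handling, and memory barriers are omitted, and while the helper names come from lwlock.c, their exact use here is an approximation rather than the actual patch.)

	for (;;)
	{
		/* First check: is the lock free, or has the value moved on? */
		if (!LWLockConflictsWithVar(lock, valptr, oldval, newval))
			break;

		/* Join the wait queue in "wake me when it's free" mode... */
		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);

		/*
		 * ...and re-check before sleeping.  If the holder updated the
		 * variable (or released the lock) between the first check and
		 * queueing, we would otherwise sleep with nobody left to wake us.
		 */
		if (!LWLockConflictsWithVar(lock, valptr, oldval, newval))
		{
			LWLockDequeueSelf(lock);
			break;
		}

		/* Sleep until LWLockUpdateVar() or LWLockRelease() wakes us. */
		PGSemaphoreLock(&MyProc->sem);
	}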
On 2015-07-30 09:03:01 -0700, Jeff Janes wrote:
> On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:
> > What do you think about something roughly like the attached?
>
> I've not evaluated the code, but applying it does solve the problem I was
> seeing.

Cool, thanks for testing! How long did you run the test and how long did
it, on average, previously take for the issue to occur?

Regards,

Andres
On 07/30/2015 09:14 PM, Andres Freund wrote:
> On 2015-07-30 17:36:52 +0300, Heikki Linnakangas wrote:
>> In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held,
>> until it has updated the variable. And LWLockWaitForVar() holds the
>> spinlock when it checks that the lock is held and that the variable's value
>> matches. So it cannot happen on 9.4.
>
> The first paragraph talked about "the same value", but that was just
> referring to it not yet having been cleared, I think...
>
>> To reiterate, with 9.5, it's possible that a backend is sleeping in
>> LWLockWaitForVar(oldvar=123), even though the lock is currently held by
>> another backend with value 124. That seems wrong, or surprising at the very
>> least.
>
> With my patch that can't really happen that way though? The value is
> re-checked after queueing. If it has changed by then we're done. And if it
> hasn't yet changed we're guaranteed to be woken up once it's being changed?

Ok, let me try explaining it again. With your patch, LWLockAcquireWithVar
looks like this (simplified, assuming the lock is free):

1. Set LW_VAL_EXCLUSIVE
2. (Acquire spinlock) Set *valptr = val (Release spinlock)

LWLockWaitForVar looks like this:

1. Check if lock is free
2. (Acquire spinlock) Read *valptr, compare with oldval (Release spinlock)
3. LWLockQueueSelf
4. Re-check that *valptr is still equal to oldval
5. Sleep.

This is the race condition:

Backend A                       Backend B
---------                       ---------
LWLockAcquireWithVar(123)
  Set LW_VAL_EXCLUSIVE
  Set *valptr = 123
                                LWLockWaitForVar(123)
                                  See that the lock is held
                                  Read *valptr, it matches
                                  LWLockQueueSelf
LWLockRelease()
LWLockAcquireWithVar(124)
  Set LW_VAL_EXCLUSIVE
                                wakes up
                                  See that the lock is still/again held
                                  Read *valptr, it's still 123
                                  LWLockQueueSelf
                                  Re-check that *valptr is still 123
                                  go to sleep
  Set *valptr = 124

Now, Backend B's LWLockWaitForVar(123) call is sleeping, even though the
lock was free'd and reacquired with a different value, 124. It won't wake
up until A updates the value or releases the lock again.

This was not possible in 9.4. It was possible in 9.4 too when the value was
the same in both LWLockAcquireWithVar() calls, and I think that's
acceptable, but not when the values differ.

> I generally don't mind adding some sort of flag clearing or such, but I'd
> rather not have it in the retry loop in the general LWLockAttemptLock
> path - I found that very small amounts of instructions in there have a
> measurable impact.

I doubt clearing a bit would matter there, although you might have a better
instinct on that...

- Heikki
On Thu, Jul 30, 2015 at 8:22 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-30 09:03:01 -0700, Jeff Janes wrote:
> On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:
> > What do you think about something roughly like the attached?
> >
>
> I've not evaluated the code, but applying it does solve the problem I was
> seeing.
Cool, thanks for testing! How long did you run the test and how long did
it, on average, previously take for the issue to occur?
I had run it for 24 hours, while it usually took less than 8 hours to lock up before. I did see it within a few minutes one time when I checked out a new HEAD and then forgot to re-apply your or Heikki's patch.
But now I've got the same situation again, after 15 hours, with your patch. This is probably all down to luck. The only differences that I can think of are that I advanced the base to e8e86fbc8b3619da54c, and turned on JJ_vac and set log_autovacuum_min_duration=0.
checkpointer, 31516
(gdb) bt
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd3c5f8) at pg_sema.c:387
#2 0x00000000006a5754 in LWLockAcquireCommon (l=0x7f84c42edf80, valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:1041
#3 LWLockAcquireWithVar (l=0x7f84c42edf80, valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:915
#4 0x00000000004cbd86 in WALInsertLockAcquireExclusive () at xlog.c:1445
#5 0x00000000004d4195 in CreateCheckPoint (flags=64) at xlog.c:8181
#6 0x000000000064ea53 in CheckpointerMain () at checkpointer.c:493
#7 0x00000000004e328a in AuxiliaryProcessMain (argc=2, argv=0x7fff6eed1f50) at bootstrap.c:428
#8 0x00000000006563f7 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5167
#9 0x000000000065884a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2670
(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x7f84ccd27c88, sem = {semId = 1480359975, semNum = 12}, waitStatus = 0, procLatch = {is_set = 1,
is_shared = 1 '\001', owner_pid = 31540}, lxid = 3209, pid = 31540, pgprocno = 92, backendId = 12, databaseId = 16384, roleId = 10,
recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 2 '\002', lwWaitLink = {prev = 0x7f84c42edf10, next = 0x7f84ccd38e60},
waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
myProcLocks = {{prev = 0x7f84ccd38bb8, next = 0x7f84ccd38bb8}, {prev = 0x7f84ccd38bc8, next = 0x7f84ccd38bc8}, {prev = 0x7f84ccd38bd8,
next = 0x7f84ccd38bd8}, {prev = 0x7f84ccd38be8, next = 0x7f84ccd38be8}, {prev = 0x7f84ccd38bf8, next = 0x7f84ccd38bf8}, {prev = 0x7f84ccd38c08,
next = 0x7f84ccd38c08}, {prev = 0x7f84ccd38c18, next = 0x7f84ccd38c18}, {prev = 0x7f84ccd38c28, next = 0x7f84ccd38c28}, {prev = 0x7f84ccd38c38,
next = 0x7f84ccd38c38}, {prev = 0x7f84ccd38c48, next = 0x7f84ccd38c48}, {prev = 0x7f84ccd38c58, next = 0x7f84ccd38c58}, {prev = 0x7f84ccd38c68,
next = 0x7f84ccd38c68}, {prev = 0x7f84ccd38c78, next = 0x7f84ccd38c78}, {prev = 0x7f84ccd38c88, next = 0x7f84ccd38c88}, {prev = 0x7f84ccb0ceb0,
next = 0x7f84ccb0ceb0}, {prev = 0x7f84ccd38ca8, next = 0x7f84ccd38ca8}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock = 0x7f84c42e9800,
fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, 0, 0, 0, 0, 0, 2679, 2610, 16413, 16411, 16415, 16414, 16408}, fpVXIDLock = 1 '\001',
fpLocalTransactionId = 3209}
backend 31540:
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd38b38) at pg_sema.c:387
#2 0x00000000006a53d4 in LWLockWaitForVar (lock=0x7f84c42edf00, valptr=0x7f84c42edf28, oldval=0, newval=0x7fff6eeceaf8) at lwlock.c:1438
#3 0x00000000004d3276 in WaitXLogInsertionsToFinish (upto=183341449216) at xlog.c:1576
#4 0x00000000004d3b33 in AdvanceXLInsertBuffer (upto=183345636352, opportunistic=<value optimized out>) at xlog.c:1888
#5 0x00000000004d8df2 in GetXLogBuffer (ptr=183345636352) at xlog.c:1669
#6 0x00000000004d8fea in CopyXLogRecordToWAL (rdata=0xbd2410, fpw_lsn=<value optimized out>) at xlog.c:1276
#7 XLogInsertRecord (rdata=0xbd2410, fpw_lsn=<value optimized out>) at xlog.c:1008
#8 0x00000000004dc72c in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453
#9 0x0000000000478f41 in ginPlaceToPage (btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=<value optimized out>, updateblkno=<value optimized out>,
childbuf=0, buildStats=0x0) at ginbtree.c:418
#10 0x00000000004798a3 in ginInsertValue (btree=0x7fff6eecee20, stack=0x1be11e0, insertdata=0x7fff6eeceea0, buildStats=0x0) at ginbtree.c:748
#11 0x00000000004734aa in ginEntryInsert (ginstate=0x7fff6eecf200, attnum=39408, key=1, category=<value optimized out>, items=0x1c81c20, nitem=54,
buildStats=0x0) at gininsert.c:234
#12 0x000000000047edbb in ginInsertCleanup (ginstate=0x7fff6eecf200, vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
#13 0x000000000047fae6 in ginHeapTupleFastInsert (ginstate=0x7fff6eecf200, collector=<value optimized out>) at ginfast.c:436
#14 0x000000000047386a in gininsert (fcinfo=<value optimized out>) at gininsert.c:531
...
(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x0, sem = {semId = 1480425513, semNum = 0}, waitStatus = 0, procLatch = {is_set = 0,
is_shared = 1 '\001', owner_pid = 31516}, lxid = 0, pid = 31516, pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0,
recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0 '\000', lwWaitLink = {prev = 0x7f84ccd3cc10, next = 0x7f84ccd3a5e0},
waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
myProcLocks = {{prev = 0x7f84ccd3c678, next = 0x7f84ccd3c678}, {prev = 0x7f84ccd3c688, next = 0x7f84ccd3c688}, {prev = 0x7f84ccd3c698,
next = 0x7f84ccd3c698}, {prev = 0x7f84ccd3c6a8, next = 0x7f84ccd3c6a8}, {prev = 0x7f84ccd3c6b8, next = 0x7f84ccd3c6b8}, {prev = 0x7f84ccd3c6c8,
next = 0x7f84ccd3c6c8}, {prev = 0x7f84ccd3c6d8, next = 0x7f84ccd3c6d8}, {prev = 0x7f84ccd3c6e8, next = 0x7f84ccd3c6e8}, {prev = 0x7f84ccd3c6f8,
next = 0x7f84ccd3c6f8}, {prev = 0x7f84ccd3c708, next = 0x7f84ccd3c708}, {prev = 0x7f84ccd3c718, next = 0x7f84ccd3c718}, {prev = 0x7f84ccd3c728,
next = 0x7f84ccd3c728}, {prev = 0x7f84ccd3c738, next = 0x7f84ccd3c738}, {prev = 0x7f84ccd3c748, next = 0x7f84ccd3c748}, {prev = 0x7f84ccd3c758,
next = 0x7f84ccd3c758}, {prev = 0x7f84ccd3c768, next = 0x7f84ccd3c768}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock = 0x7f84c42e9d00,
fpLockBits = 0, fpRelId = {0 <repeats 16 times>}, fpVXIDLock = 0 '\000', fpLocalTransactionId = 0}
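(For readers decoding the two PGPROC dumps above: the relevant fields are declared roughly as in the trimmed 9.5-era proc.h excerpt below, and lwWaitMode follows the LWLockMode enum, i.e. 0 = LW_EXCLUSIVE, 1 = LW_SHARED, 2 = LW_WAIT_UNTIL_FREE. So the checkpointer, pid 31516, is queued for an exclusive acquisition of an insert lock last owned by backend 31540, while backend 31540 is queued in wait-until-free mode on a lock last owned by the checkpointer — the mutual wait being diagnosed here.)

struct PGPROC
{
	/* ... many fields trimmed ... */
	bool		lwWaiting;		/* true if waiting for an LW lock */
	uint8		lwWaitMode;		/* lwlock mode being waited for */
	dlist_node	lwWaitLink;		/* position in LW lock wait list */
	/* ... */
};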
Cheers,
Jeff
Hi Jeff, Heikki,

On 2015-07-31 09:48:28 -0700, Jeff Janes wrote:
> I had run it for 24 hours, while it usually took less than 8 hours to lock
> up before. I did see it within a few minutes one time when I checked out a
> new HEAD and then forgot to re-apply your or Heikki's patch.
>
> But now I've got the same situation again, after 15 hours, with your
> patch. This is probably all down to luck. The only differences that I can
> think of are that I advanced the base to e8e86fbc8b3619da54c, and turned
> on JJ_vac and set log_autovacuum_min_duration=0.

It's quite possible that you hit the remaining race-condition that Heikki
was talking about. So that'd make it actually likely to be hit slightly
later - but as you say this is just a game of chance.

I've attached a version of the patch that should address Heikki's concern.
It imo also improves the API and increases debuggability by not having
stale variable values in the variables anymore. (Also attached is a minor
optimization that Heikki has noticed.)

I plan to commit the patch tomorrow, so it's included in alpha2.

Regards,

Andres
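(The attachment itself isn't reproduced in the archive, but the API shape described above — don't set the variable at acquire time, publish it separately, and clear it at release so no stale value survives — would make the xlog.c call sites look roughly like the sketch below. This is a paraphrase of the description, not the patch; in particular LWLockReleaseClearVar is assumed to be the new release-and-clear entry point, and the real functions are static and also handle insert-lock selection.)

static void
WALInsertLockAcquire(void)
{
	/* Plain acquisition; insertingAt is not touched here anymore. */
	LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
}

static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
{
	/* Publish our position and wake any LWLockWaitForVar() waiters. */
	LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
					&WALInsertLocks[MyLockNo].l.insertingAt,
					insertingAt);
}

static void
WALInsertLockRelease(void)
{
	/* Release and reset the variable, so no stale value survives. */
	LWLockReleaseClearVar(&WALInsertLocks[MyLockNo].l.lock,
						  &WALInsertLocks[MyLockNo].l.insertingAt,
						  0);
}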
On 2015-08-02 17:04:07 +0200, Andres Freund wrote:
> I've attached a version of the patch that should address Heikki's
> concern. It imo also improves the API and increases debuggability by not
> having stale variable values in the variables anymore. (Also attached is
> a minor optimization that Heikki has noticed.)

...
Attachment
Andres Freund <andres@anarazel.de> writes:
> I plan to commit the patch tomorrow, so it's included in alpha2.

Please try to commit anything you want in alpha2 *today*. I'd prefer to
see some successful buildfarm cycles on such patches before we wrap.

			regards, tom lane
On 2015-08-02 12:33:06 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I plan to commit the patch tomorrow, so it's included in alpha2.
>
> Please try to commit anything you want in alpha2 *today*. I'd prefer to
> see some successful buildfarm cycles on such patches before we wrap.

Ok, will do that.
On Sun, Aug 2, 2015 at 8:05 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-08-02 17:04:07 +0200, Andres Freund wrote:
> I've attached a version of the patch that should address Heikki's
> concern. It imo also improves the API and increases debuggability by not
> having stale variable values in the variables anymore. (also attached is
> a minor optimization that Heikki has noticed)
I've run these (from your git commit) for over 60 hours with no problem, so I think it's solved.
If there are still problems, hopefully they will come out in other tests. I don't know why the gin test was so efficient at provoking the problem while none of the other ones (btree-upsert, gist, brin, btree-foreign key, btree-prepare transaction) I've played around with were. Perhaps it is just due to the amount of WAL which gin generates.
Cheers,
Jeff
On 2015-08-05 11:22:55 -0700, Jeff Janes wrote:
> On Sun, Aug 2, 2015 at 8:05 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-08-02 17:04:07 +0200, Andres Freund wrote:
> > > I've attached a version of the patch that should address Heikki's
> > > concern. It imo also improves the API and increases debuggability by not
> > > having stale variable values in the variables anymore. (Also attached is
> > > a minor optimization that Heikki has noticed.)
>
> I've run these (from your git commit) for over 60 hours with no problem,
> so I think it's solved.

Cool! Thanks for the testing.

> If there are still problems, hopefully they will come out in other tests.
> I don't know why the gin test was so efficient at provoking the problem
> while none of the other ones (btree-upsert, gist, brin, btree-foreign key,
> btree-prepare transaction) I've played around with were. Perhaps it is
> just due to the amount of WAL which gin generates.

I'm not sure either... I've been able to reproduce a version of the issue
by judiciously sprinkling pg_usleep()'s over the code.

Regards,

Andres
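(For anyone wanting to reproduce it the same way: the sleeps in question are plain pg_usleep() calls dropped into the suspected window; the guard macro below is invented for illustration and does not exist in the tree. For example, a stall between setting the lock bit and storing the new value into the variable makes the race far easier to hit:)

#ifdef LWLOCK_RACE_TESTING		/* hypothetical compile-time switch */
	/*
	 * Example placement: in the acquire path, after LW_VAL_EXCLUSIVE has
	 * been set but before *valptr is updated.  A 10 ms stall here gives a
	 * concurrent LWLockWaitForVar() caller ample time to see "lock held,
	 * value unchanged" and go to sleep at just the wrong moment.
	 */
	pg_usleep(10000L);
#endif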