Thread: SIGSEGV in GrantLockLocal()
Recently, I've encountered a core dump several times on master, with a
backtrace like the one below. This one happened on 0f23dedc9. I was
running some fuzz testing and had started around 20 sessions
concurrently.

(gdb) bt
#0 in GrantLockLocal at lock.c:1758
#1 in GrantAwaitedLock at lock.c:1840
#2 in LockErrorCleanup at proc.c:809
#3 in AbortTransaction at xact.c:2846
#4 in AbortCurrentTransactionInternal at xact.c:3520
#5 in AbortCurrentTransaction at xact.c:3449
#6 in PostgresMain at postgres.c:4535
#7 in BackendMain at backend_startup.c:107
#8 in postmaster_child_launch at launch_backend.c:274
#9 in BackendStartup at postmaster.c:3391
#10 in ServerLoop at postmaster.c:1678
#11 in PostmasterMain at postmaster.c:1376
#12 in main at main.c:224

It seems that the lock request is not granted as expected, since
locallock->lockOwners is a NULL pointer.

(gdb) p locallock->lockOwners
$4 = (LOCALLOCKOWNER *) 0x0
(gdb) p locallock->numLockOwners
$5 = 0
(gdb) p locallock->maxLockOwners
$6 = 8

Unfortunately, I don't have a reliable way to trigger this issue. I'm
wondering if anyone has any insights into what might be happening.

Thanks
Richard
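To make the reported state concrete, here is a minimal sketch of how an owner array that is NULL while numLockOwners is 0 and maxLockOwners is 8 could lead to a crash. It is not the code from lock.c: the Sketch* types and sketch_* functions are hypothetical stand-ins, and the sketch assumes the grant step searches the existing owner entries and otherwise appends a new entry at index numLockOwners. Under that assumption the search loop is skipped when numLockOwners is 0, and the first write to slot 0 goes through the NULL pointer. The sketch returns false at that point instead of dereferencing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the PostgreSQL definitions. */
typedef struct SketchLockOwner
{
    const void *owner;      /* opaque identity of the holder */
    long        nLocks;     /* holds recorded for this holder */
} SketchLockOwner;

typedef struct SketchLocalLock
{
    long             nLocks;         /* total holds on this lock */
    int              numLockOwners;  /* owner entries in use */
    int              maxLockOwners;  /* allocated capacity */
    SketchLockOwner *lockOwners;     /* owner array, NULL when absent */
} SketchLocalLock;

/* True for the state seen in the core dump: capacity set, array missing. */
static bool
sketch_owner_array_missing(const SketchLocalLock *lock)
{
    assert(lock != NULL);
    return lock->lockOwners == NULL && lock->maxLockOwners > 0;
}

/*
 * Record one more hold of the lock by "owner".
 * Returns false, leaving the lock untouched, where an unchecked
 * version would write through a NULL array or past its capacity.
 */
static bool
sketch_grant_local(SketchLocalLock *lock, const void *owner)
{
    int i;

    assert(lock != NULL);

    if (lock->lockOwners == NULL)
        return false;           /* the dumped state would crash here */

    /* Existing owner: bump its count and the total. */
    for (i = 0; i < lock->numLockOwners; i++)
    {
        if (lock->lockOwners[i].owner == owner)
        {
            lock->lockOwners[i].nLocks++;
            lock->nLocks++;
            return true;
        }
    }

    if (lock->numLockOwners >= lock->maxLockOwners)
        return false;           /* no free slot left */

    /* New owner: i equals numLockOwners here, so this appends. */
    lock->lockOwners[i].owner = owner;
    lock->lockOwners[i].nLocks = 1;
    lock->numLockOwners++;
    lock->nLocks++;
    return true;
}
```

A check along the lines of sketch_owner_array_missing(), placed in a debugging build just before the grant, might help catch the inconsistent state earlier than the crash does, though where the array is actually lost remains the open question.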
On 18/12/2024 08:17, Richard Guo wrote:
> Recently, I've encountered a core dump several times on master, with a
> backtrace like the one below. This one happened on 0f23dedc9. I was
> running some fuzz testing and had started around 20 sessions
> concurrently.
>
> (gdb) bt
> #0 in GrantLockLocal at lock.c:1758
> #1 in GrantAwaitedLock at lock.c:1840
> #2 in LockErrorCleanup at proc.c:809
> #3 in AbortTransaction at xact.c:2846
> #4 in AbortCurrentTransactionInternal at xact.c:3520
> #5 in AbortCurrentTransaction at xact.c:3449
> #6 in PostgresMain at postgres.c:4535
> #7 in BackendMain at backend_startup.c:107
> #8 in postmaster_child_launch at launch_backend.c:274
> #9 in BackendStartup at postmaster.c:3391
> #10 in ServerLoop at postmaster.c:1678
> #11 in PostmasterMain at postmaster.c:1376
> #12 in main at main.c:224
>
> It seems that the lock request is not granted as expected, since
> locallock->lockOwners is a NULL pointer.
>
> (gdb) p locallock->lockOwners
> $4 = (LOCALLOCKOWNER *) 0x0
> (gdb) p locallock->numLockOwners
> $5 = 0
> (gdb) p locallock->maxLockOwners
> $6 = 8
>
> Unfortunately, I don't have a reliable way to trigger this issue. I'm
> wondering if anyone has any insights into what might be happening.

I don't know how that can happen, but I suspect commit 3c0fd64fec
because it changed things in that area. If you can find a way to
reproduce that even sporadically, that would be very helpful!

--
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Dec 18, 2024 at 5:36 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I don't know how that can happen, but I suspect commit 3c0fd64fec
> because it changed things in that area. If you can find a way to
> reproduce that even sporadically, that would be very helpful!

Thank you for the information. I'll try running the same pattern of
fuzz testing based on the code before commit 3c0fd64fec to see if this
SIGSEGV happens again. I'm not sure how helpful this will be, though,
as it usually takes several days, or sometimes weeks, to trigger this
issue.

Thanks
Richard