Thread: Still something fishy in the fastpath lock stuff
Buildfarm member prairiedog crashed on today's HEAD with this:

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)

I have the core file, and gdb says:

#0  0x90047dac in kill ()
#1  0x9012d7b4 in abort ()
#2  0x003b1c38 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x2 "", lineNumber=109) at assert.c:54
#3  0x002960bc in RemoveLocalLock (locallock=0x280b414) at lock.c:1240
#4  0x002968a4 in LockReleaseAll (lockmethodid=1, allLocks=0 '\0') at lock.c:2087
#5  0x00299550 in ProcReleaseLocks (isCommit=1 '\001') at proc.c:752
#6  0x003de2d4 in ResourceOwnerReleaseInternal (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:307
#7  0x003de71c in ResourceOwnerRelease (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:212
#8  0x00083b7c in CommitTransaction () at xact.c:1998
#9  0x00083ea4 in CommitTransactionCommand () at xact.c:2713
#10 0x002ab9b4 in finish_xact_command () at postgres.c:2408
#11 0x002af2b0 in exec_simple_query (query_string=0x2807e1c "CREATE TABLE t3 (name TEXT, n INTEGER);") at postgres.c:1076
#12 0x002b08bc in PostgresMain (argc=41975324, argv=0xbfffdd3c, dbname=0x2800b6c "regression", username=0x51 "") at postgres.c:4004
#13 0x0023d5bc in ServerLoop () at postmaster.c:4089
#14 0x0023f11c in PostmasterMain (argc=6, argv=0x46d9d0) at postmaster.c:1222
#15 0x001bb89c in main (argc=6, argv=0x2100710) at main.c:203

Since this machine has only been running the buildfarm for a week, I rather imagine we will see more of these. Thoughts?

(This is a PPC machine, but only a single processor, so it's hard to see how memory ordering issues might enter into it ...)

			regards, tom lane
On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm member prairiedog crashed on today's HEAD with this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
>
> I have the core file, and gdb says:
>
> #0  0x90047dac in kill ()
> #1  0x9012d7b4 in abort ()
> #2  0x003b1c38 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x2 "", lineNumber=109) at assert.c:54
> #3  0x002960bc in RemoveLocalLock (locallock=0x280b414) at lock.c:1240
> #4  0x002968a4 in LockReleaseAll (lockmethodid=1, allLocks=0 '\0') at lock.c:2087
> #5  0x00299550 in ProcReleaseLocks (isCommit=1 '\001') at proc.c:752
> #6  0x003de2d4 in ResourceOwnerReleaseInternal (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:307
> #7  0x003de71c in ResourceOwnerRelease (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:212
> #8  0x00083b7c in CommitTransaction () at xact.c:1998
> #9  0x00083ea4 in CommitTransactionCommand () at xact.c:2713
> #10 0x002ab9b4 in finish_xact_command () at postgres.c:2408
> #11 0x002af2b0 in exec_simple_query (query_string=0x2807e1c "CREATE TABLE t3 (name TEXT, n INTEGER);") at postgres.c:1076
> #12 0x002b08bc in PostgresMain (argc=41975324, argv=0xbfffdd3c, dbname=0x2800b6c "regression", username=0x51 "") at postgres.c:4004
> #13 0x0023d5bc in ServerLoop () at postmaster.c:4089
> #14 0x0023f11c in PostmasterMain (argc=6, argv=0x46d9d0) at postmaster.c:1222
> #15 0x001bb89c in main (argc=6, argv=0x2100710) at main.c:203
>
> Since this machine has only been running the buildfarm for a week,
> I rather imagine we will see more of these. Thoughts?
>
> (This is a PPC machine, but only a single processor, so it's hard
> to see how memory ordering issues might enter into it ...)
Well, it's possible that the issue could be related to compiler reordering, since it's still the rule that SpinLockAcquire/Release must be memory barriers but need not be compiler barriers, and FastPathStrongRelationLocks is not volatile-ized. I really think we should change that rule; it leads to ugly code, and bugs.

But to determine whether that's the issue, we'd probably need to disassemble the relevant code and see whether the compiler did in fact shift things around. But it seems equally possible that the bug is the result of some more pedestrian error. I don't have a great idea where to look for such a mistake, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Since this machine has only been running the buildfarm for a week,
>> I rather imagine we will see more of these. Thoughts?
>>
>> (This is a PPC machine, but only a single processor, so it's hard
>> to see how memory ordering issues might enter into it ...)

> Well, it's possible that the issue could be related to compiler
> reordering, since it's still the rule that SpinLockAcquire/Release
> must be memory barriers but need not be compiler barriers, and
> FastPathStrongRelationLocks is not volatile-ized.

Yeah, I was just about to complain about that. What made you think you could get away with ignoring that coding rule?

> I really think we
> should change that rule; it leads to ugly code, and bugs.

No doubt, but are we prepared to believe that we have working "compiler barriers" other than volatile? (Which, I will remind you, is in the C standard. Unlike "compiler barriers".)

> But to
> determine whether that's the issue, we'd probably need to disassemble
> the relevant code and see whether the compiler did in fact shift
> things around.

I looked at the asm code around line 1240, and it seems all right, but if this is a locking problem then the actual failure was probably in some preceding spinlocked segment. I haven't the patience to look at each one of these segments, and it's sort of irrelevant anyway, because whether this particular machine did reorder the asm code or not, the fact remains that *this C code is broken*. The fact that you don't like the existing coding rules concerning spinlocks is not license to write unsafe code.

			regards, tom lane
On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, it's possible that the issue could be related to compiler
>> reordering, since it's still the rule that SpinLockAcquire/Release
>> must be memory barriers but need not be compiler barriers, and
>> FastPathStrongRelationLocks is not volatile-ized.
>
> Yeah, I was just about to complain about that. What made you think
> you could get away with ignoring that coding rule?

Um, nothing. It was just an oversight. If the implication here is that I deliberately ignore PostgreSQL coding rules, then I resent that. If the implication is that I don't understand them, the fact that I made explicit reference to the existence of this precise rule in my previous response would seem to militate against that interpretation.

>> I really think we
>> should change that rule; it leads to ugly code, and bugs.
>
> No doubt, but are we prepared to believe that we have working "compiler
> barriers" other than volatile? (Which, I will remind you, is in the C
> standard. Unlike "compiler barriers".)

I'm prepared to believe that it would take some time to be certain that we've got that right on all compilers we support. The most common ones are easily dealt with, I think, but it might be harder to verify the behavior on more obscure compilers. But I'd rather deal with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in general* than deal with bullet-proofing every call site individually, which is what we have to do right now. Keep in mind that we don't require the use of volatile for critical sections involving LWLockAcquire() and LWLockRelease(), so we're implicitly assuming that those functions DO act as compiler barriers.
I think we're mostly just hoping that the compiler isn't smart enough to make deductions about the behavior of a function in another translation unit, or that if it is, the combination of SpinLockAcquire() + SpinLockRelease() adds up to a compiler barrier even if they aren't each individually a barrier.

Whatever the reason, our coding rule is currently that you've got to volatile-ize all references within sections protected by a spinlock, but not sections protected by an lwlock. Without intending to defend the fact that I failed to do it correctly in this case, I don't find it surprising that people flub that occasionally: this is not the first time someone has made such an error and it doubtless won't be the last. One of the things I did to Andres's replication slot patch when reviewing it was fix numerous errors of this type. I don't think he'll be the last person to get it wrong, either.

> I looked at the asm code around line 1240, and it seems all right, but
> if this is a locking problem then the actual failure was probably in some
> preceding spinlocked segment. I haven't the patience to look at each one
> of these segments, and it's sort of irrelevant anyway, because whether
> this particular machine did reorder the asm code or not, the fact remains
> that *this C code is broken*. The fact that you don't like the existing
> coding rules concerning spinlocks is not license to write unsafe code.

Tom, I never said otherwise. Flaming me is neither necessary nor productive.

It seems to me that the simplest thing to do might be just this:

-static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;

Do you see any problem with that approach?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/26/2014 06:59 AM, Robert Haas wrote:
> On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I really think we
>>> should change that rule; it leads to ugly code, and bugs.
>>
>> No doubt, but are we prepared to believe that we have working "compiler
>> barriers" other than volatile? (Which, I will remind you, is in the C
>> standard. Unlike "compiler barriers".)
>
> I'm prepared to believe that it would take some time to be certain
> that we've got that right on all compilers we support. The most
> common ones are easily dealt with, I think, but it might be harder to
> verify the behavior on more obscure compilers. But I'd rather deal
> with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in
> general* than deal with bullet-proofing every call site individually,
> which is what we have to do right now.

+1. To support porting to new compilers, we can fall back to e.g. calling a dummy function through a function pointer, if we don't have a proper implementation.

Aside from the correctness issue: a while ago, while working on the xloginsert slots stuff, I looked at the assembler that gcc generated for ReserveXLogInsertLocation(). That function is basically:

SpinLockAcquire()
<modify and read a few variables in shared memory>
SpinLockRelease()
<fairly expensive calculations based on values read>

Gcc decided to move the release of the lock so that it became:

SpinLockAcquire()
<modify and read a few variables in shared memory>
<fairly expensive calculations based on values read>
SpinLockRelease()

I went through some effort while writing the patch to make sure the spinlock is held for as short a duration as possible. But gcc undid that to some extent :-(. I tried to put a compiler barrier there and run some short performance tests, but it didn't make any noticeable difference.
So it doesn't seem to matter in practice - maybe the CPU reorders things anyway, or the calculations are not expensive enough to matter after all.

I just looked at the assembler generated for LWLockAcquire, and a similar thing is happening there. The code is:

...
641     /* We are done updating shared state of the lock itself. */
642     SpinLockRelease(&lock->mutex);
643
644     TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
645
646     /* Add lock to list of locks held by this backend */
647     held_lwlocks[num_held_lwlocks++] = l;
...

gcc reordered part of the "held_lwlocks" assignment after releasing the spinlock:

.L21:
	.loc 1 647 0
	movslq	num_held_lwlocks(%rip), %rax
	addq	$16, %r12
.LVL23:
	.loc 1 652 0
	testl	%ebx, %ebx
	.loc 1 642 0
	movb	$0, (%r14)     <--- SpinLockRelease
	.loc 1 652 0
	leal	-1(%rbx), %r13d
	.loc 1 647 0
	leal	1(%rax), %edx
	movq	%r14, held_lwlocks(,%rax,8)
.LVL24:
	movl	%edx, num_held_lwlocks(%rip)

The held_lwlocks assignment was deliberately put outside the spinlock-protected section, to minimize the time the spinlock is held, but the compiler moved some of it back in: the basic block .L21. A compiler barrier would avoid that kind of reordering.

Not using volatile would also allow the compiler to reorder and optimize the instructions *within* the spinlock-protected block, which might shave a few instructions while a spinlock is held. Granted, spinlock-protected blocks are small so there isn't much room for optimization, but still.

- Heikki
On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that the simplest thing to do might be
> just this:
>
> -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
> +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>
> Do you see any problem with that approach?

Hearing nothing, I went ahead and did this. I see that the failure on prairiedog only occurred once, so I don't know how we'll know whether this fixed the problem that caused that failure or merely fixed a problem that could cause a failure, but I'm not sure what else we can really do at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It seems to me that the simplest thing to do might be
>> just this:
>>
>> -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>> +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>>
>> Do you see any problem with that approach?

> Hearing nothing, I went ahead and did this.

Sorry, I've been preoccupied with personal matters for the past little while, and hadn't had time to think about this. After a review of the code it looks probably all right to me.

> I see that the failure on
> prairiedog only occurred once, so I don't know how we'll know whether
> this fixed the problem that caused that failure or merely fixed a
> problem that could cause a failure, but I'm not sure what else we can
> really do at this point.

Well, it's only one failure, but it occurred after prairiedog had run a grand total of 20 buildfarm cycles on 9.2 and newer, so I'm thinking that the probability of failure on that machine is more than epsilon. If we don't see it again for a good long while then I'll believe that this fixed it.

			regards, tom lane