Thread: Still something fishy in the fastpath lock stuff

From: Tom Lane

Buildfarm member prairiedog crashed on today's HEAD with this:

TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)

I have the core file, and gdb says:

#0  0x90047dac in kill ()
#1  0x9012d7b4 in abort ()
#2  0x003b1c38 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x2 "", lineNumber=109) at assert.c:54
#3  0x002960bc in RemoveLocalLock (locallock=0x280b414) at lock.c:1240
#4  0x002968a4 in LockReleaseAll (lockmethodid=1, allLocks=0 '\0') at lock.c:2087
#5  0x00299550 in ProcReleaseLocks (isCommit=1 '\001') at proc.c:752
#6  0x003de2d4 in ResourceOwnerReleaseInternal (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:307
#7  0x003de71c in ResourceOwnerRelease (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:212
#8  0x00083b7c in CommitTransaction () at xact.c:1998
#9  0x00083ea4 in CommitTransactionCommand () at xact.c:2713
#10 0x002ab9b4 in finish_xact_command () at postgres.c:2408
#11 0x002af2b0 in exec_simple_query (query_string=0x2807e1c "CREATE TABLE t3 (name TEXT, n INTEGER);") at postgres.c:1076
#12 0x002b08bc in PostgresMain (argc=41975324, argv=0xbfffdd3c, dbname=0x2800b6c "regression", username=0x51 "") at postgres.c:4004
#13 0x0023d5bc in ServerLoop () at postmaster.c:4089
#14 0x0023f11c in PostmasterMain (argc=6, argv=0x46d9d0) at postmaster.c:1222
#15 0x001bb89c in main (argc=6, argv=0x2100710) at main.c:203

Since this machine has only been running the buildfarm for a week,
I rather imagine we will see more of these.  Thoughts?

(This is a PPC machine, but only a single processor, so it's hard
to see how memory ordering issues might enter into it ...)
        regards, tom lane



Re: Still something fishy in the fastpath lock stuff

From: Robert Haas

On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm member prairiedog crashed on today's HEAD with this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 1240)
>
> I have the core file, and gdb says:
>
> #0  0x90047dac in kill ()
> #1  0x9012d7b4 in abort ()
> #2  0x003b1c38 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x2 "", lineNumber=109) at assert.c:54
> #3  0x002960bc in RemoveLocalLock (locallock=0x280b414) at lock.c:1240
> #4  0x002968a4 in LockReleaseAll (lockmethodid=1, allLocks=0 '\0') at lock.c:2087
> #5  0x00299550 in ProcReleaseLocks (isCommit=1 '\001') at proc.c:752
> #6  0x003de2d4 in ResourceOwnerReleaseInternal (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:307
> #7  0x003de71c in ResourceOwnerRelease (owner=0x2802184, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=1 '\001') at resowner.c:212
> #8  0x00083b7c in CommitTransaction () at xact.c:1998
> #9  0x00083ea4 in CommitTransactionCommand () at xact.c:2713
> #10 0x002ab9b4 in finish_xact_command () at postgres.c:2408
> #11 0x002af2b0 in exec_simple_query (query_string=0x2807e1c "CREATE TABLE t3 (name TEXT, n INTEGER);") at postgres.c:1076
> #12 0x002b08bc in PostgresMain (argc=41975324, argv=0xbfffdd3c, dbname=0x2800b6c "regression", username=0x51 "") at postgres.c:4004
> #13 0x0023d5bc in ServerLoop () at postmaster.c:4089
> #14 0x0023f11c in PostmasterMain (argc=6, argv=0x46d9d0) at postmaster.c:1222
> #15 0x001bb89c in main (argc=6, argv=0x2100710) at main.c:203
>
> Since this machine has only been running the buildfarm for a week,
> I rather imagine we will see more of these.  Thoughts?
>
> (This is a PPC machine, but only a single processor, so it's hard
> to see how memory ordering issues might enter into it ...)

Well, it's possible that the issue could be related to compiler
reordering, since it's still the rule that SpinLockAcquire/Release
must be memory barriers but need not be compiler barriers, and
FastPathStrongRelationLocks is not volatile-ized.  I really think we
should change that rule; it leads to ugly code, and bugs.  But to
determine whether that's the issue, we'd probably need to disassemble
the relevant code and see whether the compiler did in fact shift
things around.
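
To illustrate the hazard, here's a minimal sketch (struct layout
simplified from memory, not quoted from lock.c): nothing in C prevents
the compiler from sinking the non-volatile store past the release.

/* Sketch only: simplified stand-ins for the real lock.c declarations. */
typedef struct
{
    slock_t     mutex;
    uint32      count[1024];
} FastPathStrongRelationLockData;

static FastPathStrongRelationLockData *FastPathStrongRelationLocks;

static void
decrement_strong_lock_count(uint32 fasthashcode)
{
    SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
    Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
    FastPathStrongRelationLocks->count[fasthashcode]--;

    /*
     * If SpinLockRelease() is not a compiler barrier and the pointer
     * isn't volatile-qualified, the store above may legally be moved
     * past the release, so another backend can observe a stale count.
     */
    SpinLockRelease(&FastPathStrongRelationLocks->mutex);
}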

But it seems equally possible that the bug is the result of some more
pedestrian error.  I don't have a great idea where to look for such a
mistake, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Still something fishy in the fastpath lock stuff

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 25, 2014 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Since this machine has only been running the buildfarm for a week,
>> I rather imagine we will see more of these.  Thoughts?
>> 
>> (This is a PPC machine, but only a single processor, so it's hard
>> to see how memory ordering issues might enter into it ...)

> Well, it's possible that the issue could be related to compiler
> reordering, since it's still the rule that SpinLockAcquire/Release
> must be memory barriers but need not be compiler barriers, and
> FastPathStrongRelationLocks is not volatile-ized.

Yeah, I was just about to complain about that.  What made you think
you could get away with ignoring that coding rule?

> I really think we
> should change that rule; it leads to ugly code, and bugs.

No doubt, but are we prepared to believe that we have working "compiler
barriers" other than volatile?  (Which, I will remind you, is in the C
standard.  Unlike "compiler barriers".)

> But to
> determine whether that's the issue, we'd probably need to disassemble
> the relevant code and see whether the compiler did in fact shift
> things around.

I looked at the asm code around line 1240, and it seems all right, but
if this is a locking problem then the actual failure was probably in some
preceding spinlocked segment.  I haven't the patience to look at each one
of these segments, and it's sort of irrelevant anyway, because whether
this particular machine did reorder the asm code or not, the fact remains
that *this C code is broken*.  The fact that you don't like the existing
coding rules concerning spinlocks is not license to write unsafe code.
        regards, tom lane



Re: Still something fishy in the fastpath lock stuff

From: Robert Haas

On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, it's possible that the issue could be related to compiler
>> reordering, since it's still the rule that SpinLockAcquire/Release
>> must be memory barriers but need not be compiler barriers, and
>> FastPathStrongRelationLocks is not volatile-ized.
>
> Yeah, I was just about to complain about that.  What made you think
> you could get away with ignoring that coding rule?

Um, nothing.  It was just an oversight.  If the implication here is
that I deliberately ignore PostgreSQL coding rules, then I resent
that.  If the implication is that I don't understand them, the fact
that I made explicit reference to the existence of this precise rule
in my previous response would seem to militate against that
interpretation.

>> I really think we
>> should change that rule; it leads to ugly code, and bugs.
>
> No doubt, but are we prepared to believe that we have working "compiler
> barriers" other than volatile?  (Which, I will remind you, is in the C
> standard.  Unlike "compiler barriers".)

I'm prepared to believe that it would take some time to be certain
that we've got that right on all compilers we support.  The most
common ones are easily dealt with, I think, but it might be harder to
verify the behavior on more obscure compilers.  But I'd rather deal
with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in
general* than deal with bullet-proofing every call site individually,
which is what we have to do right now.

Keep in mind that we don't require the use of volatile for critical
sections involving LWLockAcquire() and LWLockRelease(), so we
implicitly assume that those functions DO act as compiler barriers.  I
think we're mostly just hoping that the compiler isn't smart enough to
make deductions about the behavior of a function in another
translation unit, or that if it is the combination of
SpinLockAcquire() + SpinLockRelease() adds up to a compiler barrier
even if they aren't each individually a barrier.  Whatever the reason,
our coding rule is currently that you've got to volatile-ize all
references within sections protected by a spinlock, but not sections
protected by an lwlock.  Without intending to defend the fact that I
failed to do it correctly in this case, I don't find it surprising
that people flub that occasionally: this is not the first time someone
has made such an error and it doubtless won't be the last.  One of the
things I did to Andres's replication slot patch when reviewing it was
fix numerous errors of this type.  I don't think he'll be the last
person to get it wrong, either.
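
For anyone following along, the rule at a call site looks roughly like
this (a sketch; the local variable name is invented):

/* Coding rule: take a volatile-qualified copy of the pointer, and use
 * only that to touch shared state inside the spinlocked section. */
volatile FastPathStrongRelationLockData *fpstrong =
    FastPathStrongRelationLocks;

SpinLockAcquire(&fpstrong->mutex);
fpstrong->count[fasthashcode]++;
SpinLockRelease(&fpstrong->mutex);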

> I looked at the asm code around line 1240, and it seems all right, but
> if this is a locking problem then the actual failure was probably in some
> preceding spinlocked segment.  I haven't the patience to look at each one
> of these segments, and it's sort of irrelevant anyway, because whether
> this particular machine did reorder the asm code or not, the fact remains
> that *this C code is broken*.  The fact that you don't like the existing
> coding rules concerning spinlocks is not license to write unsafe code.

Tom, I never said otherwise.  Flaming me is neither necessary nor
productive.  It seems to me that the simplest thing to do might be
just this:

-static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
+static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;

Do you see any problem with that approach?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Still something fishy in the fastpath lock stuff

From: Heikki Linnakangas

On 03/26/2014 06:59 AM, Robert Haas wrote:
> On Tue, Mar 25, 2014 at 11:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I really think we
>>> should change that rule; it leads to ugly code, and bugs.
>>
>> No doubt, but are we prepared to believe that we have working "compiler
>> barriers" other than volatile?  (Which, I will remind you, is in the C
>> standard.  Unlike "compiler barriers".)
>
> I'm prepared to believe that it would take some time to be certain
> that we've got that right on all compilers we support.  The most
> common ones are easily dealt with, I think, but it might be harder to
> verify the behavior on more obscure compilers.  But I'd rather deal
> with bullet-proofing SpinLockAcquire() and SpinLockRelease() *in
> general* than deal with bullet-proofing every call site individually,
> which is what we have to do right now.

+1. To support porting to new compilers, we can fall back to e.g. calling
a dummy function through a function pointer, if we don't have a proper
implementation.
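
A sketch of what I have in mind (names invented): as long as the
optimizer can't see the callee's definition, it has to assume the call
may read or write any memory, which is exactly the barrier property we
need.

/* Hypothetical fallback compiler barrier: an out-of-line call through
 * a function pointer the optimizer cannot see through. */
extern void (*pg_compiler_barrier_fn) (void);

#define pg_compiler_barrier()   ((*pg_compiler_barrier_fn) ())

/* In one translation unit: */
static void
pg_compiler_barrier_dummy(void)
{
}

void       (*pg_compiler_barrier_fn) (void) = pg_compiler_barrier_dummy;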

Aside from the correctness issue: A while ago, while working on the 
xloginsert slots stuff, I looked at the assembler that gcc generated for 
ReserveXLogInsertLocation(). That function is basically:
    SpinLockAcquire()
    <modify and read a few variables in shared memory>
    SpinLockRelease()
    <fairly expensive calculations based on values read>

Gcc decided to move the release of the lock so that it became:
    SpinLockAcquire()
    <modify and read a few variables in shared memory>
    <fairly expensive calculations based on values read>
    SpinLockRelease()

I went through some effort while writing the patch to make sure the 
spinlock is held for as short duration as possible. But gcc undid that 
to some extent :-(.

I tried to put a compiler barrier there and ran some short performance
tests, but it didn't make any noticeable difference. So it doesn't seem to
matter in practice - maybe the CPU reorders things anyway, or the
calculations are not expensive enough to matter after all.
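
(For concreteness: with gcc, the "proper implementation" of a compiler
barrier would presumably be the usual empty-asm idiom, which emits no
instructions at all:)

/* gcc-style compiler barrier: generates no code, but the "memory"
 * clobber forbids moving loads and stores across it. */
#define pg_compiler_barrier()   __asm__ __volatile__("" : : : "memory")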

I just looked at the assembler generated for LWLockAcquire, and a 
similar thing is happening there. The code is:
...
641    /* We are done updating shared state of the lock itself. */
642    SpinLockRelease(&lock->mutex);
643
644    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
645
646    /* Add lock to list of locks held by this backend */
647    held_lwlocks[num_held_lwlocks++] = l;
...

gcc reordered part of the "held_lwlocks" assignment after releasing the 
spinlock:

.L21:
        .loc 1 647 0
        movslq  num_held_lwlocks(%rip), %rax
        addq    $16, %r12
.LVL23:
        .loc 1 652 0
        testl   %ebx, %ebx
        .loc 1 642 0
        movb    $0, (%r14)    <--- SpinLockRelease
        .loc 1 652 0
        leal    -1(%rbx), %r13d
        .loc 1 647 0
        leal    1(%rax), %edx
        movq    %r14, held_lwlocks(,%rax,8)
.LVL24:
        movl    %edx, num_held_lwlocks(%rip)

The held_lwlocks assignment was deliberately put outside the 
spinlock-protected section, to minimize the time the spinlock is held, 
but the compiler moved some of it back in: the basic block .L21.

A compiler barrier would avoid that kind of reordering. Not using 
volatile would also allow the compiler to reorder and optimize the 
instructions *within* the spinlock-protected block, which might shave a 
few instructions while a spinlock is held. Granted, spinlock-protected 
blocks are small so there isn't much room for optimization, but still.
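
To spell that last point out with a contrived sketch (SharedStruct and
shmem are invented names):

/* Barriers only at the lock boundaries: the compiler may still
 * optimize freely in between. */
SpinLockAcquire(&shmem->mutex);
shmem->counter++;
shmem->counter++;           /* may be fused into a single "add 2" */
SpinLockRelease(&shmem->mutex);

/* Volatile-qualified pointer: each access must be performed exactly
 * as written, in order. */
volatile SharedStruct *vshmem = shmem;
SpinLockAcquire(&vshmem->mutex);
vshmem->counter++;
vshmem->counter++;
SpinLockRelease(&vshmem->mutex);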

- Heikki



Re: Still something fishy in the fastpath lock stuff

From: Robert Haas

On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that the simplest thing to do might be
> just this:
>
> -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
> +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>
> Do you see any problem with that approach?

Hearing nothing, I went ahead and did this.  I see that the failure on
prairiedog only occurred once, so I don't know how we'll know whether
this fixed the problem that caused that failure or merely fixed a
problem that could cause a failure, but I'm not sure what else we can
really do at this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Still something fishy in the fastpath lock stuff

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 26, 2014 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It seems to me that the simplest thing to do might be
>> just this:
>> 
>> -static FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>> +static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
>> 
>> Do you see any problem with that approach?

> Hearing nothing, I went ahead and did this.

Sorry, I've been preoccupied with personal matters for the past little
while, and hadn't had time to think about this.  After a review of the
code, I think it's probably all right.

> I see that the failure on
> prairiedog only occurred once, so I don't know how we'll know whether
> this fixed the problem that caused that failure or merely fixed a
> problem that could cause a failure, but I'm not sure what else we can
> really do at this point.

Well, it's only one failure, but it occurred after prairiedog had run
a grand total of 20 buildfarm cycles on 9.2 and newer, so I'm thinking
that the probability of failure on that machine is more than epsilon.
If we don't see it again for a good long while then I'll believe that
this fixed it.
        regards, tom lane