From: Andres Freund
Subject: Re: Spinlocks and compiler/memory barriers
Date:
Msg-id: 20140630151704.GN26930@awork2.anarazel.de
In response to: Re: Spinlocks and compiler/memory barriers (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Spinlocks and compiler/memory barriers (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2014-06-30 10:05:44 -0400, Robert Haas wrote:
> On Mon, Jun 30, 2014 at 9:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think it continues in the tradition of making s_lock.h ever harder to
> > follow. But it's still better than what we have now from a correctness
> > perspective.
> 
> Well, as you and I have discussed before, someday we probably need to
> get rid of s_lock.h or at least refactor it heavily, but let's do one
> thing at a time.

Well, that task only gets harder if we make it more complex now...

(will answer separately about sun studio)

> > Solaris seems to run with TSO enabled for SPARC (whereas linux uses RMO,
> > relaxed memory ordering), so it's probably fine to just use the compiler
> > barrier.
> 
> If it isn't, that's a change that has nothing to do with making
> spinlock operations compiler barriers and everything to do with fixing
> a bug in the existing implementation.

Well, it actually does have something to do with it. If we start removing
volatiles from critical sections, the compiler will schedule stores closer
to the spinlock release and reorder reads more freely - making externally
visible ordering violations more likely. Especially on Itanium.

So I agree that we need to fix unlocks that aren't sufficiently strong
memory barriers, but it becomes more urgent once we stop relying on
volatile inside critical sections.
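
As a made-up illustration (not actual backend code, just the shape of the
problem): once the volatile qualifier is gone, only a compiler barrier in
S_UNLOCK() keeps the compiler from sinking the protected store past the
release:

/* hypothetical illustration, not actual backend code */
#include "postgres.h"
#include "storage/s_lock.h"

static slock_t lck;
static int	shared_counter;		/* protected by lck, no longer volatile-qualified */

static void
bump_counter(void)
{
	S_LOCK(&lck);
	shared_counter++;			/* a plain store now */
	S_UNLOCK(&lck);				/* unless this is a compiler barrier (and, on
								 * weakly ordered hardware, a CPU release
								 * barrier), the store above can be moved past
								 * the release and become visible outside the
								 * critical section */
}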

>  Basically, every platform we support today has a
> spinlock implementation that is supposed to prevent CPU reordering
> across the acquire and release operations but might not prevent
> compiler reordering.  IOW, S_LOCK() should be a CPU acquire fence, and
> S_UNLOCK() should be a CPU release fence.

Well, I think s_lock.h falls woefully short of that goal on several
platforms. Scary.

> Now, we want to make these
> operations compiler fences as well, and AIUI your proposal for that is
> to make NEW_S_UNLOCK(lock) = out of line function call + S_LOCK(dummy)
> + S_UNLOCK(dummy) + S_UNLOCK(lock).

My proposal was to use the barriers provided by barrier.h as long as it
has a native implementation. If neither a compiler nor a memory barrier
is provided, my proposal was to fall back to the memory barrier emulation
we already have, but to move it out of line (to avoid reordering issues).
The reason for doing so is that the release *has* to be a release
barrier.

To avoid recursion, the S_UNLOCK() used inside the out-of-line memory
barrier implementation is a modified variant that doesn't include a
barrier.
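
Roughly like this - a sketch only, the PG_HAVE_NATIVE_BARRIER and
s_unlock_slowpath names are invented for illustration:

/* sketch only; PG_HAVE_NATIVE_BARRIER and s_unlock_slowpath are invented names */
#ifdef PG_HAVE_NATIVE_BARRIER
/* barrier.h gives us a real compiler and/or memory barrier */
#define S_UNLOCK(lock) \
	do { pg_memory_barrier(); (*(lock)) = 0; } while (0)
#else
/* no native barrier: go through an out-of-line call, which the compiler
 * cannot reorder stores/loads across */
extern void s_unlock_slowpath(volatile slock_t *lock);
#define S_UNLOCK(lock) s_unlock_slowpath(lock)
#endif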

> My proposal is to make
> NEW_S_UNLOCK(lock) = out of line function call + S_UNLOCK(lock), which
> I think is strictly better.  There's zip to guarantee that adding
> S_LOCK(dummy) + S_UNLOCK(dummy) is doing anything we want in there,
> and it's definitely going to be worse for performance.

Uh? At the very least, doing an S_LOCK() guarantees that we're issuing
some sort of memory barrier, which yours doesn't at all. That'd actually
fix the majority of platforms with borked release semantics, because
pretty much all tas() implementations are full barriers.
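
(That's also what barrier.h's existing fallback relies on; from memory it
is essentially the following:)

/* barrier.h's existing fallback, roughly: emulate a full barrier by
 * acquiring and releasing a dummy spinlock */
extern slock_t dummy_spinlock;

#define pg_memory_barrier() \
	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)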

> The other thing that I don't like about your proposal has to do with
> the fact that the support matrix for barrier.h and s_lock.h are not
> identical.  If s_lock.h is confusing to you today, making it further
> intertwined with barrier.h is not going to make things better; at
> least, that confuses the crap out of me.

Uh. It actually *removes* the confusing edge of the dependency. It's
rather confusing that memory barriers rely on spinlocks and not the other
way round. I think we should actually do that unconditionally,
independent of any other changes. The only reasons barrier.h includes
s_lock.h are that dummy_spinlock is declared there and that the memory
barrier emulation is defined inline.
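
One way to get rid of that edge (just a sketch; pg_spinlock_barrier is an
invented name) is to move the emulation out of line:

/* in barrier.h -- no need to include s_lock.h anymore */
extern void pg_spinlock_barrier(void);
#define pg_memory_barrier() pg_spinlock_barrier()

/* in s_lock.c, which can use the spinlock machinery freely */
slock_t		dummy_spinlock;

void
pg_spinlock_barrier(void)
{
	S_LOCK(&dummy_spinlock);
	S_UNLOCK(&dummy_spinlock);	/* per above, this must stay the barrier-free
								 * variant, or we'd recurse */
}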

> > 1)
> > Afaics ARM will fall back to this for older gccs - and ARM is weakly
> > ordered. I think I'd just also use swpb to release the lock. Something
> > like
> > #define S_INIT_LOCK(lock)       (*(lock) = 0)
> >
> > #define S_UNLOCK s_unlock
> > static __inline__ void
> > s_unlock(volatile slock_t *lock)
> > {
> >         register slock_t _res = 0;
> >
> >         __asm__ __volatile__(
> >                 "       swpb    %0, %0, [%2]    \n"
> > :               "+r"(_res), "+m"(*lock)
> > :               "r"(lock)
> > :               "memory");
> >         Assert(_res == 1);      /* lock should be held by us */
> > }
> >
> > 2)
> > Afaics it's also wrong for sparc on linux (and probably the BSDs) where
> > relaxed memory ordering is used.
> >
> > 3)
> > Also looks wrong on mips which needs a full memory barrier.
> 
> You're mixing apples and oranges again.

No, I'm not.

> If the existing definitions
> aren't CPU barriers, they're already broken, and we should fix that
> and back-patch.

Yea, and that gets harder if we do it after master has changed
incompatibly. And, as explained above, we need to fix S_UNLOCK() to be a
memory barrier before we can remove volatiles - which is the goal of
your patch, no?

> On the other hand, the API contract change making
> these into compiler barriers is a master-only change.  I think for
> purposes of this patch we should assume that the existing code is
> sufficient to prevent CPU reordering and just focus on fixing compiler
> ordering problems.  Whatever needs to change on the CPU-reordering end
> of things is a separate commit.

I'm not arguing against making that a separate commit, just that the
proper memory barrier bit has to come first.

> >> @@ -826,6 +845,9 @@ spin_delay(void)
> >>  }
> >>  #endif
> >>
> >> +#define S_UNLOCK(lock)       \
> >> +     do { MemoryBarrier(); (*(lock)) = 0; } while (0)
> >> +
> >>  #endif
> >
> > Hm. Won't this end up being a full memory barrier on x86-64, even though
> > a compiler barrier would suffice on x86? In my measurements on Linux
> > x86-64 that has a pretty hefty performance penalty on NUMA systems.
> 
> Ah, crap, that should have been _ReadWriteBarrier().

I think that needs a
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
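
I.e. something along these lines (a sketch; where the include and pragma
end up living is a separate question):

/* sketch of the corrected MSVC x86/x86-64 definition */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)

#define S_UNLOCK(lock)	\
	do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

_ReadWriteBarrier() is a compiler-only fence, which is all the release
needs under x86's TSO model.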

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


