Re: LWLock optimization for multicore Power machines - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: LWLock optimization for multicore Power machines
Msg-id: 20170403185613.v452kjchgqtgquos@alap3.anarazel.de
In response to: Re: LWLock optimization for multicore Power machines (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List: pgsql-hackers
Hi,

On 2017-03-31 13:38:31 +0300, Alexander Korotkov wrote:
> > It seems that on this platform the definition of atomics should be provided
> > by fallback.h.  But it isn't, because I already defined PG_HAVE_ATOMIC_U32_SUPPORT
> > in arch-ppc.h.  I think in this case we shouldn't provide a ppc-specific
> > implementation of pg_atomic_fetch_mask_add_u32().  However, I don't know
> > how to do this given that arch-ppc.h is included before the compiler-specific
> > headers.  Thus, in arch-ppc.h we don't yet know whether we will find an
> > implementation of atomics for this platform.  One possible solution is to
> > provide an assembly implementation for all atomics in arch-ppc.h.

That'd probably not be a good use of effort.


> BTW, an implementation of all atomics in arch-ppc.h would be too invasive and
> shouldn't be considered for v10.
> However, I made the following workaround: declare pg_atomic_uint32 and
> pg_atomic_fetch_mask_add_u32_impl() only when we know that generic-gcc.h
> would declare gcc-based atomics.
> Could you please check it on Apple PPC?

That doesn't sound crazy to me.


> +/*
> + * pg_atomic_fetch_mask_add_u32 - atomically check the masked bits in a
> + * variable and, if they are all clear, add to the variable.

Hm, expanding on this wouldn't hurt.


> + * Returns the value of ptr before the atomic operation.

Sounds a bit like it'd return the pointer itself, not what it points to...


> + * Full barrier semantics.

Does it actually have full barrier semantics?  Based on a quick skim I
don't think the ppc implementation does?



> +#if defined(HAVE_ATOMICS) \
> +    && (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
> +
> +/*
> + * Declare pg_atomic_uint32 structure before generic-gcc.h does it in order to
> + * use it in function arguments.
> + */

If we do this, then the other side needs a pointer back here to keep this
up to date.

> +#define PG_HAVE_ATOMIC_U32_SUPPORT
> +typedef struct pg_atomic_uint32
> +{
> +    volatile uint32 value;
> +} pg_atomic_uint32;
> +
> +/*
> + * Optimized implementation of pg_atomic_fetch_mask_add_u32() for Power
> + * processors.  Atomic operations on Power processors are implemented using
> + * optimistic locking.  'lwarx' instruction 'reserves index', but that
> + * reservation could be broken on 'stwcx.' and then we have to retry.  Thus,
> + * each CAS operation is a loop.  But a loop of CAS operations is then a
> + * two-level nested loop.  Experiments on multicore Power machines show that
> + * we can get a huge benefit from doing this operation in a single loop in
> + * assembly.
> + */
> +#define PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> +                                  uint32 mask, uint32 increment)
> +{
> +    uint32        result,
> +                tmp;
> +
> +    __asm__ __volatile__(
> +    /* read *ptr and reserve index */
> +#ifdef USE_PPC_LWARX_MUTEX_HINT
> +    "    lwarx    %0,0,%5,1    \n"
> +#else
> +    "    lwarx    %0,0,%5        \n"
> +#endif

Hm, we should remove that hint bit stuff at some point...


> +    "    and        %1,%0,%3    \n" /* calculate '*ptr & mask' */
> +    "    cmpwi    %1,0        \n" /* compare '*ptr & mask' vs 0 */
> +    "    bne-    $+16        \n" /* exit on '*ptr & mask != 0' */
> +    "    add        %1,%0,%4    \n" /* calculate '*ptr + increment' */
> +    "    stwcx.    %1,0,%5        \n" /* try to store '*ptr + increment' into *ptr */
> +    "    bne-    $-24        \n" /* retry if index reservation is broken */
> +#ifdef USE_PPC_LWSYNC
> +    "    lwsync                \n"
> +#else
> +    "    isync                \n"
> +#endif
> +    : "=&r"(result), "=&r"(tmp), "+m"(*ptr)
> +    : "r"(mask), "r"(increment), "r"(ptr)
> +    : "memory", "cc");
> +    return result;
> +}

I'm not sure that the barrier semantics here are sufficient: there's only a
trailing lwsync/isync, i.e. an acquire-style fence, with no leading sync, so
this doesn't look like a full barrier to me.


> +/*
> + * Generic implementation of pg_atomic_fetch_mask_add_u32() via loop
> + * of compare & exchange.
> + */
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> +                                  uint32 mask_, uint32 add_)
> +{
> +    uint32        old_value;
> +
> +    /*
> +     * Read once outside the loop, later iterations will get the newer value
> +     * via compare & exchange.
> +     */
> +    old_value = pg_atomic_read_u32_impl(ptr);
> +
> +    /* loop until we've determined whether we could make an increment or not */
> +    while (true)
> +    {
> +        uint32        desired_value;
> +        bool        free;
> +
> +        desired_value = old_value;
> +        free = (old_value & mask_) == 0;
> +        if (free)
> +            desired_value += add_;
> +
> +        /*
> +         * Attempt to swap in the value we are expecting. If we didn't see
> +         * masked bits to be clear, that's just the old value. If we saw them
> +         * as clear, we'll attempt to make an increment. The reason that we
> +         * always swap in the value is that this doubles as a memory barrier.
> +         * We could try to be smarter and only swap in values if we saw the
> +         * masked bits as clear, but benchmarks haven't shown it to be
> +         * beneficial so far.
> +         *
> +         * Retry if the value changed since we last looked at it.
> +         */
> +        if (pg_atomic_compare_exchange_u32_impl(ptr, &old_value, desired_value))
> +            return old_value;
> +    }
> +    pg_unreachable();
> +}
> +#endif

Have you done x86 benchmarking?


Greetings,

Andres Freund


