Re: xlc atomics - Mailing list pgsql-hackers

From Noah Misch
Subject Re: xlc atomics
Date
Msg-id 20160215171129.GA347322@tornado.leadboat.com
Whole thread Raw
In response to Re: xlc atomics  (Noah Misch <noah@leadboat.com>)
Responses Re: xlc atomics
Re: xlc atomics
List pgsql-hackers
On Sat, Jul 04, 2015 at 08:07:49PM -0400, Noah Misch wrote:
> On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:
> > On 2015-07-04 18:40:41 -0400, Noah Misch wrote:
> > > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)".  Getting it working
> > > required the attached patch.
>
> > Will you apply? Having the ability to test change seems to put you in a
> > much better spot then me.
>
> I will.

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug.  Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures.  The compiler has always been xlC, and
the branch has been HEAD or 9.5.  Examples:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-15%2004%3A12%3A49
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-08%2001%3A20%3A16
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2015-12-11%2005%3A25%3A54

These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le.  The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:


https://github.com/boostorg/smart_ptr/blob/boost-1.60.0/include/boost/smart_ptr/detail/sp_counted_base_vacpp_ppc.hpp#L55

http://redmine.gromacs.org/projects/gromacs/repository/entry/src/external/thread_mpi/include/thread_mpi/atomic/xlc_ppc.h?rev=v5.1.2#L112
https://www-01.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/asm_example.html

This patch's test case would have failed about half the time under today's
generic-xlc.h.  Fast machines run it in about 1s.  A test that detects the bug
99% of the time would run far longer, hence this compromise.


Archaeology enthusiasts may enjoy the bug's changing presentation over time.
LWLockAcquire() exposed it after commit ab5194e redesigned lwlock.c in terms
of atomics.  Assert builds were unaffected for commits in [ab5194e, bbfd7ed);
atomics.h asserts fattened code enough to mask the bug.  However, removing the
AssertPointerAlignment() from pg_atomic_fetch_or_u32() would have sufficed to
surface it in assert builds.  All builds demonstrated the bug once bbfd7ed
introduced xlC pg_attribute_noreturn(), which elicits a simpler instruction
sequence around the ExceptionalCondition() call embedded in each Assert.

nm

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [WIP] ALTER ... OWNER TO ... CASCADE
Next
From: Fujii Masao
Date:
Subject: Re: Existence check for suitable index in advance when concurrently refreshing.