Thread: anole: assorted stability problems

anole: assorted stability problems

From
Alvaro Herrera
Date:
In HEAD only.  Previous branches seem mostly clean, so there's something
going wrong.  Spinlocks going wrong perhaps?

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check
! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-09%2020%3A30%3A29
! PANIC:  semop(id=0) failed: Result too large
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-05%2018%3A39%3A38&stg=check
! FATAL:  semop(id=0) failed: File too large
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-03%2012%3A30%3A18
! PANIC:  semop(id=-1073741824) failed: Invalid argument
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-04-29%2004%3A30%3A25
! PANIC:  stuck spinlock (c00000000d335360) detected at lwlock.c:767
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote:
> In HEAD only.  Previous branches seem mostly clean, so there's something
> going wrong.  Spinlocks going wrong perhaps?
> 
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check
> ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost

Hm. Anole hasn't reported reliably for a while before these. It's quite
possible that this is a ac++ portability problem around the
atomics. There's lots of other IA64 animals not having problems, but
they're not using ac++.

Greetings,

Andres Freund



Re: anole: assorted stability problems

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote:
> > In HEAD only.  Previous branches seem mostly clean, so there's something
> > going wrong.  Spinlocks going wrong perhaps?
> > 
> > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check
> > ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
> > ! server closed the connection unexpectedly
> > !     This probably means the server terminated abnormally
> > !     before or while processing the request.
> > ! connection to server was lost
> 
> Hm. Anole hasn't reported reliably for a while before these. It's quite
> possible that this is a ac++ portability problem around the
> atomics. There's lots of other IA64 animals not having problems, but
> they're not using ac++.

Lots?  As far as I can tell, this is the only Itanium machine in the
buildfarm.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Hm. Anole hasn't reported reliably for a while before these. It's quite
> > possible that this is a ac++ portability problem around the
> > atomics. There's lots of other IA64 animals not having problems, but
> > they're not using ac++.
> 
> Lots?  As far as I can tell, this is the only Itanium machine in the
> buildfarm.

Uh. I'm pretty sure there were some back when that patch went in. And
there definitely used to be a couple earlier. I guess itanium really is
dying (mixed bad: It's a horrible architecture, but more coverage would
still be good).



Re: anole: assorted stability problems

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote:
>> Andres Freund wrote:
>>> Hm. Anole hasn't reported reliably for a while before these. It's quite
>>> possible that this is a ac++ portability problem around the
>>> atomics. There's lots of other IA64 animals not having problems, but
>>> they're not using ac++.

>> Lots?  As far as I can tell, this is the only Itanium machine in the
>> buildfarm.

> Uh. I'm pretty sure there were some back when that patch went in. And
> there definitely used to be a couple earlier. I guess itanium really is
> dying (mixed bad: It's a horrible architecture, but more coverage would
> still be good).

Since that machine is run by EDB, maybe we could persuade them to set up
a second critter on it that uses gcc.  That would at least help narrow
down whether it's a compiler-specific issue.

(It's times like this that I regret not working for Red Hat any more,
and having access to all their test hardware ...)
        regards, tom lane



Re: anole: assorted stability problems

From
Jim Nasby
Date:
On 5/20/15 3:09 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote:
>>> Andres Freund wrote:
>>> Lots?  As far as I can tell, this is the only Itanium machine in the
>>> buildfarm.
...
> (It's times like this that I regret not working for Red Hat any more,
> and having access to all their test hardware ...)

I believe pg.us could fund acquisition of older hardware for such 
purposes, I think it's more a question of hosting and someone to 
maintain them...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote:
> In HEAD only.  Previous branches seem mostly clean, so there's something
> going wrong.  Spinlocks going wrong perhaps?
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check
> ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-09%2020%3A30%3A29
> ! PANIC:  semop(id=0) failed: Result too large
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-05%2018%3A39%3A38&stg=check
> ! FATAL:  semop(id=0) failed: File too large
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-03%2012%3A30%3A18
> ! PANIC:  semop(id=-1073741824) failed: Invalid argument
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-04-29%2004%3A30%3A25
> ! PANIC:  stuck spinlock (c00000000d335360) detected at lwlock.c:767
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost

And now:

! FATAL:  semop(id=-2013265921) failed: Invalid argument
! CONTEXT:  SQL statement "CREATE TEMP TABLE brin_result (cid tid)"
! PL/pgSQL function inline_code_block line 20 at SQL statement
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

Uhm:

void
s_init_lock_sema(volatile slock_t *lock)
{static int    counter = 0;
*lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
}

One problem here might be that counter is signed. Once s_init_lock_sema
has been called often enough for counter to wrap around strange things
will happen.  But - I don't see why this codepatch would even be hit
once on this platform? It's only built !HAVE_SPINLOCKS which isn't the
case on anole.  So this appears to be an independent bug (9.4+).

One that has lead me to find an atomics bug (9.5+, stupid forgotten
codepath for atomics on spinlocks on semaphores) - which again should be
independent, because it's again is only relevant when spinlocks aren't
used...

I'll fix both.


But that leaves this problem.



Re: anole: assorted stability problems

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:

> > Uh. I'm pretty sure there were some back when that patch went in. And
> > there definitely used to be a couple earlier. I guess itanium really is
> > dying (mixed bad: It's a horrible architecture, but more coverage would
> > still be good).
> 
> Since that machine is run by EDB, maybe we could persuade them to set up
> a second critter on it that uses gcc.  That would at least help narrow
> down whether it's a compiler-specific issue.

I pinged EDB about this several days ago, and they have now set up
buildfarm member gharial running on the same machine, using gcc.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: anole: assorted stability problems

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> 
> > > Uh. I'm pretty sure there were some back when that patch went in. And
> > > there definitely used to be a couple earlier. I guess itanium really is
> > > dying (mixed bad: It's a horrible architecture, but more coverage would
> > > still be good).
> > 
> > Since that machine is run by EDB, maybe we could persuade them to set up
> > a second critter on it that uses gcc.  That would at least help narrow
> > down whether it's a compiler-specific issue.
> 
> I pinged EDB about this several days ago, and they have now set up
> buildfarm member gharial running on the same machine, using gcc.

FWIW gharial does not show any problem whatsoever.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: anole: assorted stability problems

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> Uh. I'm pretty sure there were some back when that patch went in. And
>>>> there definitely used to be a couple earlier. I guess itanium really is
>>>> dying (mixed bad: It's a horrible architecture, but more coverage would
>>>> still be good).

>>> Since that machine is run by EDB, maybe we could persuade them to set up
>>> a second critter on it that uses gcc.  That would at least help narrow
>>> down whether it's a compiler-specific issue.

>> I pinged EDB about this several days ago, and they have now set up
>> buildfarm member gharial running on the same machine, using gcc.

> FWIW gharial does not show any problem whatsoever.

I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would
resolve this, but nope, anole is still getting occasional stuck spinlocks:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02
        regards, tom lane



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Sun, Jun 28, 2015 at 7:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would
> resolve this, but nope, anole is still getting occasional stuck spinlocks:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02

That sucks.  It was easy to see that the old fallback barrier
implementation wasn't re-entrant, but this one should be.  And now
that I look at it again, doesn't the failure message indicate that's
not the problem anyway?

! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
! PANIC:  stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770

That's just a straight-up SpinLockAcquire(), not a barrier call.

The May 5th failure looked like this:

! FATAL:  semop(id=0) failed: Result too large

The May 1st failure seems to have died here:

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Sun, Jun 28, 2015 at 9:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 28, 2015 at 7:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would
>> resolve this, but nope, anole is still getting occasional stuck spinlocks:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02
>
> That sucks.  It was easy to see that the old fallback barrier
> implementation wasn't re-entrant, but this one should be.  And now
> that I look at it again, doesn't the failure message indicate that's
> not the problem anyway?
>
> ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
> ! PANIC:  stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770
>
> That's just a straight-up SpinLockAcquire(), not a barrier call.
>
> The May 5th failure looked like this:
>
> ! FATAL:  semop(id=0) failed: Result too large
>
> The May 1st failure seems to have died here:

Ugh, hit "send" too soon.
 create temp table tc (id int primary key, aid int);
! server closed the connection unexpectedly
! This probably means the server terminated abnormally
! before or while processing the request.
! connection to server was lost

Both the "Result too large" and the lwlock.c stuck spinlock failures
have been repeated a number of times.  I think the "result too large"
error from semop() is probably a clue - is it possible that we somehow
go into an infinite loop incrementing the semaphore?  What else could
that error mean?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That sucks.  It was easy to see that the old fallback barrier
> implementation wasn't re-entrant, but this one should be.  And now
> that I look at it again, doesn't the failure message indicate that's
> not the problem anyway?

> ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
> ! PANIC:  stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770

I was assuming that a leaky memory barrier was allowing the spinlock
state to become inconsistent, or at least to be perceived as inconsistent.
But I'm not too clear on how the barrier changes you and Andres have been
making have affected the spinlock code.
        regards, tom lane



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Sun, Jun 28, 2015 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That sucks.  It was easy to see that the old fallback barrier
>> implementation wasn't re-entrant, but this one should be.  And now
>> that I look at it again, doesn't the failure message indicate that's
>> not the problem anyway?
>
>> ! PANIC:  stuck spinlock (c00000000d6f4140) detected at lwlock.c:816
>> ! PANIC:  stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770
>
> I was assuming that a leaky memory barrier was allowing the spinlock
> state to become inconsistent, or at least to be perceived as inconsistent.
> But I'm not too clear on how the barrier changes you and Andres have been
> making have affected the spinlock code.

For the most part, they haven't.  Andres did a bunch of work to add
atomics support, and overhauled the barrier implementation that I
committed to 9.2 along the way.  But that had minimal impact on
s_lock.h.

What we did do that touched s_lock.h was attempt to ensure that
SpinLockAcquire() and SpinLockRelease() function as compiler barriers,
so that it should no longer be necessary to litter the code with
"volatile" in every function that uses those.  It is possible that
this could be broken on HP-UX.  If _Asm_sched_fence() doesn't
constraint the compiler appropriately, that could explain the problems
we're seeing here.  But we're not the only one using that incantation,
so I'm left scratching my head.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> What we did do that touched s_lock.h was attempt to ensure that
> SpinLockAcquire() and SpinLockRelease() function as compiler barriers,
> so that it should no longer be necessary to litter the code with
> "volatile" in every function that uses those.  It is possible that
> this could be broken on HP-UX.  If _Asm_sched_fence() doesn't
> constraint the compiler appropriately, that could explain the problems
> we're seeing here.  But we're not the only one using that incantation,
> so I'm left scratching my head.

AFAICS, on non-gcc IA64, 9.4's version of S_UNLOCK defaulted to

#define S_UNLOCK(lock)        (*((volatile slock_t *) (lock)) = 0)

whereas in HEAD, we've got

#define S_UNLOCK(lock)    \do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)

which immediately raises the question of why omitting the "volatile"
cast is okay.  The comments for the gcc/icc version make it clear that
the volatile qual is pretty critical for those compilers.  I also wonder
if we don't need a second _Asm_sched_fence() after the lock release.
        regards, tom lane



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> #define S_UNLOCK(lock)    \
>     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)

Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
barrier? Shouldn't this be a _Asm_mf()?

> which immediately raises the question of why omitting the "volatile"
> cast is okay.

Should be fine if _Asm_sched_fence() were a proper memory (or een
release) barrier. Which I don't think it is.


> I also wonder if we don't need a second _Asm_sched_fence() after the
> lock release.

Shouldn't be needed - the only thing that could be reordered is the
actual lock release. Which will just impact timing in a minor manner (it
can't move into another locked section).

Greetings,

Andres Freund



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 12:11:08 +0200, Andres Freund wrote:
> On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> > #define S_UNLOCK(lock)    \
> >     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
> 
> Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
> barrier? Shouldn't this be a _Asm_mf()?

I've pushed a patch the _sched_fence with _mf. That's what we use in the
atomics code as well. I did reread
http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
and I do think _Asm_mf is the correct thing.

What I did not change was the mask used for serialization - the default
0x3D3D (c.f. page 23) should be correct, even though slightly
suboptimal.

Andres



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
>> #define S_UNLOCK(lock)        \
>>       do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
>
> Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
> barrier? Shouldn't this be a _Asm_mf()?

The point of the commit was to make spinlocks act as compiler barriers
as well as CPU barriers.  So I was just looking to add a compiler
barrier to what was already there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 22:11:33 -0400, Robert Haas wrote:
> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> >> #define S_UNLOCK(lock)        \
> >>       do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
> >
> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
> > barrier? Shouldn't this be a _Asm_mf()?
> 
> The point of the commit was to make spinlocks act as compiler barriers
> as well as CPU barriers.  So I was just looking to add a compiler
> barrier to what was already there.

You removed a volatile at the same time, and volatile on IA64 has
acquire/release semantics.



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-29 22:11:33 -0400, Robert Haas wrote:
>> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
>> >> #define S_UNLOCK(lock)        \
>> >>       do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
>> >
>> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
>> > barrier? Shouldn't this be a _Asm_mf()?
>>
>> The point of the commit was to make spinlocks act as compiler barriers
>> as well as CPU barriers.  So I was just looking to add a compiler
>> barrier to what was already there.
>
> You removed a volatile at the same time, and volatile on IA64 has
> acquire/release semantics.

Can you explain what you mean by volatile having acquire/release
semantics?  I don't see how volatile can create a CPU barrier, but I'm
guessing that it somehow can and that you're about to enlighten me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 22:45:49 -0400, Robert Haas wrote:
> On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-06-29 22:11:33 -0400, Robert Haas wrote:
> >> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> >> >> #define S_UNLOCK(lock)        \
> >> >>       do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
> >> >
> >> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
> >> > barrier? Shouldn't this be a _Asm_mf()?
> >>
> >> The point of the commit was to make spinlocks act as compiler barriers
> >> as well as CPU barriers.  So I was just looking to add a compiler
> >> barrier to what was already there.
> >
> > You removed a volatile at the same time, and volatile on IA64 has
> > acquire/release semantics.
> 
> Can you explain what you mean by volatile having acquire/release
> semantics?  I don't see how volatile can create a CPU barrier, but I'm
> guessing that it somehow can and that you're about to enlighten me.

It's a IA64 pecularity. I think it started with intel's compiler, but
since then at least msvc and gcc copied it. In essence volatile reads
implicitly have acquire semantics, and volatile writes release. That's
relatively cheap on IA64 because they have 'opcode tags' marking normal
moves etc. as having release or acquire semantics (mov.rel etc.).

We even have a comment about it that scratches the surface a bit:
/** Intel Itanium, gcc or Intel's compiler.** Itanium has weak memory ordering, but we rely on the compiler to enforce*
strictordering of accesses to volatile data.  In particular, while the* xchg instruction implicitly acts as a memory
barrierwith 'acquire'* semantics, we do not have an explicit memory fence instruction in the* S_UNLOCK macro.  We use a
regularassignment to clear the spinlock, and* trust that the compiler marks the generated store instruction with the*
".rel"opcode.** Testing shows that assumption to hold on gcc, although I could not find* any explicit statement on that
inthe gcc manual.  In Intel's compiler,* the -m[no-]serialize-volatile option controls that, and testing shows that* it
isenabled by default.*/
 




Re: anole: assorted stability problems

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 10:53 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-29 22:45:49 -0400, Robert Haas wrote:
>> On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2015-06-29 22:11:33 -0400, Robert Haas wrote:
>> >> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote:
>> >> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
>> >> >> #define S_UNLOCK(lock)        \
>> >> >>       do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
>> >> >
>> >> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler
>> >> > barrier? Shouldn't this be a _Asm_mf()?
>> >>
>> >> The point of the commit was to make spinlocks act as compiler barriers
>> >> as well as CPU barriers.  So I was just looking to add a compiler
>> >> barrier to what was already there.
>> >
>> > You removed a volatile at the same time, and volatile on IA64 has
>> > acquire/release semantics.
>>
>> Can you explain what you mean by volatile having acquire/release
>> semantics?  I don't see how volatile can create a CPU barrier, but I'm
>> guessing that it somehow can and that you're about to enlighten me.
>
> It's a IA64 pecularity. I think it started with intel's compiler, but
> since then at least msvc and gcc copied it. In essence volatile reads
> implicitly have acquire semantics, and volatile writes release. That's
> relatively cheap on IA64 because they have 'opcode tags' marking normal
> moves etc. as having release or acquire semantics (mov.rel etc.).
>
> We even have a comment about it that scratches the surface a bit:
> /*
>  * Intel Itanium, gcc or Intel's compiler.
>  *
>  * Itanium has weak memory ordering, but we rely on the compiler to enforce
>  * strict ordering of accesses to volatile data.  In particular, while the
>  * xchg instruction implicitly acts as a memory barrier with 'acquire'
>  * semantics, we do not have an explicit memory fence instruction in the
>  * S_UNLOCK macro.  We use a regular assignment to clear the spinlock, and
>  * trust that the compiler marks the generated store instruction with the
>  * ".rel" opcode.
>  *
>  * Testing shows that assumption to hold on gcc, although I could not find
>  * any explicit statement on that in the gcc manual.  In Intel's compiler,
>  * the -m[no-]serialize-volatile option controls that, and testing shows that
>  * it is enabled by default.
>  */

Ah.  So my bad, then, removing the volatile.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> You removed a volatile at the same time, and volatile on IA64 has
>> acquire/release semantics.

> Can you explain what you mean by volatile having acquire/release
> semantics?  I don't see how volatile can create a CPU barrier, but I'm
> guessing that it somehow can and that you're about to enlighten me.

It's late and I'm tired, but: gcc (and, apparently, icc) treats accesses
to volatile-qualified objects as cues to emit .acq or .rel memory ordering
qualifiers on IA64 instructions, per the comments in s_lock.h.  I have not
seen any documentation stating specifically that aCC does the same, but
the buildfarm evidence is pretty clear that the 9.4 IA64-non-gcc version
of S_UNLOCK worked and the up-to-now-9.5 version does not.  So personally,
I would be inclined to put back the volatile qualifier, independently of
any fooling around with _Asm_double_magic_xyzzy calls.  Or to put it
differently: where is the evidence that removing the volatile qual is a
good idea?
        regards, tom lane



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 10:58 PM, Tom Lane <tgl@sss.pgh.pa.us>
wrote:So personally,
> I would be inclined to put back the volatile qualifier, independently of
> any fooling around with _Asm_double_magic_xyzzy calls.  Or to put it
> differently: where is the evidence that removing the volatile qual is a
> good idea?

Personally, I have found that _Asm_double_magic_xyzzy makes is not
nearly as cromulent as _Asm_triple_magic_plugh.  But then, trying to
figure out compiler intrinsics on strange platforms makes me feel very
much like I'm in a maze of twisty little passages, all alike.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 22:58:05 -0400, Tom Lane wrote:
> So personally, I would be inclined to put back the volatile qualifier,
> independently of any fooling around with _Asm_double_magic_xyzzy
> calls.

I'm not sure. I think the reliance on an explicit memory barrier is a
lot more robust and easy to understand than some barely documented odd
behaviour around volatile. On the other hand the old way worked for a
long while.

I'm inclined to just do both on platforms as odd as IA6. But it'd like
to let anole run with the current set a bit longer - if it doesn't work
we have more problems than just S_UNLOCK(). It seems EDB has increased
the run rate for now, so it shouldn't take too long:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Greetings,

Andres Freund



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-29 23:05:55 -0400, Robert Haas wrote:
> Personally, I have found that _Asm_double_magic_xyzzy makes is not
> nearly as cromulent as _Asm_triple_magic_plugh.  But then, trying to
> figure out compiler intrinsics on strange platforms makes me feel very
> much like I'm in a maze of twisty little passages, all alike.

;)

Sometimes it even seems easier to just use actual assembler, inline or
not, than use intrinsics. But even the raw instructions, especially
their coherency side effects, are often poorly documented. It's a
mess. It's not helped by the fact that a lot of the web searches for
topics around this by now bring up postgresql archives.



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-06-30 11:35:56 +0200, Andres Freund wrote:
> On 2015-06-29 22:58:05 -0400, Tom Lane wrote:
> > So personally, I would be inclined to put back the volatile qualifier,
> > independently of any fooling around with _Asm_double_magic_xyzzy
> > calls.
>
> I'm not sure. I think the reliance on an explicit memory barrier is a
> lot more robust and easy to understand than some barely documented odd
> behaviour around volatile. On the other hand the old way worked for a
> long while.
>
> I'm inclined to just do both on platforms as odd as IA6. But it'd like
> to let anole run with the current set a bit longer - if it doesn't work
> we have more problems than just S_UNLOCK(). It seems EDB has increased
> the run rate for now, so it shouldn't take too long:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

So, it's starting to look good. Not exactly allowing for a lot of
confidence yet, but still:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

I'm inclined to simply revise the comments now, and *not* reintroduce
the volatile. The assumptions documented in:

/** Intel Itanium, gcc or Intel's compiler.** Itanium has weak memory ordering, but we rely on the compiler to enforce*
strictordering of accesses to volatile data.  In particular, while the* xchg instruction implicitly acts as a memory
barrierwith 'acquire'* semantics, we do not have an explicit memory fence instruction in the* S_UNLOCK macro.  We use a
regularassignment to clear the spinlock, and* trust that the compiler marks the generated store instruction with the*
".rel"opcode.** Testing shows that assumption to hold on gcc, although I could not find* any explicit statement on that
inthe gcc manual.  In Intel's compiler,* the -m[no-]serialize-volatile option controls that, and testing shows that* it
isenabled by default.*/
 

don't sound exactly bullet proof to me. I also personally find explicit
barriers easier to understand in the light of all the other spinlock
implementations.

Comments?



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Tue, Jul 7, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-30 11:35:56 +0200, Andres Freund wrote:
>> On 2015-06-29 22:58:05 -0400, Tom Lane wrote:
>> > So personally, I would be inclined to put back the volatile qualifier,
>> > independently of any fooling around with _Asm_double_magic_xyzzy
>> > calls.
>>
>> I'm not sure. I think the reliance on an explicit memory barrier is a
>> lot more robust and easy to understand than some barely documented odd
>> behaviour around volatile. On the other hand the old way worked for a
>> long while.
>>
>> I'm inclined to just do both on platforms as odd as IA6. But it'd like
>> to let anole run with the current set a bit longer - if it doesn't work
>> we have more problems than just S_UNLOCK(). It seems EDB has increased
>> the run rate for now, so it shouldn't take too long:
>> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD
>
> So, it's starting to look good. Not exactly allowing for a lot of
> confidence yet, but still:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD
>
> I'm inclined to simply revise the comments now, and *not* reintroduce
> the volatile. The assumptions documented in:
>
> /*
>  * Intel Itanium, gcc or Intel's compiler.
>  *
>  * Itanium has weak memory ordering, but we rely on the compiler to enforce
>  * strict ordering of accesses to volatile data.  In particular, while the
>  * xchg instruction implicitly acts as a memory barrier with 'acquire'
>  * semantics, we do not have an explicit memory fence instruction in the
>  * S_UNLOCK macro.  We use a regular assignment to clear the spinlock, and
>  * trust that the compiler marks the generated store instruction with the
>  * ".rel" opcode.
>  *
>  * Testing shows that assumption to hold on gcc, although I could not find
>  * any explicit statement on that in the gcc manual.  In Intel's compiler,
>  * the -m[no-]serialize-volatile option controls that, and testing shows that
>  * it is enabled by default.
>  */
>
> don't sound exactly bullet proof to me. I also personally find explicit
> barriers easier to understand in the light of all the other spinlock
> implementations.
>
> Comments?

I'm fine with that plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: anole: assorted stability problems

From
Andres Freund
Date:
On 2015-07-07 13:25:24 +0200, Andres Freund wrote:
> So, it's starting to look good. Not exactly allowing for a lot of
> confidence yet, but still:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Since there have not been any relevant failures since, I'm going to
remove this issue from the open items list.



Re: anole: assorted stability problems

From
Robert Haas
Date:
On Sun, Jul 26, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-07 13:25:24 +0200, Andres Freund wrote:
>> So, it's starting to look good. Not exactly allowing for a lot of
>> confidence yet, but still:
>> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD
>
> Since there have not been any relevant failures since, I'm going to
> remove this issue from the open items list.

Woohoo!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company