Thread: anole: assorted stability problems
In HEAD only. Previous branches seem mostly clean, so there's something going wrong. Spinlocks going wrong perhaps? http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-09%2020%3A30%3A29 ! PANIC: semop(id=0) failed: Result too large ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-05%2018%3A39%3A38&stg=check ! FATAL: semop(id=0) failed: File too large ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-03%2012%3A30%3A18 ! PANIC: semop(id=-1073741824) failed: Invalid argument ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-04-29%2004%3A30%3A25 ! PANIC: stuck spinlock (c00000000d335360) detected at lwlock.c:767 ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote: > In HEAD only. Previous branches seem mostly clean, so there's something > going wrong. Spinlocks going wrong perhaps? > > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check > ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost Hm. Anole hasn't reported reliably for a while before these. It's quite possible that this is a ac++ portability problem around the atomics. There's lots of other IA64 animals not having problems, but they're not using ac++. Greetings, Andres Freund
Andres Freund wrote: > On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote: > > In HEAD only. Previous branches seem mostly clean, so there's something > > going wrong. Spinlocks going wrong perhaps? > > > > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check > > ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 > > ! server closed the connection unexpectedly > > ! This probably means the server terminated abnormally > > ! before or while processing the request. > > ! connection to server was lost > > Hm. Anole hasn't reported reliably for a while before these. It's quite > possible that this is a ac++ portability problem around the > atomics. There's lots of other IA64 animals not having problems, but > they're not using ac++. Lots? As far as I can tell, this is the only Itanium machine in the buildfarm. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > Hm. Anole hasn't reported reliably for a while before these. It's quite > > possible that this is a ac++ portability problem around the > > atomics. There's lots of other IA64 animals not having problems, but > > they're not using ac++. > > Lots? As far as I can tell, this is the only Itanium machine in the > buildfarm. Uh. I'm pretty sure there were some back when that patch went in. And there definitely used to be a couple earlier. I guess itanium really is dying (mixed bad: It's a horrible architecture, but more coverage would still be good).
Andres Freund <andres@anarazel.de> writes: > On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote: >> Andres Freund wrote: >>> Hm. Anole hasn't reported reliably for a while before these. It's quite >>> possible that this is a ac++ portability problem around the >>> atomics. There's lots of other IA64 animals not having problems, but >>> they're not using ac++. >> Lots? As far as I can tell, this is the only Itanium machine in the >> buildfarm. > Uh. I'm pretty sure there were some back when that patch went in. And > there definitely used to be a couple earlier. I guess itanium really is > dying (mixed bad: It's a horrible architecture, but more coverage would > still be good). Since that machine is run by EDB, maybe we could persuade them to set up a second critter on it that uses gcc. That would at least help narrow down whether it's a compiler-specific issue. (It's times like this that I regret not working for Red Hat any more, and having access to all their test hardware ...) regards, tom lane
On 5/20/15 3:09 PM, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2015-05-20 16:44:12 -0300, Alvaro Herrera wrote: >>> Andres Freund wrote: >>> Lots? As far as I can tell, this is the only Itanium machine in the >>> buildfarm. ... > (It's times like this that I regret not working for Red Hat any more, > and having access to all their test hardware ...) I believe pg.us could fund acquisition of older hardware for such purposes, I think it's more a question of hosting and someone to maintain them... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-05-20 16:21:57 -0300, Alvaro Herrera wrote: > In HEAD only. Previous branches seem mostly clean, so there's something > going wrong. Spinlocks going wrong perhaps? > > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-20%2016%3A30%3A26&stg=check > ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost > > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-09%2020%3A30%3A29 > ! PANIC: semop(id=0) failed: Result too large > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost > > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=anole&dt=2015-05-05%2018%3A39%3A38&stg=check > ! FATAL: semop(id=0) failed: File too large > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost > > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-05-03%2012%3A30%3A18 > ! PANIC: semop(id=-1073741824) failed: Invalid argument > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost > > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-04-29%2004%3A30%3A25 > ! PANIC: stuck spinlock (c00000000d335360) detected at lwlock.c:767 > ! server closed the connection unexpectedly > ! This probably means the server terminated abnormally > ! before or while processing the request. > ! connection to server was lost And now: ! FATAL: semop(id=-2013265921) failed: Invalid argument ! CONTEXT: SQL statement "CREATE TEMP TABLE brin_result (cid tid)" ! PL/pgSQL function inline_code_block line 20 at SQL statement ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost Uhm: void s_init_lock_sema(volatile slock_t *lock) {static int counter = 0; *lock = (++counter) % NUM_SPINLOCK_SEMAPHORES; } One problem here might be that counter is signed. Once s_init_lock_sema has been called often enough for counter to wrap around strange things will happen. But - I don't see why this codepatch would even be hit once on this platform? It's only built !HAVE_SPINLOCKS which isn't the case on anole. So this appears to be an independent bug (9.4+). One that has lead me to find an atomics bug (9.5+, stupid forgotten codepath for atomics on spinlocks on semaphores) - which again should be independent, because it's again is only relevant when spinlocks aren't used... I'll fix both. But that leaves this problem.
Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Uh. I'm pretty sure there were some back when that patch went in. And > > there definitely used to be a couple earlier. I guess itanium really is > > dying (mixed bad: It's a horrible architecture, but more coverage would > > still be good). > > Since that machine is run by EDB, maybe we could persuade them to set up > a second critter on it that uses gcc. That would at least help narrow > down whether it's a compiler-specific issue. I pinged EDB about this several days ago, and they have now set up buildfarm member gharial running on the same machine, using gcc. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote: > Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > > Uh. I'm pretty sure there were some back when that patch went in. And > > > there definitely used to be a couple earlier. I guess itanium really is > > > dying (mixed bad: It's a horrible architecture, but more coverage would > > > still be good). > > > > Since that machine is run by EDB, maybe we could persuade them to set up > > a second critter on it that uses gcc. That would at least help narrow > > down whether it's a compiler-specific issue. > > I pinged EDB about this several days ago, and they have now set up > buildfarm member gharial running on the same machine, using gcc. FWIW gharial does not show any problem whatsoever. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Alvaro Herrera wrote: >> Tom Lane wrote: >>> Andres Freund <andres@anarazel.de> writes: >>>> Uh. I'm pretty sure there were some back when that patch went in. And >>>> there definitely used to be a couple earlier. I guess itanium really is >>>> dying (mixed bad: It's a horrible architecture, but more coverage would >>>> still be good). >>> Since that machine is run by EDB, maybe we could persuade them to set up >>> a second critter on it that uses gcc. That would at least help narrow >>> down whether it's a compiler-specific issue. >> I pinged EDB about this several days ago, and they have now set up >> buildfarm member gharial running on the same machine, using gcc. > FWIW gharial does not show any problem whatsoever. I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would resolve this, but nope, anole is still getting occasional stuck spinlocks: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02 regards, tom lane
On Sun, Jun 28, 2015 at 7:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would > resolve this, but nope, anole is still getting occasional stuck spinlocks: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02 That sucks. It was easy to see that the old fallback barrier implementation wasn't re-entrant, but this one should be. And now that I look at it again, doesn't the failure message indicate that's not the problem anyway? ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 ! PANIC: stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770 That's just a straight-up SpinLockAcquire(), not a barrier call. The May 5th failure looked like this: ! FATAL: semop(id=0) failed: Result too large The May 1st failure seems to have died here: -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 28, 2015 at 9:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jun 28, 2015 at 7:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'd hoped that commit 1b468a131bd260c9041484f78b8580c7f232d580 would >> resolve this, but nope, anole is still getting occasional stuck spinlocks: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2015-06-28%2021%3A35%3A02 > > That sucks. It was easy to see that the old fallback barrier > implementation wasn't re-entrant, but this one should be. And now > that I look at it again, doesn't the failure message indicate that's > not the problem anyway? > > ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 > ! PANIC: stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770 > > That's just a straight-up SpinLockAcquire(), not a barrier call. > > The May 5th failure looked like this: > > ! FATAL: semop(id=0) failed: Result too large > > The May 1st failure seems to have died here: Ugh, hit "send" too soon. create temp table tc (id int primary key, aid int); ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost Both the "Result too large" and the lwlock.c stuck spinlock failures have been repeated a number of times. I think the "result too large" error from semop() is probably a clue - is it possible that we somehow go into an infinite loop incrementing the semaphore? What else could that error mean? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > That sucks. It was easy to see that the old fallback barrier > implementation wasn't re-entrant, but this one should be. And now > that I look at it again, doesn't the failure message indicate that's > not the problem anyway? > ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 > ! PANIC: stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770 I was assuming that a leaky memory barrier was allowing the spinlock state to become inconsistent, or at least to be perceived as inconsistent. But I'm not too clear on how the barrier changes you and Andres have been making have affected the spinlock code. regards, tom lane
On Sun, Jun 28, 2015 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> That sucks. It was easy to see that the old fallback barrier >> implementation wasn't re-entrant, but this one should be. And now >> that I look at it again, doesn't the failure message indicate that's >> not the problem anyway? > >> ! PANIC: stuck spinlock (c00000000d6f4140) detected at lwlock.c:816 >> ! PANIC: stuck spinlock (c00000000d72f6e0) detected at lwlock.c:770 > > I was assuming that a leaky memory barrier was allowing the spinlock > state to become inconsistent, or at least to be perceived as inconsistent. > But I'm not too clear on how the barrier changes you and Andres have been > making have affected the spinlock code. For the most part, they haven't. Andres did a bunch of work to add atomics support, and overhauled the barrier implementation that I committed to 9.2 along the way. But that had minimal impact on s_lock.h. What we did do that touched s_lock.h was attempt to ensure that SpinLockAcquire() and SpinLockRelease() function as compiler barriers, so that it should no longer be necessary to litter the code with "volatile" in every function that uses those. It is possible that this could be broken on HP-UX. If _Asm_sched_fence() doesn't constraint the compiler appropriately, that could explain the problems we're seeing here. But we're not the only one using that incantation, so I'm left scratching my head. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > What we did do that touched s_lock.h was attempt to ensure that > SpinLockAcquire() and SpinLockRelease() function as compiler barriers, > so that it should no longer be necessary to litter the code with > "volatile" in every function that uses those. It is possible that > this could be broken on HP-UX. If _Asm_sched_fence() doesn't > constraint the compiler appropriately, that could explain the problems > we're seeing here. But we're not the only one using that incantation, > so I'm left scratching my head. AFAICS, on non-gcc IA64, 9.4's version of S_UNLOCK defaulted to #define S_UNLOCK(lock) (*((volatile slock_t *) (lock)) = 0) whereas in HEAD, we've got #define S_UNLOCK(lock) \do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) which immediately raises the question of why omitting the "volatile" cast is okay. The comments for the gcc/icc version make it clear that the volatile qual is pretty critical for those compilers. I also wonder if we don't need a second _Asm_sched_fence() after the lock release. regards, tom lane
On 2015-06-29 00:42:53 -0400, Tom Lane wrote: > #define S_UNLOCK(lock) \ > do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? > which immediately raises the question of why omitting the "volatile" > cast is okay. Should be fine if _Asm_sched_fence() were a proper memory (or een release) barrier. Which I don't think it is. > I also wonder if we don't need a second _Asm_sched_fence() after the > lock release. Shouldn't be needed - the only thing that could be reordered is the actual lock release. Which will just impact timing in a minor manner (it can't move into another locked section). Greetings, Andres Freund
On 2015-06-29 12:11:08 +0200, Andres Freund wrote: > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: > > #define S_UNLOCK(lock) \ > > do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) > > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler > barrier? Shouldn't this be a _Asm_mf()? I've pushed a patch the _sched_fence with _mf. That's what we use in the atomics code as well. I did reread http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf and I do think _Asm_mf is the correct thing. What I did not change was the mask used for serialization - the default 0x3D3D (c.f. page 23) should be correct, even though slightly suboptimal. Andres
On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: >> #define S_UNLOCK(lock) \ >> do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) > > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler > barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-06-29 22:11:33 -0400, Robert Haas wrote: > On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: > >> #define S_UNLOCK(lock) \ > >> do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) > > > > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler > > barrier? Shouldn't this be a _Asm_mf()? > > The point of the commit was to make spinlocks act as compiler barriers > as well as CPU barriers. So I was just looking to add a compiler > barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics.
On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-29 22:11:33 -0400, Robert Haas wrote: >> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: >> >> #define S_UNLOCK(lock) \ >> >> do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) >> > >> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler >> > barrier? Shouldn't this be a _Asm_mf()? >> >> The point of the commit was to make spinlocks act as compiler barriers >> as well as CPU barriers. So I was just looking to add a compiler >> barrier to what was already there. > > You removed a volatile at the same time, and volatile on IA64 has > acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-06-29 22:45:49 -0400, Robert Haas wrote: > On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-06-29 22:11:33 -0400, Robert Haas wrote: > >> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote: > >> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: > >> >> #define S_UNLOCK(lock) \ > >> >> do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) > >> > > >> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler > >> > barrier? Shouldn't this be a _Asm_mf()? > >> > >> The point of the commit was to make spinlocks act as compiler barriers > >> as well as CPU barriers. So I was just looking to add a compiler > >> barrier to what was already there. > > > > You removed a volatile at the same time, and volatile on IA64 has > > acquire/release semantics. > > Can you explain what you mean by volatile having acquire/release > semantics? I don't see how volatile can create a CPU barrier, but I'm > guessing that it somehow can and that you're about to enlighten me. It's a IA64 pecularity. I think it started with intel's compiler, but since then at least msvc and gcc copied it. In essence volatile reads implicitly have acquire semantics, and volatile writes release. That's relatively cheap on IA64 because they have 'opcode tags' marking normal moves etc. as having release or acquire semantics (mov.rel etc.). We even have a comment about it that scratches the surface a bit: /** Intel Itanium, gcc or Intel's compiler.** Itanium has weak memory ordering, but we rely on the compiler to enforce* strictordering of accesses to volatile data. In particular, while the* xchg instruction implicitly acts as a memory barrierwith 'acquire'* semantics, we do not have an explicit memory fence instruction in the* S_UNLOCK macro. We use a regularassignment to clear the spinlock, and* trust that the compiler marks the generated store instruction with the* ".rel"opcode.** Testing shows that assumption to hold on gcc, although I could not find* any explicit statement on that inthe gcc manual. In Intel's compiler,* the -m[no-]serialize-volatile option controls that, and testing shows that* it isenabled by default.*/
On Mon, Jun 29, 2015 at 10:53 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-29 22:45:49 -0400, Robert Haas wrote: >> On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-06-29 22:11:33 -0400, Robert Haas wrote: >> >> On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund <andres@anarazel.de> wrote: >> >> > On 2015-06-29 00:42:53 -0400, Tom Lane wrote: >> >> >> #define S_UNLOCK(lock) \ >> >> >> do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) >> >> > >> >> > Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler >> >> > barrier? Shouldn't this be a _Asm_mf()? >> >> >> >> The point of the commit was to make spinlocks act as compiler barriers >> >> as well as CPU barriers. So I was just looking to add a compiler >> >> barrier to what was already there. >> > >> > You removed a volatile at the same time, and volatile on IA64 has >> > acquire/release semantics. >> >> Can you explain what you mean by volatile having acquire/release >> semantics? I don't see how volatile can create a CPU barrier, but I'm >> guessing that it somehow can and that you're about to enlighten me. > > It's a IA64 pecularity. I think it started with intel's compiler, but > since then at least msvc and gcc copied it. In essence volatile reads > implicitly have acquire semantics, and volatile writes release. That's > relatively cheap on IA64 because they have 'opcode tags' marking normal > moves etc. as having release or acquire semantics (mov.rel etc.). > > We even have a comment about it that scratches the surface a bit: > /* > * Intel Itanium, gcc or Intel's compiler. > * > * Itanium has weak memory ordering, but we rely on the compiler to enforce > * strict ordering of accesses to volatile data. In particular, while the > * xchg instruction implicitly acts as a memory barrier with 'acquire' > * semantics, we do not have an explicit memory fence instruction in the > * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and > * trust that the compiler marks the generated store instruction with the > * ".rel" opcode. > * > * Testing shows that assumption to hold on gcc, although I could not find > * any explicit statement on that in the gcc manual. In Intel's compiler, > * the -m[no-]serialize-volatile option controls that, and testing shows that > * it is enabled by default. > */ Ah. So my bad, then, removing the volatile. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund <andres@anarazel.de> wrote: >> You removed a volatile at the same time, and volatile on IA64 has >> acquire/release semantics. > Can you explain what you mean by volatile having acquire/release > semantics? I don't see how volatile can create a CPU barrier, but I'm > guessing that it somehow can and that you're about to enlighten me. It's late and I'm tired, but: gcc (and, apparently, icc) treats accesses to volatile-qualified objects as cues to emit .acq or .rel memory ordering qualifiers on IA64 instructions, per the comments in s_lock.h. I have not seen any documentation stating specifically that aCC does the same, but the buildfarm evidence is pretty clear that the 9.4 IA64-non-gcc version of S_UNLOCK worked and the up-to-now-9.5 version does not. So personally, I would be inclined to put back the volatile qualifier, independently of any fooling around with _Asm_double_magic_xyzzy calls. Or to put it differently: where is the evidence that removing the volatile qual is a good idea? regards, tom lane
On Mon, Jun 29, 2015 at 10:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:So personally, > I would be inclined to put back the volatile qualifier, independently of > any fooling around with _Asm_double_magic_xyzzy calls. Or to put it > differently: where is the evidence that removing the volatile qual is a > good idea? Personally, I have found that _Asm_double_magic_xyzzy makes is not nearly as cromulent as _Asm_triple_magic_plugh. But then, trying to figure out compiler intrinsics on strange platforms makes me feel very much like I'm in a maze of twisty little passages, all alike. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-06-29 22:58:05 -0400, Tom Lane wrote: > So personally, I would be inclined to put back the volatile qualifier, > independently of any fooling around with _Asm_double_magic_xyzzy > calls. I'm not sure. I think the reliance on an explicit memory barrier is a lot more robust and easy to understand than some barely documented odd behaviour around volatile. On the other hand the old way worked for a long while. I'm inclined to just do both on platforms as odd as IA6. But it'd like to let anole run with the current set a bit longer - if it doesn't work we have more problems than just S_UNLOCK(). It seems EDB has increased the run rate for now, so it shouldn't take too long: http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD Greetings, Andres Freund
On 2015-06-29 23:05:55 -0400, Robert Haas wrote: > Personally, I have found that _Asm_double_magic_xyzzy makes is not > nearly as cromulent as _Asm_triple_magic_plugh. But then, trying to > figure out compiler intrinsics on strange platforms makes me feel very > much like I'm in a maze of twisty little passages, all alike. ;) Sometimes it even seems easier to just use actual assembler, inline or not, than use intrinsics. But even the raw instructions, especially their coherency side effects, are often poorly documented. It's a mess. It's not helped by the fact that a lot of the web searches for topics around this by now bring up postgresql archives.
On 2015-06-30 11:35:56 +0200, Andres Freund wrote: > On 2015-06-29 22:58:05 -0400, Tom Lane wrote: > > So personally, I would be inclined to put back the volatile qualifier, > > independently of any fooling around with _Asm_double_magic_xyzzy > > calls. > > I'm not sure. I think the reliance on an explicit memory barrier is a > lot more robust and easy to understand than some barely documented odd > behaviour around volatile. On the other hand the old way worked for a > long while. > > I'm inclined to just do both on platforms as odd as IA6. But it'd like > to let anole run with the current set a bit longer - if it doesn't work > we have more problems than just S_UNLOCK(). It seems EDB has increased > the run rate for now, so it shouldn't take too long: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD So, it's starting to look good. Not exactly allowing for a lot of confidence yet, but still: http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD I'm inclined to simply revise the comments now, and *not* reintroduce the volatile. The assumptions documented in: /** Intel Itanium, gcc or Intel's compiler.** Itanium has weak memory ordering, but we rely on the compiler to enforce* strictordering of accesses to volatile data. In particular, while the* xchg instruction implicitly acts as a memory barrierwith 'acquire'* semantics, we do not have an explicit memory fence instruction in the* S_UNLOCK macro. We use a regularassignment to clear the spinlock, and* trust that the compiler marks the generated store instruction with the* ".rel"opcode.** Testing shows that assumption to hold on gcc, although I could not find* any explicit statement on that inthe gcc manual. In Intel's compiler,* the -m[no-]serialize-volatile option controls that, and testing shows that* it isenabled by default.*/ don't sound exactly bullet proof to me. I also personally find explicit barriers easier to understand in the light of all the other spinlock implementations. Comments?
On Tue, Jul 7, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-06-30 11:35:56 +0200, Andres Freund wrote: >> On 2015-06-29 22:58:05 -0400, Tom Lane wrote: >> > So personally, I would be inclined to put back the volatile qualifier, >> > independently of any fooling around with _Asm_double_magic_xyzzy >> > calls. >> >> I'm not sure. I think the reliance on an explicit memory barrier is a >> lot more robust and easy to understand than some barely documented odd >> behaviour around volatile. On the other hand the old way worked for a >> long while. >> >> I'm inclined to just do both on platforms as odd as IA6. But it'd like >> to let anole run with the current set a bit longer - if it doesn't work >> we have more problems than just S_UNLOCK(). It seems EDB has increased >> the run rate for now, so it shouldn't take too long: >> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD > > So, it's starting to look good. Not exactly allowing for a lot of > confidence yet, but still: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD > > I'm inclined to simply revise the comments now, and *not* reintroduce > the volatile. The assumptions documented in: > > /* > * Intel Itanium, gcc or Intel's compiler. > * > * Itanium has weak memory ordering, but we rely on the compiler to enforce > * strict ordering of accesses to volatile data. In particular, while the > * xchg instruction implicitly acts as a memory barrier with 'acquire' > * semantics, we do not have an explicit memory fence instruction in the > * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and > * trust that the compiler marks the generated store instruction with the > * ".rel" opcode. > * > * Testing shows that assumption to hold on gcc, although I could not find > * any explicit statement on that in the gcc manual. In Intel's compiler, > * the -m[no-]serialize-volatile option controls that, and testing shows that > * it is enabled by default. > */ > > don't sound exactly bullet proof to me. I also personally find explicit > barriers easier to understand in the light of all the other spinlock > implementations. > > Comments? I'm fine with that plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-07-07 13:25:24 +0200, Andres Freund wrote: > So, it's starting to look good. Not exactly allowing for a lot of > confidence yet, but still: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD Since there have not been any relevant failures since, I'm going to remove this issue from the open items list.
On Sun, Jul 26, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-07-07 13:25:24 +0200, Andres Freund wrote: >> So, it's starting to look good. Not exactly allowing for a lot of >> confidence yet, but still: >> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD > > Since there have not been any relevant failures since, I'm going to > remove this issue from the open items list. Woohoo! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company