Thread: castoroides spinlock failure on test_shm_mq

castoroides spinlock failure on test_shm_mq

From
Alvaro Herrera
Date:
Has anybody noticed the way castoroides is randomly failing?
 SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)),200, 3);
 
! PANIC:  stuck spinlock (100cb92f4) detected at atomics.c:30
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: castoroides spinlock failure on test_shm_mq

From
Robert Haas
Date:
On Sat, Jun 20, 2015 at 12:24 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Has anybody noticed the way castoroides is randomly failing?
>
>   SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)),200, 3);
>
> ! PANIC:  stuck spinlock (100cb92f4) detected at atomics.c:30
> ! server closed the connection unexpectedly
> !       This probably means the server terminated abnormally
> !       before or while processing the request.
> ! connection to server was lost

Yeah, Andres and I discussed it a month ago:

http://www.postgresql.org/message-id/20150527225528.GP5310@alap3.anarazel.de

I think we're going to need to try to implement real memory barriers
on all architectures we support.  It's not clear whether there's some
suitable generic fallback that we could use or whether we're going to
need something different for each case.  I had thought Andres was
planning to work on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: castoroides spinlock failure on test_shm_mq

From
Andres Freund
Date:
On 2015-06-20 09:35:39 -0400, Robert Haas wrote:
> On Sat, Jun 20, 2015 at 12:24 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Has anybody noticed the way castoroides is randomly failing?
> >
> >   SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)),200, 3);
> >
> > ! PANIC:  stuck spinlock (100cb92f4) detected at atomics.c:30
> > ! server closed the connection unexpectedly
> > !       This probably means the server terminated abnormally
> > !       before or while processing the request.
> > ! connection to server was lost
> 
> Yeah, Andres and I discussed it a month ago:
> 
> http://www.postgresql.org/message-id/20150527225528.GP5310@alap3.anarazel.de
> 
> I think we're going to need to try to implement real memory barriers
> on all architectures we support.  It's not clear whether there's some
> suitable generic fallback that we could use or whether we're going to
> need something different for each case.  I had thought Andres was
> planning to work on this.

I am. I'd posted on the other thread that I want to use
waitpid(PostmasterPid, WNOHANG) as the fallback for now. Unless somebody
protests I'm going to commit that first, wait for a while to see whether
it stabilizes the Solaris members, and then commit a better fallback for
Solaris with suncc.

Greetings,

Andres Freund
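
For readers following along, the waitpid(PostmasterPid, WNOHANG) fallback
mentioned above can be sketched roughly as follows. This is a minimal
illustration, not the code that was committed: fallback_barrier_pid is a
hypothetical stand-in for PostmasterPid, the publish/consume helpers and
their globals are invented for the example, and the sketch rests on the
assumption (not guaranteed by POSIX) that entering and leaving the kernel
orders memory accesses on the platforms that need the fallback.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for PostgreSQL's PostmasterPid global. */
pid_t fallback_barrier_pid;

/*
 * Fallback "barrier": issue a cheap system call and rely on the ASSUMPTION
 * that the kernel round trip acts as a full memory barrier. The call is
 * expected to return without waiting (WNOHANG) and usually fails with
 * ECHILD, because the target is not our child; the result is ignored.
 */
static void
fallback_memory_barrier(void)
{
    int save_errno = errno;     /* waitpid() may clobber errno */

    (void) waitpid(fallback_barrier_pid, NULL, WNOHANG);
    errno = save_errno;
}

/* Illustrative globals standing in for data kept in shared memory. */
int shared_payload;
volatile int shared_ready;

/* Writer: store the payload, then the flag, with a barrier in between. */
static void
publish(int value)
{
    shared_payload = value;
    fallback_memory_barrier();  /* payload must be visible before the flag */
    shared_ready = 1;
}

/* Reader: returns the payload once the flag is set, otherwise -1. */
static int
consume(void)
{
    if (!shared_ready)
        return -1;
    fallback_memory_barrier();  /* do not read the payload before the flag */
    return shared_payload;
}
```

A caller would set fallback_barrier_pid once at startup (for example to
getppid()) and then call publish() and consume() from the cooperating
processes. errno is saved and restored because barriers can also be placed
in code that runs inside signal handlers.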