Thread: problems on Solaris

problems on Solaris

From

Andrew Dunstan

Date:

24 May 2015, 23:44:47

Buildfarm members casteroides and protosciurus have been having some 
problems that seem puzzling. These animals both run on the same machine, 
but with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the 
last 3 days demonstrated this error:
   [5561ce0c.51b7:25] LOG:  starting background worker process "test_shm_mq"
   [5561ce1e.5287:9] PANIC:  stuck spinlock (100cb77f4) detected at atomics.c:30
   [5561ce1e.5287:10] STATEMENT:  SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
   [5561ce0c.51b7:26] LOG:  server process (PID 21127) was terminated by signal 6
   [5561ce0c.51b7:27] DETAIL:  Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
   [5561ce0c.51b7:28] LOG:  terminating any other active server processes
 

It's not constant - between the two failures was a success.

protosciurus runs with gcc 3.4.3 and gets this error:
   gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include   -c -o specparse.o specparse.c
   In file included from /usr/include/sys/vnode.h:47,
                    from /usr/include/sys/stream.h:22,
                    from /usr/include/netinet/in.h:66,
                    from /usr/include/netdb.h:98,
                    from ../../../src/include/port.h:17,
                    from ../../../src/include/c.h:1114,
                    from ../../../src/include/postgres_fe.h:25,
                    from specparse.y:13:
   /usr/include/sys/kstat.h:439: error: syntax error before numeric constant
   /usr/include/sys/kstat.h:463: error: syntax error before '}' token
   /usr/include/sys/kstat.h:464: error: syntax error before '}' token
   In file included from /usr/include/sys/stream.h:22,
                    from /usr/include/netinet/in.h:66,
                    from /usr/include/netdb.h:98,
                    from ../../../src/include/port.h:17,
                    from ../../../src/include/c.h:1114,
                    from ../../../src/include/postgres_fe.h:25,
                    from specparse.y:13:
   /usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"
   /usr/include/sys/vnode.h:107: error: syntax error before "nread"
   /usr/include/sys/vnode.h:108: error: syntax error before "read_bytes"
   /usr/include/sys/vnode.h:109: error: syntax error before "nwrite"
   /usr/include/sys/vnode.h:110: error: syntax error before "write_bytes"
   /usr/include/sys/vnode.h:111: error: syntax error before "nioctl"
   /usr/include/sys/vnode.h:112: error: syntax error before "nsetfl"
   /usr/include/sys/vnode.h:113: error: syntax error before "ngetattr"
   /usr/include/sys/vnode.h:114: error: syntax error before "nsetattr"
   /usr/include/sys/vnode.h:115: error: syntax error before "naccess"
   /usr/include/sys/vnode.h:116: error: syntax error before "nlookup"
   /usr/include/sys/vnode.h:117: error: syntax error before "ncreate"
   /usr/include/sys/vnode.h:118: error: syntax error before "nremove"
   /usr/include/sys/vnode.h:119: error: syntax error before "nlink"
   /usr/include/sys/vnode.h:120: error: syntax error before "nrename"
   /usr/include/sys/vnode.h:121: error: syntax error before "nmkdir"
   /usr/include/sys/vnode.h:122: error: syntax error before "nrmdir"
   /usr/include/sys/vnode.h:123: error: syntax error before "nreaddir"
   /usr/include/sys/vnode.h:124: error: syntax error before "readdir_bytes"
   /usr/include/sys/vnode.h:125: error: syntax error before "nsymlink"
   /usr/include/sys/vnode.h:126: error: syntax error before "nreadlink"
   /usr/include/sys/vnode.h:127: error: syntax error before "nfsync"
   /usr/include/sys/vnode.h:128: error: syntax error before "ninactive"
   /usr/include/sys/vnode.h:129: error: syntax error before "nfid"
   /usr/include/sys/vnode.h:130: error: syntax error before "nrwlock"
   /usr/include/sys/vnode.h:131: error: syntax error before "nrwunlock"
   /usr/include/sys/vnode.h:132: error: syntax error before "nseek"
   /usr/include/sys/vnode.h:133: error: syntax error before "ncmp"
   /usr/include/sys/vnode.h:134: error: syntax error before "nfrlock"
   /usr/include/sys/vnode.h:135: error: syntax error before "nspace"
   /usr/include/sys/vnode.h:136: error: syntax error before "nrealvp"
   /usr/include/sys/vnode.h:137: error: syntax error before "ngetpage"
   /usr/include/sys/vnode.h:138: error: syntax error before "nputpage"
   /usr/include/sys/vnode.h:139: error: syntax error before "nmap"
   /usr/include/sys/vnode.h:140: error: syntax error before "naddmap"
   /usr/include/sys/vnode.h:141: error: syntax error before "ndelmap"
   /usr/include/sys/vnode.h:142: error: syntax error before "npoll"
   /usr/include/sys/vnode.h:143: error: syntax error before "ndump"
   /usr/include/sys/vnode.h:144: error: syntax error before "npathconf"
   /usr/include/sys/vnode.h:145: error: syntax error before "npageio"
   /usr/include/sys/vnode.h:146: error: syntax error before "ndumpctl"
   /usr/include/sys/vnode.h:147: error: syntax error before "ndispose"
   /usr/include/sys/vnode.h:148: error: syntax error before "nsetsecattr"
   /usr/include/sys/vnode.h:149: error: syntax error before "ngetsecattr"
   /usr/include/sys/vnode.h:150: error: syntax error before "nshrlock"
   /usr/include/sys/vnode.h:151: error: syntax error before "nvnevent"
   gmake: *** [specparse.o] Error 1
 

cheers

andrew

Re: problems on Solaris

From

Andres Freund

Date:

25 May 2015, 00:07:56

On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:
> 
> Buildfarm members casteroides and protosciurus have been having some
> problems that seem puzzling. These animals both run on the same machine, but
> with different compilers.
> 
> casteroides runs with the Sun Studio 12 compiler, and has twice in the last
> 3 days demonstrated this error:
> 
>    [5561ce0c.51b7:25] LOG:  starting background worker process "test_shm_mq"
>    [5561ce1e.5287:9] PANIC:  stuck spinlock (100cb77f4) detected at atomics.c:30
>    [5561ce1e.5287:10] STATEMENT:  SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>    [5561ce0c.51b7:26] LOG:  server process (PID 21127) was terminated by signal 6
>    [5561ce0c.51b7:27] DETAIL:  Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>    [5561ce0c.51b7:28] LOG:  terminating any other active server processes
> 
> It's not constant - between the two failures was a success.

That's indeed rather odd. For one the relevant code does nothing but
lock/unlock a spinlock. For another, there's been no recent change to
this and casteroides has been running happily for a long time.

> protosciurus runs with gcc 3.4.3 and gets this error:
> 
>    gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include   -c -o specparse.o specparse.c
>    In file included from /usr/include/sys/vnode.h:47,
>                      from /usr/include/sys/stream.h:22,
>                      from /usr/include/netinet/in.h:66,
>                      from /usr/include/netdb.h:98,
>                      from ../../../src/include/port.h:17,
>                      from ../../../src/include/c.h:1114,
>                      from ../../../src/include/postgres_fe.h:25,
>                      from specparse.y:13:
>    /usr/include/sys/kstat.h:439: error: syntax error before numeric constant
>    /usr/include/sys/kstat.h:463: error: syntax error before '}' token
>    /usr/include/sys/kstat.h:464: error: syntax error before '}' token
>    In file included from /usr/include/sys/stream.h:22,
>                      from /usr/include/netinet/in.h:66,
>                      from /usr/include/netdb.h:98,
>                      from ../../../src/include/port.h:17,
>                      from ../../../src/include/c.h:1114,
>                      from ../../../src/include/postgres_fe.h:25,
>                      from specparse.y:13:
>    /usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"

I'd noticed this one as well. This sounds like an installation problem,
not really ours. Dave, any chance you could look into this, or give
somebody an account to test what's up?

Greetings,

Andres Freund

Re: problems on Solaris

From

Andrew Dunstan

Date:

25 May 2015, 01:02:10

On 05/24/2015 08:07 PM, Andres Freund wrote:
> On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:
>> Buildfarm members casteroides and protosciurus have been having some
>> problems that seem puzzling. These animals both run on the same machine, but
>> with different compilers.
>>
>> casteroides runs with the Sun Studio 12 compiler, and has twice in the last
>> 3 days demonstrated this error:
>>
>>     [5561ce0c.51b7:25] LOG:  starting background worker process "test_shm_mq"
>>     [5561ce1e.5287:9] PANIC:  stuck spinlock (100cb77f4) detected at atomics.c:30
>>     [5561ce1e.5287:10] STATEMENT:  SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>>     [5561ce0c.51b7:26] LOG:  server process (PID 21127) was terminated by signal 6
>>     [5561ce0c.51b7:27] DETAIL:  Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>>     [5561ce0c.51b7:28] LOG:  terminating any other active server processes
>>
>> It's not constant - between the two failures was a success.
> That's indeed rather odd. For one the relevant code does nothing but
> lock/unlock a spinlock. For another, there's been no recent change to
> this and casteroides has been running happily for a long time.
>
>


Yes, but it wasn't running these tests until a few days ago when its 
buildfarm software was upgraded.

cheers

andrew

Re: problems on Solaris

From

Andres Freund

Date:

25 May 2015, 01:17:28

On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:
> Yes, but it wasn't running these tests until a few days ago when its
> buildfarm software was upgraded.

But barriers are used in other places too...

Re: problems on Solaris

From

Stefan Kaltenbrunner

Date:

25 May 2015, 07:12:52

On 05/25/2015 03:17 AM, Andres Freund wrote:
> On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:
>> Yes, but it wasn't running these tests until a few days ago when its
>> buildfarm software was upgraded.
> 
> But barriers are used in other places too...

fwiw: spoonbill just failed in the same part of the regression tests
(and it is a Sparc64 box though not running solaris):


http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07



Stefan

Re: problems on Solaris

From

Andres Freund

Date:

26 May 2015, 02:05:20

On 2015-05-25 09:12:35 +0200, Stefan Kaltenbrunner wrote:
> On 05/25/2015 03:17 AM, Andres Freund wrote:
> > On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:
> >> Yes, but it wasn't running these tests until a few days ago when its
> >> buildfarm software was upgraded.
> > 
> > But barriers are used in other places too...
> 
> fwiw: spoonbill just failed in the same part of the regression tests
> (and it is a Sparc64 box though not running solaris):
> 
> 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07

With a quite different error though: PANIC:  ERRORDATA_STACK_SIZE exceeded

Hm. So we have an *occasional* stack size exceeded failure and an
occasional spinlock error in test_shm_mq. I'm inclined to think that
this is a shm_mq problem, and not a more general locking problem - it
seems likely, but not guaranteed, that that'd have materialized
elsewhere.

Robert: IIRC there were some problems with shm_mq tests being stuck
before, right?

Greetings,

Andres Freund

Re: problems on Solaris

From

Dave Page

Date:

26 May 2015, 08:10:34

On Mon, May 25, 2015 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:
>>
>> Buildfarm members casteroides and protosciurus have been having some
>> problems that seem puzzling. These animals both run on the same machine, but
>> with different compilers.
>>
>> casteroides runs with the Sun Studio 12 compiler, and has twice in the last
>> 3 days demonstrated this error:
>>
>>    [5561ce0c.51b7:25] LOG:  starting background worker process "test_shm_mq"
>>    [5561ce1e.5287:9] PANIC:  stuck spinlock (100cb77f4) detected at atomics.c:30
>>    [5561ce1e.5287:10] STATEMENT:  SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>>    [5561ce0c.51b7:26] LOG:  server process (PID 21127) was terminated by signal 6
>>    [5561ce0c.51b7:27] DETAIL:  Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>>    [5561ce0c.51b7:28] LOG:  terminating any other active server processes
>>
>> It's not constant - between the two failures was a success.
>
> That's indeed rather odd. For one the relevant code does nothing but
> lock/unlock a spinlock. For another, there's been no recent change to
> this and casteroides has been running happily for a long time.
>
>> protosciurus runs with gcc 3.4.3 and gets this error:
>>
>>    gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include   -c -o specparse.o specparse.c
>>    In file included from /usr/include/sys/vnode.h:47,
>>                      from /usr/include/sys/stream.h:22,
>>                      from /usr/include/netinet/in.h:66,
>>                      from /usr/include/netdb.h:98,
>>                      from ../../../src/include/port.h:17,
>>                      from ../../../src/include/c.h:1114,
>>                      from ../../../src/include/postgres_fe.h:25,
>>                      from specparse.y:13:
>>    /usr/include/sys/kstat.h:439: error: syntax error before numeric constant
>>    /usr/include/sys/kstat.h:463: error: syntax error before '}' token
>>    /usr/include/sys/kstat.h:464: error: syntax error before '}' token
>>    In file included from /usr/include/sys/stream.h:22,
>>                      from /usr/include/netinet/in.h:66,
>>                      from /usr/include/netdb.h:98,
>>                      from ../../../src/include/port.h:17,
>>                      from ../../../src/include/c.h:1114,
>>                      from ../../../src/include/postgres_fe.h:25,
>>                      from specparse.y:13:
>>    /usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"
>
> I'd noticed this one as well. This sounds like an installation problem,
> not really ours. Dave, any chance you could look into this, or give
> somebody an account to test what's up?

I'm not going to be able to look at this, at least this week. I can
give someone on the EDB team access - Robert, can one of your guys
take a look?

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: problems on Solaris

From

Robert Haas

Date:

27 May 2015, 19:39:22

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hm. So we have an *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant.  Suppose some kind of barrier operation is in process,
and we've acquired the dummy spinlock but not yet released it.  Just
then, we receive a signal.  Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it.  Oops.
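
To make the hazard concrete, here is a minimal sketch, not PostgreSQL's actual atomics.c code. It assumes the fallback barrier is a plain acquire/release of a dummy lock, modelled here with a C11 atomic_flag, and it simulates the signal handler with a callback rather than a real SIGUSR1. The names dummy_lock, acquire_bounded, fallback_barrier and handler_issuing_barrier are made up for illustration, and the bounded spin stands in for the stuck-spinlock detection.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the dummy spinlock behind the fallback barrier. */
static atomic_flag dummy_lock = ATOMIC_FLAG_INIT;

/* Spin a bounded number of times; return false if the lock never frees up. */
static bool
acquire_bounded(atomic_flag *lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++)
    {
        if (!atomic_flag_test_and_set(lock))
            return true;
    }
    return false;               /* real code would PANIC: stuck spinlock */
}

/*
 * Fallback "barrier": acquire and release the dummy lock.  The optional
 * callback simulates a signal handler that fires while the lock is held.
 * Returns false if this barrier, or one nested inside it, got stuck.
 */
static bool
fallback_barrier(bool (*interrupt) (void))
{
    bool        ok = true;

    if (!acquire_bounded(&dummy_lock, 1000))
        return false;
    if (interrupt != NULL)
        ok = interrupt();       /* "signal" arrives with the lock held */
    atomic_flag_clear(&dummy_lock);
    return ok;
}

/* Simulated handler: like SetLatch, it issues a barrier of its own. */
static bool
handler_issuing_barrier(void)
{
    return fallback_barrier(NULL);
}
```

Calling fallback_barrier(NULL) succeeds, while fallback_barrier(handler_issuing_barrier) reports failure, because the nested call can never obtain a lock that its own process already holds.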

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas@postgresql.org>
Date:   Sat Oct 4 21:25:41 2014 -0400

   Eliminate one background-worker-related flag variable.

   Teach sigusr1_handler() to use the same test for whether a worker
   might need to be started as ServerLoop().  Aside from being perhaps
   a bit simpler, this prevents a potentially-unbounded delay when
   starting a background worker.  On some platforms, select() doesn't
   return when interrupted by a signal, but is instead restarted,
   including a reset of the timeout to the originally-requested value.
   If signals arrive often enough, but no connection requests arrive,
   sigusr1_handler() will be executed repeatedly, but the body of
   ServerLoop() won't be reached.  This change ensures that, even in
   that case, background workers will eventually get launched.

   This is far from a perfect fix; really, we need select() to return
   control to ServerLoop() after an interrupt, either via the self-pipe
   trick or some other mechanism.  But that's going to require more
   work and discussion, so let's do this for now to at least mitigate
   the damage.

   Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster.  To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select directly.  This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.

But all of that seems unrelated to the current problems.
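
For readers unfamiliar with the self-pipe trick mentioned in the commit message, a rough sketch follows. It is a generic POSIX illustration, not the postmaster's code: the names selfpipe, wakeup_handler, selfpipe_init and selfpipe_wait are invented here, and treating any EINTR as a wakeup is a simplification. The point is that the handler makes a file descriptor readable, so select() returns even on platforms that would otherwise restart it.

```c
#define _POSIX_C_SOURCE 200809L /* for sigaction() under strict -std modes */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

static int  selfpipe[2] = {-1, -1};

/* Handler: just poke the pipe; write() is async-signal-safe. */
static void
wakeup_handler(int signo)
{
    int         save_errno = errno;

    (void) signo;
    if (write(selfpipe[1], "x", 1) < 0)
    {
        /* pipe full: a wakeup is already pending, nothing to do */
    }
    errno = save_errno;
}

/* Create the non-blocking pipe and install the handler for signo. */
static int
selfpipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(selfpipe) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int         flags = fcntl(selfpipe[i], F_GETFL);

        if (flags < 0 || fcntl(selfpipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = wakeup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Returns 1 if a signal woke us, 0 on timeout, -1 on other errors. */
static int
selfpipe_wait(int timeout_ms)
{
    fd_set      rfds;
    struct timeval tv;
    char        buf[16];
    int         rc;

    FD_ZERO(&rfds);
    FD_SET(selfpipe[0], &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    rc = select(selfpipe[0] + 1, &rfds, NULL, NULL, &tv);
    if (rc > 0 || (rc < 0 && errno == EINTR))
    {
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;                   /* drain pending wakeup bytes */
        return 1;
    }
    return rc;
}
```

After selfpipe_init(SIGUSR1), a signal delivered before or during selfpipe_wait() leaves a byte in the pipe, so the wait returns promptly instead of sleeping for the full timeout.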

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: problems on Solaris

From

Andres Freund

Date:

27 May 2015, 22:55:38

On 2015-05-27 15:39:14 -0400, Robert Haas wrote:
> On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hm. So we have an *occasional* stack size exceeded failure and an
> > occasional spinlock error in test_shm_mq. I'm inclined to think that
> > this is a shm_mq problem, and not a more general locking problem - it
> > seems likely, but not guaranteed, that that'd have materialized
> > elsewhere.
>
> I think the problem might be that the spinlock-based memory barrier is
> not re-entrant.  Suppose some kind of barrier operation is in process,
> and we've acquired the dummy spinlock but not yet released it.  Just
> then, we receive a signal.  Since the shm_mq code sets
> set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
> SetLatch now includes barrier operations, so we'll try to acquire and
> release the spinlock despite already holding it.  Oops.

Oh wow, that's bad, and could explain a couple of the problems we're
seeing. One possible way to fix it is to replace the sequence with if
(!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
even if the lock isn't free - which e.g. isn't the case for PowerPC's
implementation :(
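
A minimal sketch of that try-lock shape, using a C11 atomic_flag rather than PostgreSQL's TAS()/S_UNLOCK() macros (barrier_lock and trylock_barrier are made-up names). C11's atomic_flag_test_and_set is sequentially consistent whether or not it obtains the flag, which is exactly the property the real TAS() is not guaranteed to have on every platform, so take this as an illustration of the control flow only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dummy lock used only to get barrier semantics. */
static atomic_flag barrier_lock = ATOMIC_FLAG_INIT;

/*
 * Try-lock barrier: one test-and-set, unlock only if we got the lock.
 * It never spins, so re-entry from a signal handler cannot hang.
 * Returns true if this call took (and released) the lock.
 */
static bool
trylock_barrier(void)
{
    if (!atomic_flag_test_and_set(&barrier_lock))
    {
        atomic_flag_clear(&barrier_lock);
        return true;
    }
    /* Lock already held: we rely on the failed TAS itself being a barrier. */
    return false;
}
```

If the flag is already set when trylock_barrier() runs, as it would be in an interrupted outer call, the function simply returns false instead of waiting.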

Re: problems on Solaris

From

Robert Haas

Date:

28 May 2015, 01:23:39

On Wed, May 27, 2015 at 6:55 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-27 15:39:14 -0400, Robert Haas wrote:
>> On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hm. So we have an *occasional* stack size exceeded failure and an
>> > occasional spinlock error in test_shm_mq. I'm inclined to think that
>> > this is a shm_mq problem, and not a more general locking problem - it
>> > seems likely, but not guaranteed, that that'd have materialized
>> > elsewhere.
>>
>> I think the problem might be that the spinlock-based memory barrier is
>> not re-entrant.  Suppose some kind of barrier operation is in process,
>> and we've acquired the dummy spinlock but not yet released it.  Just
>> then, we receive a signal.  Since the shm_mq code sets
>> set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
>> SetLatch now includes barrier operations, so we'll try to acquire and
>> release the spinlock despite already holding it.  Oops.
>
> Oh wow, that's bad, and could explain a couple of the problems we're
> seeing. One possible way to fix is to replace the sequence with if
> (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> even if the lock isn't free - which e.g. isn't the case for PowerPC's
> implementation :(

Another possibility is to make the fallback barrier implementation a
system call, like maybe kill(PostmasterPid, 0).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: problems on Solaris

From

Andres Freund

Date:

30 May 2015, 23:09:37

On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> > Oh wow, that's bad, and could explain a couple of the problems we're
> > seeing. One possible way to fix is to replace the sequence with if
> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> > implementation :(
> 
> Another possibility is to make the fallback barrier implementation a
> system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective
barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
local error checking. It might be that the process existence check
includes a lock that's sufficient, but I would not like to rely on
it. Sending an actual signal probably would be, but has the potential of
disrupting postmaster progress.

I think we should just bite the bullet and require a barrier
implementation for all architectures that have spinlock support. That
should be a fairly straightforward, even though distinctly unpleasant,
exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
doesn't have the issue that spinlocks have) for --disable-spinlock
platforms.

If people agree with that way forward, I'll go through the
platforms. The biggest one missing is probably Solaris with Sun's
compiler.

Greetings,

Andres Freund

Re: problems on Solaris

From

Robert Haas

Date:

31 May 2015, 12:00:48

On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
>> > Oh wow, that's bad, and could explain a couple of the problems we're
>> > seeing. One possible way to fix is to replace the sequence with if
>> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
>> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
>> > implementation :(
>>
>> Another possibility is to make the fallback barrier implementation a
>> system call, like maybe kill(PostmasterPid, 0).
>
> It's not necessarily true that all system calls are effective
> barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> local error checking. It might be that the process existence check
> includes a lock that's sufficient, but I would not like to rely on
> it. Sending an actual signal probably would be, but has the potential of
> disrupting postmaster progress.

So pick a better system call?

> I think we should just bite the bullet and require a barrier
> implementation for all architectures that have spinlock support. That
> should be fairly straightforward, even though distinctly unpleasurable,
> exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
> doesn't have the issue that spinlocks have) for --disable-spinlock
> platforms.

Like maybe this.

> If people agree with that way forward, I'll go through the
> platforms. The biggest one missing is probably solaris with sun's
> compiler.

Certainly, having real barriers everywhere would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: problems on Solaris

From

Andres Freund

Date:

01 June 2015, 09:02:54

On 2015-05-31 08:00:44 -0400, Robert Haas wrote:
> On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> >> > Oh wow, that's bad, and could explain a couple of the problems we're
> >> > seeing. One possible way to fix is to replace the sequence with if
> >> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> >> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> >> > implementation :(
> >>
> >> Another possibility is to make the fallback barrier implementation a
> >> system call, like maybe kill(PostmasterPid, 0).
> >
> > It's not necessarily true that all system calls are effective
> > barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> > local error checking. It might be that the process existence check
> > includes a lock that's sufficient, but I would not like to rely on
> > it. Sending an actual signal probably would be, but has the potential of
> > disrupting postmaster progress.
> 
> So pick a better system call?

It's not yet entirely clear what that'd be unfortunately. Maybe we could
use waitpid(PostmasterPid, status, WNOHANG) - afaics that should work.

> > I think we should just bite the bullet and require a barrier
> > implementation for all architectures that have spinlock support. That
> > should be fairly straightforward, even though distinctly unpleasurable,
> > exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
> > doesn't have the issue that spinlocks have) for --disable-spinlock
> > platforms.
> 
> Like maybe this.

On second thought they're unfortunately not entirely suitable. While
we've used semaphores in signal handlers indirectly for a long while
(e.g. deadlock detector, sinval code etc), they're formally not
guaranteed to be signal safe.

Greetings,

Andres Freund

Re: problems on Solaris

From

Andres Freund

Date:

24 June 2015, 12:42:15

On 2015-05-31 01:09:18 +0200, Andres Freund wrote:
> On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> > > Oh wow, that's bad, and could explain a couple of the problems we're
> > > seeing. One possible way to fix is to replace the sequence with if
> > > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> > > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> > > implementation :(
> > 
> > Another possibility is to make the fallback barrier implementation a
> > system call, like maybe kill(PostmasterPid, 0).
> 
> It's not necessarily true that all system calls are effective
> barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> local error checking. It might be that the process existence check
> includes a lock that's sufficient, but I would not like to rely on
> it. Sending an actual signal probably would be, but has the potential of
> disrupting postmaster progress.

I thought about various other syscalls we could use, and your proposal
seems to be the least bad. My idea of using waitpid() falls short because
it only works for child processes.  I think the kind of systems that we
don't have barriers on are unlikely to use complex stuff like RCU to
manage access to process hierarchies.
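
The waitpid() limitation is easy to check with a small sketch (waitpid_rejects_non_child is a made-up helper name). POSIX specifies ECHILD when the given pid is not a child of the caller, which is the situation of a backend asking about the postmaster.

```c
#define _POSIX_C_SOURCE 200809L /* for pid_t/waitpid() under strict -std modes */
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if waitpid() rejects pid with ECHILD, i.e. pid is not a child
 * of the calling process, and 0 otherwise.
 */
static int
waitpid_rejects_non_child(pid_t pid)
{
    int         status;

    errno = 0;
    return waitpid(pid, &status, WNOHANG) == -1 && errno == ECHILD;
}
```

Passing the caller's own pid or its parent's pid (getppid()) should yield 1, since neither is a child of the caller.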

I reproduced the 'stuck' issue on x86 by #ifdef'ing out barrier support
- about 50% of the time test_shm_mq gets stuck. Replacing it with
kill(PostmasterPid, 0) "works". Unless somebody protests soon that's
what I'm going to commit. It surely is better than easily reproducible
hangs.
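
As a sketch of what such a syscall-based fallback could look like (syscall_fallback_barrier is a hypothetical name, and the target pid is passed in rather than read from PostmasterPid): signal 0 delivers nothing and only asks the kernel to check that the target exists and may be signalled. Whether that check implies a full memory barrier on a given platform is the assumption discussed above, not something this code can prove. Since no lock is held, re-entry from a signal handler is harmless.

```c
#define _POSIX_C_SOURCE 200809L /* for kill() under strict -std modes */
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical fallback barrier: kill() with signal 0 sends no signal and
 * only performs the existence/permission check on target_pid.
 * Returns kill()'s result: 0 on success, -1 with errno set on failure.
 */
static int
syscall_fallback_barrier(pid_t target_pid)
{
    return kill(target_pid, 0);
}
```

For a quick check outside PostgreSQL, syscall_fallback_barrier(getpid()) returns 0.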

I'm wondering whether we should add a #warning to atomics.c if either the
fallback memory or compiler barrier is used? Might be annoying to people
using -Werror, but I doubt that's possible anyway on such old systems.

Greetings,

Andres Freund

Re: problems on Solaris

From

Robert Haas

Date:

24 June 2015, 17:12:23

On Wed, Jun 24, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm wondering whether we should add a #warning to atomics.c if either the
> fallback memory or compiler barrier is used? Might be annoying to people
> using -Werror, but I doubt that's possible anyway on such old systems.

#warning isn't totally portable, so I think it might be better not to
do that.  Yeah, it'll work in a lot of places, but the sorts of
obscure systems where the fallbacks are used are also more likely to
have funky compilers that just barf on the directive outright.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company