Thread: problems on Solaris
Buildfarm members casteroides and protosciurus have been having some problems that seem puzzling. These animals both run on the same machine, but with different compilers.

casteroides runs with the Sun Studio 12 compiler, and has twice in the last 3 days demonstrated this error:

[5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
[5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
[5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
[5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
[5561ce0c.51b7:28] LOG: terminating any other active server processes

It's not constant - between the two failures was a success.

protosciurus runs with gcc 3.4.3 and gets this error:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c
In file included from /usr/include/sys/vnode.h:47,
                 from /usr/include/sys/stream.h:22,
                 from /usr/include/netinet/in.h:66,
                 from /usr/include/netdb.h:98,
                 from ../../../src/include/port.h:17,
                 from ../../../src/include/c.h:1114,
                 from ../../../src/include/postgres_fe.h:25,
                 from specparse.y:13:
/usr/include/sys/kstat.h:439: error: syntax error before numeric constant
/usr/include/sys/kstat.h:463: error: syntax error before '}' token
/usr/include/sys/kstat.h:464: error: syntax error before '}' token
In file included from /usr/include/sys/stream.h:22,
                 from /usr/include/netinet/in.h:66,
                 from /usr/include/netdb.h:98,
                 from ../../../src/include/port.h:17,
                 from ../../../src/include/c.h:1114,
                 from ../../../src/include/postgres_fe.h:25,
                 from specparse.y:13:
/usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"
/usr/include/sys/vnode.h:107: error: syntax error before "nread"
/usr/include/sys/vnode.h:108: error: syntax error before "read_bytes"
/usr/include/sys/vnode.h:109: error: syntax error before "nwrite"
/usr/include/sys/vnode.h:110: error: syntax error before "write_bytes"
/usr/include/sys/vnode.h:111: error: syntax error before "nioctl"
/usr/include/sys/vnode.h:112: error: syntax error before "nsetfl"
/usr/include/sys/vnode.h:113: error: syntax error before "ngetattr"
/usr/include/sys/vnode.h:114: error: syntax error before "nsetattr"
/usr/include/sys/vnode.h:115: error: syntax error before "naccess"
/usr/include/sys/vnode.h:116: error: syntax error before "nlookup"
/usr/include/sys/vnode.h:117: error: syntax error before "ncreate"
/usr/include/sys/vnode.h:118: error: syntax error before "nremove"
/usr/include/sys/vnode.h:119: error: syntax error before "nlink"
/usr/include/sys/vnode.h:120: error: syntax error before "nrename"
/usr/include/sys/vnode.h:121: error: syntax error before "nmkdir"
/usr/include/sys/vnode.h:122: error: syntax error before "nrmdir"
/usr/include/sys/vnode.h:123: error: syntax error before "nreaddir"
/usr/include/sys/vnode.h:124: error: syntax error before "readdir_bytes"
/usr/include/sys/vnode.h:125: error: syntax error before "nsymlink"
/usr/include/sys/vnode.h:126: error: syntax error before "nreadlink"
/usr/include/sys/vnode.h:127: error: syntax error before "nfsync"
/usr/include/sys/vnode.h:128: error: syntax error before "ninactive"
/usr/include/sys/vnode.h:129: error: syntax error before "nfid"
/usr/include/sys/vnode.h:130: error: syntax error before "nrwlock"
/usr/include/sys/vnode.h:131: error: syntax error before "nrwunlock"
/usr/include/sys/vnode.h:132: error: syntax error before "nseek"
/usr/include/sys/vnode.h:133: error: syntax error before "ncmp"
/usr/include/sys/vnode.h:134: error: syntax error before "nfrlock"
/usr/include/sys/vnode.h:135: error: syntax error before "nspace"
/usr/include/sys/vnode.h:136: error: syntax error before "nrealvp"
/usr/include/sys/vnode.h:137: error: syntax error before "ngetpage"
/usr/include/sys/vnode.h:138: error: syntax error before "nputpage"
/usr/include/sys/vnode.h:139: error: syntax error before "nmap"
/usr/include/sys/vnode.h:140: error: syntax error before "naddmap"
/usr/include/sys/vnode.h:141: error: syntax error before "ndelmap"
/usr/include/sys/vnode.h:142: error: syntax error before "npoll"
/usr/include/sys/vnode.h:143: error: syntax error before "ndump"
/usr/include/sys/vnode.h:144: error: syntax error before "npathconf"
/usr/include/sys/vnode.h:145: error: syntax error before "npageio"
/usr/include/sys/vnode.h:146: error: syntax error before "ndumpctl"
/usr/include/sys/vnode.h:147: error: syntax error before "ndispose"
/usr/include/sys/vnode.h:148: error: syntax error before "nsetsecattr"
/usr/include/sys/vnode.h:149: error: syntax error before "ngetsecattr"
/usr/include/sys/vnode.h:150: error: syntax error before "nshrlock"
/usr/include/sys/vnode.h:151: error: syntax error before "nvnevent"
gmake: *** [specparse.o] Error 1

cheers

andrew
On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote:
>
> Buildfarm members casteroides and protosciurus have been having some
> problems that seem puzzling. These animals both run on the same machine, but
> with different compilers.
>
> casteroides runs with the Sun Studio 12 compiler, and has twice in the last
> 3 days demonstrated this error:
>
> [5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq"
> [5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30
> [5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
> [5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6
> [5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
> [5561ce0c.51b7:28] LOG: terminating any other active server processes
>
> It's not constant - between the two failures was a success.

That's indeed rather odd. For one, the relevant code does nothing but lock/unlock a spinlock. For another, there's been no recent change to this and casteroides has been running happily for a long time.

> protosciurus runs with gcc 3.4.3 and gets this error:
>
> gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include -m64 -I. -I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c
> In file included from /usr/include/sys/vnode.h:47,
>                  from /usr/include/sys/stream.h:22,
>                  from /usr/include/netinet/in.h:66,
>                  from /usr/include/netdb.h:98,
>                  from ../../../src/include/port.h:17,
>                  from ../../../src/include/c.h:1114,
>                  from ../../../src/include/postgres_fe.h:25,
>                  from specparse.y:13:
> /usr/include/sys/kstat.h:439: error: syntax error before numeric constant
> /usr/include/sys/kstat.h:463: error: syntax error before '}' token
> /usr/include/sys/kstat.h:464: error: syntax error before '}' token
> In file included from /usr/include/sys/stream.h:22,
>                  from /usr/include/netinet/in.h:66,
>                  from /usr/include/netdb.h:98,
>                  from ../../../src/include/port.h:17,
>                  from ../../../src/include/c.h:1114,
>                  from ../../../src/include/postgres_fe.h:25,
>                  from specparse.y:13:
> /usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t"

I'd noticed this one as well. This sounds like an installation problem, not really ours. Dave, any chance you could look into this, or give somebody an account to test what's up?

Greetings,

Andres Freund
On 05/24/2015 08:07 PM, Andres Freund wrote: > On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote: >> Buildfarm members casteroides and protosciurus have been having some >> problems that seem puzzling. These animals both run on the same machine, but >> with different compilers. >> >> casteroides runs with the Sun Studio 12 compiler, and has twice in the last >> 3 days demonstrated this error: >> >> [5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq" >> [5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30 >> [5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int),'') from generate_series(1,270000)), 200, 3); >> [5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6 >> [5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int),'') from generate_series(1,270000)), 200, 3); >> [5561ce0c.51b7:28] LOG: terminating any other active server processes >> >> It's not constant - between the two failures was a success. > That's indeed rather odd. For one the relevant code does nothing but > lock/unlock a spinlock. For another, there's been no recent change to > this and casteroides has been running happily for a long time. > > Yes, but it wasn't running these tests until a few days ago when its buildfarm software was upgraded. cheers andrew
On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote: > Yes, but it wasn't running these tests until a few days ago when its > buildfarm software was upgraded. But barriers are used in other places too...
On 05/25/2015 03:17 AM, Andres Freund wrote: > On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote: >> Yes, but it wasn't running these tests until a few days ago when its >> buildfarm software was upgraded. > > But barriers are used in other places too... fwiw: spoonbill just failed in the same part of the regression tests (and it is a Sparc64 box though not running solaris): http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07 Stefan
On 2015-05-25 09:12:35 +0200, Stefan Kaltenbrunner wrote:
> On 05/25/2015 03:17 AM, Andres Freund wrote:
> > On 2015-05-24 21:01:54 -0400, Andrew Dunstan wrote:
> >> Yes, but it wasn't running these tests until a few days ago when its
> >> buildfarm software was upgraded.
> >
> > But barriers are used in other places too...
>
> fwiw: spoonbill just failed in the same part of the regression tests
> (and it is a Sparc64 box though not running solaris):
>
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2015-05-24%2023%3A00%3A07

With a quite different error though:
PANIC: ERRORDATA_STACK_SIZE exceeded

Hm. So we have an *occasional* stack size exceeded failure and an occasional spinlock error in test_shm_mq. I'm inclined to think that this is a shm_mq problem, and not a more general locking problem - a general locking problem would likely, though not certainly, have materialized elsewhere.

Robert: IIRC there were some problems with shm_mq tests being stuck before, right?

Greetings,

Andres Freund
On Mon, May 25, 2015 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-05-24 19:44:37 -0400, Andrew Dunstan wrote: >> >> Buildfarm members casteroides and protosciurus have been having some >> problems that seem puzzling. These animals both run on the same machine, but >> with different compilers. >> >> casteroides runs with the Sun Studio 12 compiler, and has twice in the last >> 3 days demonstrated this error: >> >> [5561ce0c.51b7:25] LOG: starting background worker process "test_shm_mq" >> [5561ce1e.5287:9] PANIC: stuck spinlock (100cb77f4) detected at atomics.c:30 >> [5561ce1e.5287:10] STATEMENT: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '')from generate_series(1,270000)), 200, 3); >> [5561ce0c.51b7:26] LOG: server process (PID 21127) was terminated by signal 6 >> [5561ce0c.51b7:27] DETAIL: Failed process was running: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int),'') from generate_series(1,270000)), 200, 3); >> [5561ce0c.51b7:28] LOG: terminating any other active server processes >> >> It's not constant - between the two failures was a success. > > That's indeed rather odd. For one the relevant code does nothing but > lock/unlock a spinlock. For another, there's been no recent change to > this and casteroides has been running happily for a long time. > >> protociurus runs with gcc 3.4.3 and gets this error: >> >> gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute-Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -I/usr/local/include-m64 -I. 
-I../../../src/interfaces/libpq -I./../regress -I../../../src/include -c -o specparse.o specparse.c >> In file included from /usr/include/sys/vnode.h:47, >> from /usr/include/sys/stream.h:22, >> from /usr/include/netinet/in.h:66, >> from /usr/include/netdb.h:98, >> from ../../../src/include/port.h:17, >> from ../../../src/include/c.h:1114, >> from ../../../src/include/postgres_fe.h:25, >> from specparse.y:13: >> /usr/include/sys/kstat.h:439: error: syntax error before numeric constant >> /usr/include/sys/kstat.h:463: error: syntax error before '}' token >> /usr/include/sys/kstat.h:464: error: syntax error before '}' token >> In file included from /usr/include/sys/stream.h:22, >> from /usr/include/netinet/in.h:66, >> from /usr/include/netdb.h:98, >> from ../../../src/include/port.h:17, >> from ../../../src/include/c.h:1114, >> from ../../../src/include/postgres_fe.h:25, >> from specparse.y:13: >> /usr/include/sys/vnode.h:105: error: syntax error before "kstat_named_t" > > I'd noticed this one as well. This sounds like a installation problem, > not really ours. Dave, any chance you could look into this, or give > somebody an account to test what's up? I'm not going to be able to look at this, at least this week. I can give someone on the EDB team access - Robert; can one of your guys take a look? -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hm. So we have a *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is not re-entrant. Suppose some kind of barrier operation is in progress, and we've acquired the dummy spinlock but not yet released it. Just then, we receive a signal. Since the shm_mq code sets set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch. SetLatch now includes barrier operations, so we'll try to acquire and release the spinlock despite already holding it. Oops.

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas@postgresql.org>
Date: Sat Oct 4 21:25:41 2014 -0400

    Eliminate one background-worker-related flag variable.

    Teach sigusr1_handler() to use the same test for whether a worker
    might need to be started as ServerLoop(). Aside from being perhaps
    a bit simpler, this prevents a potentially-unbounded delay when
    starting a background worker. On some platforms, select() doesn't
    return when interrupted by a signal, but is instead restarted,
    including a reset of the timeout to the originally-requested value.
    If signals arrive often enough, but no connection requests arrive,
    sigusr1_handler() will be executed repeatedly, but the body of
    ServerLoop() won't be reached. This change ensures that, even in
    that case, background workers will eventually get launched.

    This is far from a perfect fix; really, we need select() to return
    control to ServerLoop() after an interrupt, either via the self-pipe
    trick or some other mechanism. But that's going to require more work
    and discussion, so let's do this for now to at least mitigate the
    damage.

    Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the postmaster. To really make this work properly, we need to be able to use latches in the postmaster, and we need to generalize WaitLatchOrSocket so that it can wait for a latch or any of n sockets. Then ServerLoop can use that instead of calling select directly. This will probably look a lot like what you did to get rid of ImmediateInterruptOK.

But all of that seems unrelated to the current problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
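The re-entrancy hazard described in this message can be sketched in plain C. This is only an illustration under stated assumptions, not PostgreSQL's code: `dummy_lock`, `acquire_bounded`, `fallback_barrier`, and `simulate_interrupted_barrier` are hypothetical names, a C11 `atomic_flag` stands in for the dummy spinlock, and a bounded spin count stands in for the "stuck spinlock" detection. The "signal handler" is simulated by a direct call made while the lock is held, so the sketch returns instead of hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dummy spinlock behind the fallback barrier. */
static atomic_flag dummy_lock = ATOMIC_FLAG_INIT;

/* Spin at most max_spins times; false means the lock never became free. */
static bool
acquire_bounded(atomic_flag *lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++)
    {
        if (!atomic_flag_test_and_set(lock))
            return true;        /* lock was free, now held by us */
    }
    return false;               /* "stuck spinlock" */
}

/* Spinlock-based fallback barrier: acquire, then release. */
static bool
fallback_barrier(int max_spins)
{
    if (!acquire_bounded(&dummy_lock, max_spins))
        return false;
    atomic_flag_clear(&dummy_lock);
    return true;
}

/*
 * Simulate a signal arriving between acquire and release: the "handler"
 * runs the same barrier while the interrupted code still holds the lock.
 * Returns what the handler's barrier call reported.
 */
static bool
simulate_interrupted_barrier(int max_spins)
{
    bool handler_ok;

    atomic_flag_test_and_set(&dummy_lock);      /* outer barrier takes the lock */
    handler_ok = fallback_barrier(max_spins);   /* "signal handler" runs here */
    atomic_flag_clear(&dummy_lock);             /* outer barrier releases */
    return handler_ok;
}
```

Called on its own, `fallback_barrier` succeeds; called from the simulated handler, it can never acquire the lock, because the only code that could release it is the code it interrupted.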
On 2015-05-27 15:39:14 -0400, Robert Haas wrote:
> On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hm. So we have a *occasional* stack size exceeded failure and an
> > occasional spinlock error in test_shm_mq. I'm inclined to think that
> > this is a shm_mq problem, and not a more general locking problem - it
> > seems likely, but not guaranteed, that that'd have materialized
> > elsewhere.
>
> I think the problem might be that the spinlock-based memory barrier is
> not re-entrant. Suppose some kind of barrier operation is in process,
> and we've acquired the dummy spinlock but not yet released it. Just
> then, we receive a signal. Since the shm_mq code sets
> set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
> SetLatch now includes barrier operations, so we'll try to acquire and
> release the spinlock despite already holding it. Oops.

Oh wow, that's bad, and could explain a couple of the problems we're seeing. One possible fix is to replace the lock/unlock sequence with if (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier, even if the lock isn't free - which e.g. isn't the case for PowerPC's implementation :(
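The `if (!TAS(spin)) S_UNLOCK();` idea can be sketched with C11 atomics. This is a hedged illustration only: `barrier_lock` and `tas_barrier` are hypothetical names, and `atomic_flag_test_and_set` stands in for TAS(). The sketch shows the control flow, which never spins and so cannot hang when re-entered from a signal handler. Whether the test-and-set also orders memory when the lock is already held depends on the platform's TAS(), which is exactly the concern raised above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical dummy lock used only for its barrier side effect. */
static atomic_flag barrier_lock = ATOMIC_FLAG_INIT;

/*
 * Try the lock once and release it only if we got it.
 * Returns true if this call took and released the lock,
 * false if the lock was already held (for example by interrupted code).
 */
static bool
tas_barrier(void)
{
    if (!atomic_flag_test_and_set(&barrier_lock))
    {
        atomic_flag_clear(&barrier_lock);
        return true;
    }
    return false;
}
```

The return value exists only so the two paths can be observed; a real barrier would ignore it.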
On Wed, May 27, 2015 at 6:55 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-27 15:39:14 -0400, Robert Haas wrote:
>> On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hm. So we have a *occasional* stack size exceeded failure and an
>> > occasional spinlock error in test_shm_mq. I'm inclined to think that
>> > this is a shm_mq problem, and not a more general locking problem - it
>> > seems likely, but not guaranteed, that that'd have materialized
>> > elsewhere.
>>
>> I think the problem might be that the spinlock-based memory barrier is
>> not re-entrant. Suppose some kind of barrier operation is in process,
>> and we've acquired the dummy spinlock but not yet released it. Just
>> then, we receive a signal. Since the shm_mq code sets
>> set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
>> SetLatch now includes barrier operations, so we'll try to acquire and
>> release the spinlock despite already holding it. Oops.
>
> Oh wow, that's bad, and could explain a couple of the problems we're
> seeing. One possible way to fix is to replace the sequence with if
> (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> even if the lock isn't free - which e.g. isn't the case for PowerPC's
> implementation :(

Another possibility is to make the fallback barrier implementation a system call, like maybe kill(PostmasterPid, 0).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> > Oh wow, that's bad, and could explain a couple of the problems we're
> > seeing. One possible way to fix is to replace the sequence with if
> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> > implementation :(
>
> Another possibility is to make the fallback barrier implementation a
> system call, like maybe kill(PostmasterPid, 0).

It's not necessarily true that all system calls are effective barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs local error checking. It might be that the process existence check includes a lock that's sufficient, but I would not like to rely on it. Sending an actual signal probably would be, but has the potential of disrupting postmaster progress.

I think we should just bite the bullet and require a barrier implementation for all architectures that have spinlock support. That should be a fairly straightforward, even if distinctly unpleasant, exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock() doesn't have the issue that spinlocks have) for --disable-spinlock platforms.

If people agree with that way forward, I'll go through the platforms. The biggest one missing is probably solaris with sun's compiler.

Greetings,

Andres Freund
On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
>> > Oh wow, that's bad, and could explain a couple of the problems we're
>> > seeing. One possible way to fix is to replace the sequence with if
>> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
>> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
>> > implementation :(
>>
>> Another possibility is to make the fallback barrier implementation a
>> system call, like maybe kill(PostmasterPid, 0).
>
> It's not necessarily true that all system calls are effective
> barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> local error checking. It might be that the process existence check
> includes a lock that's sufficient, but I would not like to rely on
> it. Sending an actual signal probably would be, but has the potential of
> disrupting postmaster progress.

So pick a better system call?

> I think we should just bite the bullet and require a barrier
> implementation for all architectures that have spinlock support. That
> should be fairly straightforward, even though distinctly unpleasurable,
> exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
> doesn't have the issue that spinlocks have) for --disable-spinlock
> platforms.

Like maybe this.

> If people agree with that way forward, I'll go through the
> platforms. The biggest one missing is probably solaris with sun's
> compiler.

Certainly, having real barriers everywhere would be great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-05-31 08:00:44 -0400, Robert Haas wrote:
> On Sat, May 30, 2015 at 7:09 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> >> > Oh wow, that's bad, and could explain a couple of the problems we're
> >> > seeing. One possible way to fix is to replace the sequence with if
> >> > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> >> > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> >> > implementation :(
> >>
> >> Another possibility is to make the fallback barrier implementation a
> >> system call, like maybe kill(PostmasterPid, 0).
> >
> > It's not necessarily true that all system calls are effective
> > barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> > local error checking. It might be that the process existence check
> > includes a lock that's sufficient, but I would not like to rely on
> > it. Sending an actual signal probably would be, but has the potential of
> > disrupting postmaster progress.
>
> So pick a better system call?

It's not yet entirely clear what that'd be unfortunately. Maybe we could use waitpid(PostmasterPid, status, WNOHANG) - afaics that should work.

> > I think we should just bite the bullet and require a barrier
> > implementation for all architectures that have spinlock support. That
> > should be fairly straightforward, even though distinctly unpleasurable,
> > exercise. And then use semaphores (PGSemaphoreUnlock();PGSemaphoreLock()
> > doesn't have the issue that spinlocks have) for --disable-spinlock
> > platforms.
>
> Like maybe this.

On second thought they're unfortunately not entirely suitable. While we've indirectly used semaphores in signal handlers for a long while (e.g. deadlock detector, sinval code etc), they're formally not guaranteed to be signal safe.

Greetings,

Andres Freund
On 2015-05-31 01:09:18 +0200, Andres Freund wrote:
> On 2015-05-27 21:23:34 -0400, Robert Haas wrote:
> > > Oh wow, that's bad, and could explain a couple of the problems we're
> > > seeing. One possible way to fix is to replace the sequence with if
> > > (!TAS(spin)) S_UNLOCK();. But that'd mean TAS() has to be a barrier,
> > > even if the lock isn't free - which e.g. isn't the case for PowerPC's
> > > implementation :(
> >
> > Another possibility is to make the fallback barrier implementation a
> > system call, like maybe kill(PostmasterPid, 0).
>
> It's not necessarily true that all system calls are effective
> barriers. I'm e.g. doubtful that kill(..., 0) is one as it only performs
> local error checking. It might be that the process existence check
> includes a lock that's sufficient, but I would not like to rely on
> it. Sending an actual signal probably would be, but has the potential of
> disrupting postmaster progress.

I thought about various other syscalls we could use, and your proposal seems to be the least bad. My idea of using waitpid() falls short because it only works for child processes. I think the kind of systems that we don't have barriers on are unlikely to use complex stuff like RCU to manage access to process hierarchies.

I reproduced the 'stuck' issue on x86 by #ifdef'ing out barrier support - about 50% of the time test_shm_mq gets stuck. Replacing it with kill(PostmasterPid, 0) "works".

Unless somebody protests soon that's what I'm going to commit. It surely is better than easily reproducible hangs.

I'm wondering whether we should add a #warning to atomics.c if either the fallback memory or compiler barrier is used? Might be annoying to people using -Werror, but I doubt that's possible anyway on such old systems.

Greetings,

Andres Freund
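The two system calls weighed in this message can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed code: `syscall_barrier` and `waitpid_rejects_non_child` are hypothetical names, and the caller supplies a pid in place of PostmasterPid. The sketch shows only the calls' observable behaviour. Whether `kill(pid, 0)` really acts as a memory barrier is platform-dependent and is not demonstrated here.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical fallback "barrier": signal 0 performs only the existence
 * and permission checks, so no signal is delivered to the target.
 * The result is returned for illustration; a barrier would ignore it.
 */
static bool
syscall_barrier(pid_t target)
{
    return kill(target, 0) == 0;
}

/*
 * waitpid() only accepts children of the caller. For any other process
 * it fails with ECHILD, which is why it cannot be aimed at the
 * postmaster from a backend.
 */
static bool
waitpid_rejects_non_child(pid_t target)
{
    int status;

    errno = 0;
    return waitpid(target, &status, WNOHANG) == -1 && errno == ECHILD;
}
```

Unlike the spinlock-based fallback, neither call takes a user-space lock, so being interrupted by a signal handler that issues the same call cannot self-deadlock.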
On Wed, Jun 24, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm wondering whether we should add a #warning to atomics.c if either the
> fallback memory or compiler barrier is used? Might be annoying to people
> using -Werror, but I doubt that's possible anyway on such old systems.

#warning isn't totally portable, so I think it might be better not to do that. Yeah, it'll work in a lot of places, but the sorts of obscure systems where the fallbacks are used are also more likely to have funky compilers that just barf on the directive outright.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company