Thread: What is happening on buildfarm member crake?
Since contrib/test_shm_mq went into the tree, crake has been crashing intermittently (maybe one time in three) at the contribcheck step. While there's not conclusive proof that test_shm_mq is at fault, the circumstantial evidence is pretty strong. For instance, in the postmaster log for the last failure,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2014-01-19%2013%3A26%3A08
we see

LOG: registering background worker "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12585) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12588) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12586) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12587) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: server process (PID 12584) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes

Comparing the PIDs makes it seem pretty likely that what's crashing is the backend running the test_shm_mq test. Since the connected psql doesn't notice any problem, most likely the crash occurs during backend shutdown after psql has disconnected (which would also explain why no "current query" gets shown in the crash report).

crake is running F16 x86_64, which I don't have installed here. So I tried to reproduce this on RHEL6 and F19 x86_64 machines, with build options matching what crake says it's using, with absolutely no success. I don't think any other buildfarm critters are showing this either, which makes it look like it's possibly specific to the particular compiler version crake has got. Or maybe there's some other factor needed to trigger it.

Anyway, I wonder whether Andrew could get a stack trace from the core dump next time it happens.

regards, tom lane
I wrote:
> I don't think any other buildfarm critters are showing this either,

Scratch that --- closer inspection of the buildfarm logs shows that kouprey and lapwing have suffered identical failures, though much less often than crake. That lets out the theory that it's x86_64 specific, or even 64-bit specific. Still have no idea why I can't reproduce it here.

regards, tom lane
On 01/19/2014 04:14 PM, Tom Lane wrote:
> I wrote:
>> I don't think any other buildfarm critters are showing this either,
> Scratch that --- closer inspection of the buildfarm logs shows that
> kouprey and lapwing have suffered identical failures, though much less
> often than crake. That lets out the theory that it's x86_64 specific, or
> even 64-bit specific. Still have no idea why I can't reproduce it here.

I can give you an account on the machine if you want to play.

Also crake does produce backtraces on core dumps, and they are at the bottom of the buildfarm log. The latest failure backtrace is reproduced below.

================== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
[New LWP 12584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
Program terminated with signal 11, Segmentation fault.
#0  SetLatch (latch=0x1c) at pg_latch.c:509
509         if (latch->is_set)
#0  SetLatch (latch=0x1c) at pg_latch.c:509
#1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
#2  <signal handler called>
#3  _dl_fini () at dl-fini.c:190
#4  0x000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
#5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
#6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
#7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
#8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
#9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
#10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
#11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
#12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

cheers

andrew
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> ================== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> 509         if (latch->is_set)
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> #1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2  <signal handler called>
> #3  _dl_fini () at dl-fini.c:190
> #4  0x000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
> #6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is getting called after we've already completed shmem_exit. And indeed that seems like the sort of thing that would result in dying horribly in just this way. The obvious fix seems to be to check proc_exit_inprogress before doing anything that might touch shared memory, but there are a lot of other SIGUSR1 handlers that don't do that either. However, in those cases, the likely cause of a SIGUSR1 would be a sinval catchup interrupt or a recovery conflict, which aren't likely to be so far delayed that they arrive after we've already disconnected from shared memory. But the dynamic background workers stuff adds a new possible cause of SIGUSR1: the postmaster letting us know that a child has started or died. And that could happen even after we've detached shared memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
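For illustration, the "check proc_exit_inprogress before doing anything that might touch shared memory" guard Robert describes could look roughly like the sketch below. The handler name is made up and the body is deliberately minimal; the real procsignal_sigusr1_handler does considerably more, and the fix eventually adopted takes a different shape (see later in the thread).

    #include "postgres.h"

    #include <errno.h>

    #include "storage/ipc.h"      /* proc_exit_inprogress */
    #include "storage/latch.h"    /* SetLatch */
    #include "storage/proc.h"     /* MyProc */

    /*
     * Sketch of a SIGUSR1 handler that refuses to touch shared memory once
     * proc_exit() has started tearing the process down.
     */
    static void
    example_sigusr1_handler(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        /* proc_exit_inprogress is set early in proc_exit() */
        if (!proc_exit_inprogress && MyProc != NULL)
            SetLatch(&MyProc->procLatch);

        errno = save_errno;
    }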
On 01/19/2014 08:22 PM, Robert Haas wrote:
> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
> getting called after we've already completed shmem_exit. And indeed
> that seems like the sort of thing that would result in dying horribly
> in just this way. The obvious fix seems to be to check
> proc_exit_inprogress before doing anything that might touch shared
> memory, but there are a lot of other SIGUSR1 handlers that don't do
> that either. However, in those cases, the likely cause of a SIGUSR1
> would be a sinval catchup interrupt or a recovery conflict, which
> aren't likely to be so far delayed that they arrive after we've
> already disconnected from shared memory. But the dynamic background
> workers stuff adds a new possible cause of SIGUSR1: the postmaster
> letting us know that a child has started or died. And that could
> happen even after we've detached shared memory.

Is anything happening about this? We're still getting quite a few of these:
<http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/19/2014 08:22 PM, Robert Haas wrote:
>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>> getting called after we've already completed shmem_exit. And indeed
>> that seems like the sort of thing that would result in dying horribly
>> in just this way. The obvious fix seems to be to check
>> proc_exit_inprogress before doing anything that might touch shared
>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>> that either. However, in those cases, the likely cause of a SIGUSR1
>> would be a sinval catchup interrupt or a recovery conflict, which
>> aren't likely to be so far delayed that they arrive after we've
>> already disconnected from shared memory. But the dynamic background
>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>> letting us know that a child has started or died. And that could
>> happen even after we've detached shared memory.

> Is anything happening about this? We're still getting quite a few of
> these:
> <http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>

Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible, then this is really just one instance of a bug that's probably pretty widespread in our signal handlers. Somebody needs to go through 'em all and look for touches of shared memory.

I'm not sure if we can just disable signal response the moment the proc_exit_inprogress flag goes up, though. In some cases such as lock handling, it's likely that we need that functionality to keep working for some part of the shutdown process. We might end up having to disable individual signal handlers at appropriate places.

Ick.

regards, tom lane
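One way to "disable individual signal handlers at appropriate places", as Tom suggests, would be to swap SIG_IGN back in at the point where the handler can no longer run safely. The sketch below illustrates only that idea; the function names, and the choice of on_shmem_exit() as the hook, are assumptions made for the example rather than anything that was committed.

    #include "postgres.h"

    #include <signal.h>

    #include "storage/ipc.h"      /* on_shmem_exit */

    /*
     * Hypothetical callback: stop responding to SIGUSR1 so that a late
     * postmaster notification can't touch memory we no longer own.
     */
    static void
    disable_sigusr1_response(int code, Datum arg)
    {
        pqsignal(SIGUSR1, SIG_IGN);
    }

    /*
     * on_shmem_exit callbacks run in reverse order of registration, so
     * registering this one early in startup makes it fire late in the
     * shutdown sequence.
     */
    void
    register_sigusr1_disable(void)
    {
        on_shmem_exit(disable_sigusr1_response, (Datum) 0);
    }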
On 01/25/2014 05:04 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/19/2014 08:22 PM, Robert Haas wrote:
>>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>>> getting called after we've already completed shmem_exit. And indeed
>>> that seems like the sort of thing that would result in dying horribly
>>> in just this way. The obvious fix seems to be to check
>>> proc_exit_inprogress before doing anything that might touch shared
>>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>>> that either. However, in those cases, the likely cause of a SIGUSR1
>>> would be a sinval catchup interrupt or a recovery conflict, which
>>> aren't likely to be so far delayed that they arrive after we've
>>> already disconnected from shared memory. But the dynamic background
>>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>>> letting us know that a child has started or died. And that could
>>> happen even after we've detached shared memory.
>> Is anything happening about this? We're still getting quite a few of
>> these:
>> <http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>
> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em
> all and look for touches of shared memory.
>
> I'm not sure if we can just disable signal response the moment the
> proc_exit_inprogress flag goes up, though. In some cases such as lock
> handling, it's likely that we need that functionality to keep working
> for some part of the shutdown process. We might end up having to disable
> individual signal handlers at appropriate places.
>
> Ick.

Yeah. Since we're now providing for user-defined backends, maybe we need some mechanism for white-listing handlers that can run, in whole or in part, during shutdown.

cheers

andrew
On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em
> all and look for touches of shared memory.

I haven't made a comprehensive study of every signal handler we have, but looking at procsignal_sigusr1_handler, the list of functions that can get called from here is quite short: CheckProcSignal(), RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler(). Taking those in reverse order:

- latch_sigusr1_handler() is fine. Nothing down this path touches shared memory; moreover, if we've already disowned our latch, the waiting flag won't be set and this will do nothing at all.

- The call to SetLatch() is problematic as we already know. This is new code in 9.4.

- RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is set. So it's fine.

- CheckProcSignal() also appears problematic. If we've already detached shared memory, MyProcSignalSlot will be pointing to garbage, but we'll try to dereference it anyway.

I think maybe the best fix is to *clear* MyProc in ProcKill/AuxiliaryProcKill and MyProcSignalSlot in CleanupProcSignalState, as shown in the attached patch. Most places that dereference those pointers already check that they aren't null, and we can easily add a NULL guard to the SetLatch() call in procsignal_sigusr1_handler, which the attached patch also does.

This might not be a complete fix to every problem of this type that exists anywhere in our code, but I think it's enough to make the world safe for procsignal_sigusr1_handler. We also have a *large* number of signal handlers that do little more than this:

    if (MyProc)
        SetLatch(&MyProc->procLatch);

...and this change should make all of those safe as well. So I think this is a pretty good start.

Assuming nobody objects too much to this basic approach, should I back-patch the parts of this that apply pre-9.4? The problem with CleanupProcSignalState, at least, goes all the way back to 9.0, when the signal-multiplexing infrastructure was introduced. But the probability of an actual crash must be pretty low, or I imagine we would have noticed this sooner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
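The patch itself is attached to the original message rather than quoted in the thread, so here is a rough sketch of the shape of the fix Robert describes, reconstructed from his description. The _sketch names are placeholders, the real ProcKill and procsignal_sigusr1_handler do quite a bit more, and details may differ from the committed code.

    #include "postgres.h"

    #include "storage/latch.h"
    #include "storage/proc.h"

    /*
     * 1. In the exit path, clear the global pointer *before* releasing the
     *    structure it points to, so a signal handler firing later in
     *    shutdown sees NULL instead of a stale pointer.
     */
    static void
    ProcKill_sketch(int code, Datum arg)
    {
        PGPROC     *proc = MyProc;      /* keep a local copy for cleanup */

        MyProc = NULL;                  /* clear first ... */
        DisownLatch(&proc->procLatch);  /* ... then finish releasing it */
        /* (the real ProcKill also returns the PGPROC to a free list, etc.) */
    }

    /*
     * 2. In the handler, guard against the pointer having been cleared.
     *    (errno save/restore omitted here for brevity.)
     */
    static void
    sigusr1_handler_sketch(SIGNAL_ARGS)
    {
        if (MyProc != NULL)
            SetLatch(&MyProc->procLatch);
    }

The same ordering rule, clear the global and only then disown the latch, is what Tom later points out is missing from WalSndKill.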
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em
>> all and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> but looking at procsignal_sigusr1_handler, the list of functions that
> can get called from here is quite short: CheckProcSignal(),
> RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().

You forgot HandleCatchupInterrupt and HandleNotifyInterrupt, but those are also safe because they do nothing once proc_exit_inprogress is set.

> I think maybe the best fix is to *clear* MyProc in
> ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
> CleanupProcSignalState, as shown in the attached patch.

This looks good to me in principle. A couple minor beefs:

* The addition to CleanupProcSignalState could use a comment, similar to the one you added in ProcKill.

* I think the code in ProcKill and AuxiliaryProcKill might be more readable if the new local variable was named "myproc" (lower case).

> and we can easily add a NULL guard to the SetLatch() call in
> procsignal_sigusr1_handler, which the attached patch also does.

Um ... no such change actually visible in patch, but it's clearly necessary.

> This might not be a complete fix to every problem of this type that
> exists anywhere in our code, but I think it's enough to make the world
> safe for procsignal_sigusr1_handler.

Yeah; at the least this should cut down on the buildfarm noise we are seeing ATM.

> Assuming nobody objects too much to this basic approach, should I
> back-patch the parts of this that apply pre-9.4?

Yes, I think so. We have seen some reports of irreproducible crashes at process exit, and maybe this explains them.

regards, tom lane
On Fri, Jan 31, 2014 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This looks good to me in principle. A couple minor beefs:
>
> * The addition to CleanupProcSignalState could use a comment,
> similar to the one you added in ProcKill.

OK.

> * I think the code in ProcKill and AuxiliaryProcKill might be more
> readable if the new local variable was named "myproc" (lower case).

grep indicates that naming is less common than what I picked, so I chose to stick with what I picked.

>> and we can easily add a NULL guard to the SetLatch() call in
>> procsignal_sigusr1_handler, which the attached patch also does.
>
> Um ... no such change actually visible in patch, but it's clearly
> necessary.

Fixed.

>> This might not be a complete fix to every problem of this type that
>> exists anywhere in our code, but I think it's enough to make the world
>> safe for procsignal_sigusr1_handler.
>
> Yeah; at the least this should cut down on the buildfarm noise we
> are seeing ATM.
>
>> Assuming nobody objects too much to this basic approach, should I
>> back-patch the parts of this that apply pre-9.4?
>
> Yes, I think so. We have seen some reports of irreproducible crashes
> at process exit, and maybe this explains them.

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em
>> all and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> [ but here's how to fix procsignal_sigusr1_handler ]

I've trawled all the remaining signal handlers (or at least everything declared with SIGNAL_ARGS, which hopefully is all of them). I find the following bugs:

1. A couple of signal handlers do

    if (MyWalSnd)
        SetLatch(&MyWalSnd->latch);

which is fine as far as it goes, but the shutdown sequence in WalSndKill has exactly the same bug you just fixed in ProcKill: it needs to clear MyWalSnd before disowning the latch, not after.

2. WalRcvSigUsr1Handler and worker_spi_sighup fail to preserve errno.

Will fix, but the latter bug is a tad disappointing considering that the coding rule about saving errno in signal handlers has been there for a *long* time.

regards, tom lane
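Neither of these fixes is shown in the thread itself, but both follow the patterns already established above. A hedged sketch of what they might look like, reconstructed from Tom's description rather than taken from the committed code (the _sketch names are placeholders):

    #include "postgres.h"

    #include <errno.h>

    #include "replication/walsender_private.h"   /* WalSnd, MyWalSnd */
    #include "storage/latch.h"

    /*
     * Bug 1: WalSndKill should clear MyWalSnd before disowning the latch,
     * so a handler doing "if (MyWalSnd) SetLatch(&MyWalSnd->latch)" can't
     * catch the slot halfway through being released.
     */
    static void
    WalSndKill_sketch(int code, Datum arg)
    {
        WalSnd     *walsnd = MyWalSnd;

        MyWalSnd = NULL;               /* clear first ... */
        DisownLatch(&walsnd->latch);   /* ... then release the latch */
        walsnd->pid = 0;               /* mark the slot as free */
    }

    /*
     * Bug 2: every handler must save and restore errno, since anything it
     * calls may clobber errno while the interrupted code still needs it.
     */
    static void
    walrcv_sigusr1_handler_sketch(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        latch_sigusr1_handler();

        errno = save_errno;
    }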