Re: orangutan seizes up during isolation-check - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: orangutan seizes up during isolation-check |
Date | |
Msg-id | 20140915045114.GA1332666@tornado.leadboat.com |
In response to | Re: orangutan seizes up during isolation-check (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: orangutan seizes up during isolation-check
Re: orangutan seizes up during isolation-check |
List | pgsql-hackers |
On Tue, Sep 02, 2014 at 12:25:39AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Buildfarm member orangutan has failed chronically on both of the branches for
> > which it still reports, HEAD and REL9_1_STABLE, for over two years. The
> > postmaster appears to jam during isolation-check. Dave, orangutan currently
> > has one such jammed postmaster for each branch. Could you gather some
> > information about the running processes?
>
> What's particularly odd is that orangutan seems to be running an only
> slightly out-of-date OS X release, which is hardly an unusual
> configuration. My own laptop gets through isolation-check just fine.
> Seems like there must be something nonstandard about orangutan's
> software ... but what?

Agreed. The difference is durable across OS X releases, because orangutan showed like symptoms under 10.7.3. Dave assisted me off-list with data collection and experimentation. Ultimately, --enable-nls was the key distinction, the absence of which spares the other OS X buildfarm animals.

The explanation for ECONNREFUSED was more pedestrian than the reasons I had guessed. There were no jammed postmasters running as of the above writing. Rather, the postmasters were gone, but the socket directory entries remained. That happens when the postmaster suffers a "kill -9", a SIGSEGV, an assertion failure, or a similar abrupt exit.

When I reproduced the problem, CountChildren() was attempting to walk a corrupt BackendList. Sometimes, the list had an entry such that e->next == e; these send CountChildren() into an infinite loop. Other times, testing "if (bp->dead_end)" prompted a segfault. That explains orangutan sometimes failing quickly and other times hanging for hours.

Every crash showed at least two threads running in the postmaster. Multiple threads bring trouble in the form of undefined behavior for fork() w/o exec() and for sigprocmask().
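To make the hang concrete: a list walk that ends only on a NULL link never finishes once some entry has e->next == e. The sketch below is purely illustrative. The Node struct and count_children_bounded() are hypothetical stand-ins, not the postmaster's actual data structures or code, and the step bound exists only so that the demonstration terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backend list entry (not PostgreSQL's struct). */
typedef struct Node
{
    struct Node *next;
    bool         dead_end;
} Node;

/*
 * Count entries that are not dead_end, visiting at most max_steps nodes.
 * Returns -1 if the walk has not reached NULL within the bound, which is
 * what happens when a corrupt entry points to itself.  An unbounded walk
 * would spin forever in the same situation.
 */
static int
count_children_bounded(const Node *head, int max_steps)
{
    int count = 0;
    int steps = 0;

    assert(max_steps >= 0);

    for (const Node *e = head; e != NULL; e = e->next)
    {
        if (steps++ >= max_steps)
            return -1;          /* no NULL terminator reached: likely a cycle */
        if (!e->dead_end)
            count++;
    }
    return count;
}
```

With a healthy three-entry list the function returns the count; after setting one entry's next pointer to itself, it returns -1 instead of looping.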
The postmaster uses sigprocmask() to block most signals when doing something nontrivial; this allows it to do such nontrivial work in signal handlers. A sequence of 74 buildfarm runs caught 27 cases of a secondary thread running a signal handler, 14 cases of two signal handlers running at once, and one user-visible postmaster failure.

libintl replaces setlocale(). Its setlocale(LC_x, "") uses OS-specific APIs to determine the default locale when $LANG and similar environment variables are empty, as they are during "make check NO_LOCALE=1". On OS X, it calls[1] CFLocaleCopyCurrent(), which in turn spins up a thread. See the end of this message for the postmaster thread stacks active upon hitting a breakpoint set at _dispatch_mgr_thread.

I see two options for fixing this in pg_perm_setlocale(LC_x, ""):

1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale name through a pipe, and pass that name to setlocale() in the original process. The short-lived child will get the extra threads, and the postmaster will remain clean.

2. On OS X, check for relevant environment variables. Finding none, set LC_x=C before calling setlocale(LC_x, ""). A variation is to raise ereport(FATAL) if sufficient environment variables aren't in place. Either way ensures the libintl setlocale() will never call CFLocaleCopyCurrent(). This is simpler than (1), but it entails a behavior change: "LANG= initdb" will use LANG=C or fail rather than use the OS X user account locale.

I'm skeptical of the value of looking up locale information using other OS X facilities when the usual environment variables are inconclusive, but I see no clear cause to reverse that decision now. I lean toward (1).
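A rough sketch of option (1) follows, assuming plain POSIX fork(), pipe() and waitpid(). The helper name default_locale_via_child() is hypothetical, and this is not a patch against pg_perm_setlocale(); it only shows the shape of the idea, with error handling kept minimal (EINTR is not retried, and an overlong name would be truncated to the buffer size).

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: find out which locale setlocale(category, "") would
 * select, but make that call in a short-lived child so that any threads the
 * lookup starts never exist in the calling process.  The name is written to
 * buf.  Returns 0 on success, -1 on failure.
 */
static int
default_locale_via_child(int category, char *buf, size_t buflen)
{
    int     fds[2];
    pid_t   pid;
    size_t  used = 0;
    ssize_t n = 0;
    int     status;

    assert(buf != NULL && buflen > 1);

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0)
    {
        /* Child: resolve the default locale and report its name. */
        const char *name;
        size_t      len;
        size_t      off = 0;

        close(fds[0]);
        name = setlocale(category, "");
        if (name == NULL)
            _exit(1);
        len = strlen(name);
        while (off < len)
        {
            ssize_t w = write(fds[1], name + off, len - off);

            if (w <= 0)
                _exit(1);
            off += (size_t) w;
        }
        _exit(0);
    }

    /* Parent: read the name until EOF or until the buffer is full. */
    close(fds[1]);
    while (used < buflen - 1 &&
           (n = read(fds[0], buf + used, buflen - 1 - used)) > 0)
        used += (size_t) n;
    buf[used] = '\0';
    close(fds[0]);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0 || used == 0)
        return -1;
    return 0;
}
```

The original process would then call setlocale(category, buf) with the explicit name, so the libintl code path that consults CFLocaleCopyCurrent() runs only in the child.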
Thanks,
nm

[1] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/localename.c;h=78dc344bba191417855670fb751210d3608db6e6;hb=HEAD#l2883

  thread #1: tid = 0xeccea9, 0x00007fff9066b372 libsystem_notify.dylib`notify_register_check + 30, queue = 'com.apple.main-thread'
    frame #0: 0x00007fff9066b372 libsystem_notify.dylib`notify_register_check + 30
    frame #1: 0x00007fff987cf261 libsystem_info.dylib`__si_module_static_ds_block_invoke + 109
    frame #2: 0x00007fff944d628d libdispatch.dylib`_dispatch_client_callout + 8
    frame #3: 0x00007fff944d61fc libdispatch.dylib`dispatch_once_f + 79
    frame #4: 0x00007fff987cf1f2 libsystem_info.dylib`si_module_static_ds + 42
    frame #5: 0x00007fff987cec65 libsystem_info.dylib`si_module_with_name + 60
    frame #6: 0x00007fff987cf0e7 libsystem_info.dylib`si_module_config_modules_for_category + 168
    frame #7: 0x00007fff987cedbd libsystem_info.dylib`__si_module_static_search_block_invoke + 87
    frame #8: 0x00007fff944d628d libdispatch.dylib`_dispatch_client_callout + 8
    frame #9: 0x00007fff944d61fc libdispatch.dylib`dispatch_once_f + 79
    frame #10: 0x00007fff987ced64 libsystem_info.dylib`si_module_static_search + 42
    frame #11: 0x00007fff987cec65 libsystem_info.dylib`si_module_with_name + 60
    frame #12: 0x00007fff987d0cf2 libsystem_info.dylib`getpwuid + 32
    frame #13: 0x00007fff8dce629c CoreFoundation`CFCopyHomeDirectoryURLForUser + 124
    frame #14: 0x00007fff8dce5a84 CoreFoundation`+[CFPrefsSource withSourceForIdentifier:user:byHost:container:perform:] + 372
    frame #15: 0x00007fff8dce58fb CoreFoundation`-[CFPrefsSearchListSource addSourceForIdentifier:user:byHost:] + 123
    frame #16: 0x00007fff8dcec1bb CoreFoundation`+[CFPrefsSearchListSource withSnapshotSearchList:] + 331
    frame #17: 0x00007fff8dcec037 CoreFoundation`__CFXPreferencesCopyCurrentApplicationState + 151
    frame #18: 0x00007fff8dcebb8c CoreFoundation`_CFLocaleCopyCurrentGuts + 524
    frame #19: 0x00000001006e6b87 libintl.8.dylib`_nl_locale_name_default + 47
    frame #20: 0x00000001006e707b libintl.8.dylib`libintl_setlocale + 367
    frame #21: 0x00000001002fdb52 postgres`pg_perm_setlocale(category=1, locale=<unavailable>) + 18 at pg_locale.c:153
    frame #22: 0x00000001001a6ef7 postgres`main(argc=1, argv=0x0000000100907340) + 103 at main.c:127

  thread #3: tid = 0xeccec2, 0x00007fff93349e6a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #0: 0x00007fff93349e6a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff8ea13f08 libsystem_pthread.dylib`_pthread_wqthread + 330

* thread #2: tid = 0xeccec3, 0x00007fff944d8102 libdispatch.dylib`_dispatch_mgr_thread, queue = 'com.apple.root.libdispatch-manager', stop reason = breakpoint 1.2
  * frame #0: 0x00007fff944d8102 libdispatch.dylib`_dispatch_mgr_thread
    frame #1: 0x00007fff944d7f87 libdispatch.dylib`_dispatch_root_queue_drain + 75
    frame #2: 0x00007fff944d7ed2 libdispatch.dylib`_dispatch_worker_thread + 119
    frame #3: 0x00007fff8ea12899 libsystem_pthread.dylib`_pthread_body + 138
    frame #4: 0x00007fff8ea1272a libsystem_pthread.dylib`_pthread_start + 137