Thread: BUG #6086: Segmentation fault
The following bug has been logged online:

Bug reference:      6086
Logged by:          Dennis
Email address:      dennis.noordsij@helsinki.fi
PostgreSQL version: 9.0.4
Operating system:   FreeBSD 8.2 (64 bit)
Description:        Segmentation fault

Details:

For some reason a 9.0.4 server which has always run completely stable, and without any recent changes, crashed ("WARNING: terminating connection because of crash of another server process" ..).

Since it is not a debug build the backtrace is not that useful, but hopefully it may give some idea of where to look, or ring a bell with someone.

#0  0x0000000806a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
#1  0x0000000800c1d965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5
#2  0x0000000800c6c67d in xmlDictFree () from /usr/local/lib/libxml2.so.5
#3  0x0000000800bb1ac5 in xmlFreeParserCtxt () from /usr/local/lib/libxml2.so.5
#4  0x0000000800bc3f19 in xmlParseBalancedChunkMemoryRecover () from /usr/local/lib/libxml2.so.5
#5  0x0000000800bc4090 in xmlParseBalancedChunkMemory () from /usr/local/lib/libxml2.so.5
#6  0x000000000068c6a0 in cursor_to_xml ()
#7  0x000000000068c7f7 in xml_in ()
#8  0x00000000006a3d0a in InputFunctionCall ()
#9  0x00000000006a494a in OidInputFunctionCall ()
#10 0x00000000004db6f7 in coerce_type ()
#11 0x00000000004dbcd7 in coerce_to_target_type ()
#12 0x00000000004dbd8a in coerce_to_specific_type ()
#13 0x00000000004de092 in transformWithClause ()
#14 0x00000000004de84b in transformExpr ()
#15 0x00000000004ddfad in transformWithClause ()
#16 0x00000000004de84b in transformExpr ()
#17 0x00000000004ddfad in transformWithClause ()
#18 0x00000000004de84b in transformExpr ()
#19 0x00000000004e9cb7 in transformTargetEntry ()
#20 0x00000000004ea459 in transformTargetList ()
#21 0x00000000004bff84 in transformStmt ()
#22 0x00000000005f0624 in pg_analyze_and_rewrite_params ()
#23 0x000000000055d63f in SPI_cursor_open_with_paramlist ()
#24 0x000000000055e2da in SPI_prepare_params ()
#25 0x0000000805f0d31c in exec_get_datum_type () from /usr/local/lib/postgresql/plpgsql.so
#26 0x0000000805f0df0f in exec_get_datum_type () from /usr/local/lib/postgresql/plpgsql.so
#27 0x0000000805f0f8b3 in exec_get_datum_type () from /usr/local/lib/postgresql/plpgsql.so
#28 0x0000000805f117ed in exec_get_datum_type () from /usr/local/lib/postgresql/plpgsql.so
#29 0x0000000805f1109b in exec_get_datum_type () from /usr/local/lib/postgresql/plpgsql.so
#30 0x0000000805f136f2 in plpgsql_exec_function () from /usr/local/lib/postgresql/plpgsql.so
#31 0x0000000805f093a3 in plpgsql_call_handler () from /usr/local/lib/postgresql/plpgsql.so
#32 0x0000000000544f00 in GetAttributeByNum ()
#33 0x0000000000541dfa in ExecProject ()
#34 0x0000000000556293 in ExecResult ()
#35 0x000000000054196d in ExecProcNode ()
#36 0x00000000005406bf in standard_ExecutorRun ()
#37 0x00000000005f2d98 in PostgresMain ()
#38 0x00000000005f4281 in PortalRun ()
#39 0x00000000005f0ebb in pg_parse_query ()
#40 0x00000000005f18c7 in PostgresMain ()
#41 0x00000000005c1a9d in ClosePostmasterPorts ()
#42 0x00000000005c2767 in PostmasterMain ()
#43 0x000000000056cc7e in main ()

Most of what the server usually does is, as can be seen in this backtrace, run plpgsql procedures which return XML data (so I'm hesitant to blame bad hardware at this point). It happened during a normal query that is run very often, everything via psycopg2.
The server uses WAL log shipping (the crash did not coincide with running an archive command, and no base backup was in progress).

I will switch to a debug build, but I haven't seen this crash before, so it might never produce a useful backtrace.

Thanks for any suggestions.
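For illustration, the debug build mentioned above could be produced roughly like this; a sketch assuming a plain source build (a FreeBSD ports build would set the equivalent options through the port):

    # rebuild with debug symbols and without optimization so that
    # backtraces show full symbol and line information
    ./configure --prefix=/usr/local/pgsql --enable-debug CFLAGS="-O0 -g"
    gmake && gmake install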
On Thu, Jun 30, 2011 at 3:32 AM, Dennis <dennis.noordsij@helsinki.fi> wrote:
> For some reason a 9.0.4 server which has always run completely stable, and
> without any recent changes, crashed ("WARNING:  terminating connection
> because of crash of another server process" ..)
>
> Since it is not a debug build the backtrace is not that useful, but
> hopefully may give some idea of where to look, or ring a bell with someone.
>
> #0  0x0000000806a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
> #1  0x0000000800c1d965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5
> #2  0x0000000800c6c67d in xmlDictFree () from /usr/local/lib/libxml2.so.5
> #3  0x0000000800bb1ac5 in xmlFreeParserCtxt () from /usr/local/lib/libxml2.so.5
> #4  0x0000000800bc3f19 in xmlParseBalancedChunkMemoryRecover () from /usr/local/lib/libxml2.so.5
> #5  0x0000000800bc4090 in xmlParseBalancedChunkMemory () from /usr/local/lib/libxml2.so.5

Any idea what query triggered this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Any idea what query triggered this?

Only up to which stored procedure (which itself contains multiple statements).

However, in a fresh FreeBSD environment (under VirtualBox) I can trigger a similar (my guess is it is the same issue) segmentation fault reliably.

The application is a Python web app, using psycopg2. It creates N worker threads, where each worker thread creates its own connection to PostgreSQL. The threads then block on a global queue from which they read and process requests. That is, connections are not shared between threads, and all queries that are part of servicing a request are inside a single transaction bound to a single connection. Connections are re-used. (Please assume this is implemented correctly on the client; a faulty client should not crash a server.)

If N=1, everything is OK. If N=3 (confirmed, but I suppose anything > 1), then the following happens for one particular test case every single time:

gdb `which postgres` /usr/local/pgsql/data/postgres.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `postgres'.
Program terminated with signal 10, Bus error.
Reading symbols from /usr/local/lib/libintl.so.9...done.
Loaded symbols for /usr/local/lib/libintl.so.9
Reading symbols from /usr/local/lib/libxml2.so.5...done.
Loaded symbols for /usr/local/lib/libxml2.so.5
Reading symbols from /usr/lib/libssl.so.6...done.
Loaded symbols for /usr/lib/libssl.so.6
Reading symbols from /lib/libcrypto.so.6...done.
Loaded symbols for /lib/libcrypto.so.6
Reading symbols from /lib/libm.so.5...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/local/lib/libiconv.so.3...done.
Loaded symbols for /usr/local/lib/libiconv.so.3
Reading symbols from /lib/libz.so.5...done.
Loaded symbols for /lib/libz.so.5
Reading symbols from /usr/local/lib/postgresql/plpgsql.so...done.
Loaded symbols for /usr/local/lib/postgresql/plpgsql.so
Reading symbols from /usr/local/lib/postgresql/citext.so...done.
Loaded symbols for /usr/local/lib/postgresql/citext.so
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/postgresql/pg_trgm.so...done.
Loaded symbols for /usr/local/lib/postgresql/pg_trgm.so
Reading symbols from /usr/local/lib/postgresql/dict_snowball.so...done.
Loaded symbols for /usr/local/lib/postgresql/dict_snowball.so
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000802a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
[New Thread 801a42540 (LWP 100325)]
(gdb) bt
#0  0x0000000802a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
#1  0x0000000800d619ae in xmlRMutexLock (tok=0x801a12360) at threads.c:369
#2  0x0000000800dd6864 in xmlDictReference (dict=0x801a05760) at dict.c:510
#3  0x0000000800dd98e2 in xmlSAX2StartDocument (ctx=0x801a6c600) at SAX2.c:999
#4  0x0000000800cdf992 in xmlParseDocument (ctxt=0x801a6c600) at parser.c:10285
#5  0x0000000800ce8bc7 in xmlDoRead (ctxt=0x801a6c600, URL=0x0, encoding=0x0, options=0, reuse=1) at parser.c:14612
#6  0x0000000800ce915d in xmlCtxtReadMemory (ctxt=0x801a6c600, buffer=0x801b55440 "<anonymized xml"..., size=272, URL=0x0, encoding=0x0, options=0) at parser.c:14890
#7  0x000000000076fcd2 in xpath (fcinfo=0x7fffffffd310) at xml.c:3400
#8  0x000000000059e7bb in ExecMakeFunctionResult (fcache=0x80293fee0, econtext=0x802833140, isNull=0x7fffffffd84b "", isDone=0x0) at execQual.c:1827
#9  0x000000000059f183 in ExecEvalFunc (fcache=0x80293fee0, econtext=0x802833140, isNull=0x7fffffffd84b "", isDone=0x0) at execQual.c:2263
#10 0x00000008025c7514 in exec_eval_simple_expr (estate=0x7fffffffdb20, expr=0x8028df8e0, result=0x7fffffffd7e8, isNull=0x7fffffffd84b "", rettype=0x7fffffffd84c) at pl_exec.c:4597
#11 0x00000008025c6c8c in exec_eval_expr (estate=0x7fffffffdb20, expr=0x8028df8e0, isNull=0x7fffffffd84b "", rettype=0x7fffffffd84c) at pl_exec.c:4188
#12 0x00000008025c3755 in exec_stmt_raise (estate=0x7fffffffdb20, stmt=0x8028df800) at pl_exec.c:2485
#13 0x00000008025c18cb in exec_stmt (estate=0x7fffffffdb20, stmt=0x8028df800) at pl_exec.c:1326
#14 0x00000008025c168a in exec_stmts (estate=0x7fffffffdb20, stmts=0x801b5b0e8) at pl_exec.c:1233
#15 0x00000008025c14d5 in exec_stmt_block (estate=0x7fffffffdb20, block=0x8028e99a0) at pl_exec.c:1170
#16 0x00000008025bfa85 in plpgsql_exec_function (func=0x801bd8440, fcinfo=0x7fffffffdda0) at pl_exec.c:316
#17 0x00000008025ba99e in plpgsql_call_handler (fcinfo=0x7fffffffdda0) at pl_handler.c:122
#18 0x000000000059e7bb in ExecMakeFunctionResult (fcache=0x801bec440, econtext=0x801bec250, isNull=0x801bece98 "", isDone=0x801becfb0) at execQual.c:1827
#19 0x000000000059f183 in ExecEvalFunc (fcache=0x801bec440, econtext=0x801bec250, isNull=0x801bece98 "", isDone=0x801becfb0) at execQual.c:2263
#20 0x00000000005a5347 in ExecTargetList (targetlist=0x801becf80, econtext=0x801bec250, values=0x801bece80, isnull=0x801bece98 "", itemIsDone=0x801becfb0, isDone=0x7fffffffe344) at execQual.c:5089
#21 0x00000000005a5913 in ExecProject (projInfo=0x801beceb0, isDone=0x7fffffffe344) at execQual.c:5304
#22 0x00000000005b9da3 in ExecResult (node=0x801bec140) at nodeResult.c:155
#23 0x000000000059b3bf in ExecProcNode (node=0x801bec140) at execProcnode.c:355
#24 0x00000000005993dd in ExecutePlan (estate=0x801bec030, planstate=0x801bec140, operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x801b3fb30) at execMain.c:1188
#25 0x0000000000598063 in standard_ExecutorRun (queryDesc=0x801bbc830, direction=ForwardScanDirection, count=0) at execMain.c:280
#26 0x0000000000597f55 in ExecutorRun (queryDesc=0x801bbc830, direction=ForwardScanDirection, count=0) at execMain.c:229
#27 0x0000000000697ab8 in PortalRunSelect (portal=0x801a68030, forward=1 '\001', count=0, dest=0x801b3fb30) at pquery.c:952
#28 0x00000000006977ae in PortalRun (portal=0x801a68030, count=9223372036854775807, isTopLevel=1 '\001', dest=0x801b3fb30, altdest=0x801b3fb30, completionTag=0x7fffffffe640 "") at pquery.c:796
#29 0x0000000000691cb1 in exec_simple_query (query_string=0x801a2a030 "select some_proc(2, E'<anonymized xml ..."...) at postgres.c:1058
#30 0x0000000000695c63 in PostgresMain (argc=2, argv=0x801a1a6f8, username=0x801a1a6c0 "pgsql") at postgres.c:3936
#31 0x0000000000657a9d in BackendRun (port=0x801a6c300) at postmaster.c:3555
#32 0x00000000006571da in BackendStartup (port=0x801a6c300) at postmaster.c:3242
#33 0x000000000065463c in ServerLoop () at postmaster.c:1431
#34 0x0000000000653e10 in PostmasterMain (argc=3, argv=0x7fffffffebb8) at postmaster.c:1092
#35 0x00000000005d7bca in main (argc=3, argv=0x7fffffffebb8) at main.c:188

Is some XML (parser) context somehow being shared / carried over between PostgreSQL processes?

postmaster does not seem to want to run under valgrind at all (perhaps low memory in the VM? should it work easily?), so I don't have more details about what it thinks goes wrong.

I'll keep looking, but any suggestions are very welcome!

Best,
Dennis
On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
>> Any idea what query triggered this?
>
> Only up to which stored procedure (which itself contains multiple
> statements).

If you provide a reproducible test case, we can probably get this fixed. Otherwise, we probably can't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of lun jul 25 11:20:55 -0400 2011:
> On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
> >> Any idea what query triggered this?
> >
> > Only up to which stored procedure (which itself contains multiple
> > statements).
>
> If you provide a reproducible test case, we can probably get this fixed.
>
> Otherwise, we probably can't.

But why is libxml calling pthread_mutex_lock and such? Unless libxml is making itself multithreaded inside a backend, that shouldn't matter -- much less cause a crash.

Note (to the OP) that it's totally unexpected to have a Postgres backend become multithreaded, and it has been shown to be dangerous when a library made it so.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
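For illustration, one way to check whether a backend has unexpectedly become multithreaded on FreeBSD is procstat(1); a normal backend should show exactly one thread. A sketch, with <backend_pid> standing in for the pid of the connection's postgres process:

    # list the threads of a single backend process
    procstat -t <backend_pid>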
On 07/25/2011 05:57 PM, Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of lun jul 25 11:20:55 -0400 2011:
>> On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
>>>> Any idea what query triggered this?
>>>
>>> Only up to which stored procedure (which itself contains multiple
>>> statements).
>>
>> If you provide a reproducible test case, we can probably get this fixed.
>>
>> Otherwise, we probably can't.
>
> But why is libxml calling pthread_mutex_lock and such? Unless libxml is
> making itself multithreaded inside a backend, that shouldn't matter --
> much less cause a crash.
>
> Note (to the OP) that it's totally unexpected to have a Postgres backend
> become multithreaded, and has been shown to be dangerous when a library
> made it so.

Well, we have seen issues on FreeBSD in the past with libraries there being compiled with threading support, which caused them to get ridiculously low stack limits. Maybe it would be worth trying to recompile the libxml2 port without threading support (if that is an option).

Stefan
After a few hours of watching strange things happening I finally stumbled on the cause.

Very short summary: the postgres binary needs to be linked to libpthread, as this ensures that a special fork() inside libthr (the FreeBSD libpthread implementation/wrapper) is used, which correctly deals with any pthread usage (I am guessing especially statically initialized pthread_X variables in, for example, libxml2).

This can be accomplished by adding "-lpthread" to the LDFLAGS when building either libxml2 or postgresql itself, or by preloading libpthread.so when starting postgresql.

Before advocating any particular solution, I'd like to throw this out first and see how strongly people feel about this.

Cheers, and thanks to everyone who replied so far,
Dennis
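For illustration, the two workarounds described above would look roughly like this; a sketch in which the configure line assumes a plain source build, and /lib/libthr.so.3 is the FreeBSD threads library already visible in the backtraces:

    # option 1: link the server against the threads library at build time
    LDFLAGS="-lpthread" ./configure --prefix=/usr/local/pgsql
    gmake && gmake install

    # option 2: preload the threads library when starting an unmodified server
    env LD_PRELOAD=/lib/libthr.so.3 pg_ctl -D /usr/local/pgsql/data start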
noordsij <noordsij@cs.helsinki.fi> writes:
> After a few hours of watching strange things happening I finally stumbled
> on the cause.

> Very short summary: the postgres binary needs to be linked to libpthread,
> as this will ensure a special fork() inside libthr (the FreeBSD libpthread
> implementation/wrapper) is used which correctly deals with any pthread
> usage (I am guessing especially statically initialized pthread_X variables
> in for example libxml2).

> This can be accomplished by adding "-lpthread" to the LDFLAGS when building
> either libxml2 or postgresql itself, or by preloading libpthread.so when
> starting postgresql.

Linking pthreads into the backend is likely to cause more problems than it solves, especially if you're proposing that we do that everywhere.

			regards, tom lane
> Linking pthreads into the backend is likely to cause more problems than
> it solves, especially if you're proposing that we do that everywhere.
>
> 			regards, tom lane

No, I am not proposing that. Not linking to pthread when not needed should actually be an optimization, since the libc no-op stubs get used instead.

I've found the issue now (yes, again :)

Short version: the FreeBSD ports compile the main server without pthreads and the contrib modules with pthreads. No bug in PostgreSQL. If you're in a hurry, no need to read any further.

Longer version:

With FreeBSD ports, the contrib part is compiled separately. For some reason, the contrib units are being made to link to pthread, but the server itself isn't. So the backend forks to handle a request, some stored procedure touches some XML, and libxml2 thinks it is initializing some mutex by calling the pthread_ stubs in libc. Then the stored procedure updates a column which has a trigram index, PostgreSQL dynamically loads pg_trgm.so, which links to libpthread, and now the weak libc symbols are overridden. A little bit later we end up calling a real pthread_ function, passing it a mutex that was never properly initialized, and get a bus error.

Lots of fun to debug, seeing pthread_self() suddenly change its return value within a single thread/backend in the middle of processing a single request.

With a custom (correct) compilation everything functions correctly with an unmodified libxml2 and PostgreSQL, and without any linking to or preloading of libpthread.

The question is, why are postgresql90-server and postgresql90-contrib compiled differently? It seems it may have been intended as a quick fix for pgbench: the configure run for the -contrib port does not end up with "-pthread" in PTHREAD_CFLAGS in Makefile.global, which breaks the compilation of pgbench, while the server's configure run does have it; I haven't found the reason for the difference yet. The side effect is that all contrib modules link to pthread while the server doesn't.

I will continue this discussion within FreeBSD, as this is not a PostgreSQL bug (though if there is ever a loadable contrib module which does need pthreads, the above situation will occur again).

Thanks again to all who helped!

Cheers,
Dennis
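For illustration, the linkage mismatch described above can be confirmed by comparing what the server binary and a contrib module are dynamically linked against; a sketch in which, per the explanation above, only the contrib module should pull in the threads library on the broken ports build:

    # the server binary: no libthr in the stock ports build
    ldd `which postgres` | grep thr
    # a contrib module loaded into the same backend: links against libthr
    ldd /usr/local/lib/postgresql/pg_trgm.so | grep thr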