Thread: BUG #1473: Backend bus error, possibly due to ANALYZE
The following bug has been logged online:

Bug reference:      1473
Logged by:          Brian B.
Email address:      brian-pgsql@bbdab.org
PostgreSQL version: 8.0.1
Operating system:   FreeBSD 4.11-STABLE
Description:        Backend bus error, possibly due to ANALYZE
Details:

Hello,

I am loading some spam/ham data/tokens, to be used for the dspam anti-spam
software, into PostgreSQL. After a few hours of inserting and updating the
existing data, the backend crashes with a signal 10 (bus error). I am also
running an ANALYZE command to help keep the indexes up to date every 10
minutes. I have rebuilt PostgreSQL with debugging enabled and have a
resulting core dump. I am not sure who to send this to, however. The
backtrace using the core file is as follows:

#0  0x80c4b52 in transformExpr (pstate=0x86fa630, expr=0x85ec148)
    at parse_expr.c:81
#1  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec1d0)
    at parse_expr.c:270
#2  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec258)
    at parse_expr.c:270
#3  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec2e0)
    at parse_expr.c:270

[... repeat above several thousand times ...]

#8139 0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x86fa530)
    at parse_expr.c:270
#8140 0x80c50c2 in transformExpr (pstate=0x86fa630, expr=0x86fa558)
    at parse_expr.c:257
#8141 0x80c40e3 in transformWhereClause (pstate=0x86fa630, clause=0x86fa558,
    constructName=0x81f8e75 "WHERE") at parse_clause.c:1013
#8142 0x80b178d in transformSelectStmt (pstate=0x86fa630, stmt=0x86fa580)
    at analyze.c:1826
#8143 0x80af817 in transformStmt (pstate=0x86fa630, parseTree=0x86fa580,
    extras_before=0xbfbfe6f0, extras_after=0xbfbfe6f4) at analyze.c:371
#8144 0x80af5ba in do_parse_analyze (parseTree=0x86fa580, pstate=0x86fa630)
    at analyze.c:245
#8145 0x80af4ba in parse_analyze (parseTree=0x86fa580, paramTypes=0x0,
    numParams=0) at analyze.c:169
#8146 0x816fc9e in pg_analyze_and_rewrite (parsetree=0x86fa580,
    paramTypes=0x0, numParams=0) at postgres.c:555
#8147 0x8170061 in exec_simple_query (
    query_string=0x83cd018 "SELECT uid, token, spam_hits, innocent_hits FROM dspam_token_data WHERE uid = '1000' AND token IN ('913609592713134710','6655100721407275609','-7433241509246025544','6179599261633421689','588386968494"...)
    at postgres.c:872
#8148 0x8172713 in PostgresMain (argc=4, argv=0x82e4258,
    username=0x82e4238 "dspam") at postgres.c:3007
#8149 0x814dc49 in BackendRun (port=0x82e2000) at postmaster.c:2816
#8150 0x814d47e in BackendStartup (port=0x82e2000) at postmaster.c:2452
#8151 0x814b9ba in ServerLoop () at postmaster.c:1199
#8152 0x814b31a in PostmasterMain (argc=3, argv=0xbfbff3d0) at postmaster.c:918
#8153 0x811f2b9 in main (argc=3, argv=0xbfbff3d0) at main.c:268
(gdb)

I have compiled PostgreSQL with pthread for pl/python capability.

Thanks,

Brian
On Thu, 2005-02-10 at 02:37 +0000, Brian B. wrote:
> I am loading some spam/ham data/tokens, to be used for the dspam anti-spam
> software, into PostgreSQL. After a few hours of inserting and updating the
> existing data, the backend crashes with a signal 10 (bus error). I am also
> running an ANALYZE command to help keep the indexes up to date every 10
> minutes.

As far as I can tell, ANALYZE is not at issue here.

> I have rebuilt PostgreSQL with debugging enabled and have a
> resulting core dump. I am not sure who to send this to, however. The
> backtrace using the core file is as follows:
>
> #0 0x80c4b52 in transformExpr (pstate=0x86fa630, expr=0x85ec148)
>     at parse_expr.c:81
> #1 0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec1d0)
>     at parse_expr.c:270
> #2 0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec258)
>     at parse_expr.c:270
> #3 0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec2e0)
>     at parse_expr.c:270
>
> [... repeat above several thousand times ...]

It seems what's happening here is that dspam is submitting a query with many thousands of elements in the IN clause. In the parser, we transform "foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then recurse for each element of the OR expression and eventually run out of stack space. (Note that this will actually be worse in HEAD, since a refactoring I applied will mean we consume two stack frames for each expression.)

A workaround would be to increase PostgreSQL's stack size.

Perhaps it would be worth considering representing IN lists as a distinct expression type, at least in the parser. Then the transformExpr() code would look like:

    foreach (element of IN list)
        transformExpr(element);
    ... do whatever else ...

so we wouldn't need to recurse. We could then transform the new expression type into a list of OR clauses at this point.

-Neil
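To make the shape Neil describes concrete, here is a rough C illustration of the iterative transform. This is purely a hypothetical sketch, not code from the PostgreSQL tree: "InListExpr" and "build_equality_clauses" are invented names standing in for a dedicated parse-node type and a helper that expands the transformed elements into OR'ed equality clauses.

    /*
     * Hypothetical sketch only: InListExpr and build_equality_clauses are
     * invented names.  A dedicated parse node holding the raw IN-list lets
     * the parser loop over the elements instead of recursing once per
     * element, so the parser's stack depth stays constant regardless of
     * how long the list is.
     */
    static Node *
    transformInListExpr(ParseState *pstate, InListExpr *in)
    {
        List       *newargs = NIL;
        ListCell   *lc;

        /* transform each element iteratively */
        foreach(lc, in->elements)
            newargs = lappend(newargs,
                              transformExpr(pstate, (Node *) lfirst(lc)));

        /* only now expand into the equivalent chain of OR'ed comparisons */
        return (Node *) makeBoolExpr(OR_EXPR,
                                     build_equality_clauses(pstate, in->lhs,
                                                            newargs));
    }

As Tom notes below, this only postpones the problem unless later phases (rewriter, planner, executor) also avoid recursing over the expanded OR list.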
> It seems what's happening here is that dspam is submitting a
> query with many thousands of elements in the IN clause. In
> the parser, we transform "foo IN (a, b, c)" into "foo = a OR
> foo = b OR foo = c", and then recurse for each element of the
> OR expression and eventually run out of stack space. (Note
> that this will actually be worse in HEAD, since a refactoring
> I applied will mean we consume two stack frames for each
> expression.)
>
> A workaround would be to increase PostgreSQL's stack size.
>
> Perhaps it would be worth considering representing IN lists
> as a distinct expression type, at least in the parser. Then the
> transformExpr() code would look like:

Just like I showed earlier, large IN () lists are useless. Instead I use the UNNEST function I posted earlier (see http://archives.postgresql.org/pgsql-hackers/2004-11/msg00327.php), like so:

    Select id from table
    inner join unnest(array[1,2,3,4,...]) as d(id) using(id);

Not only does it not crash the backend, it also proved to be faster, though admittedly not by much.

...

John
"Brian B." <brian-pgsql@bbdab.org> writes: > I am loading some spam/ham data/tokens, to be used for the dspam anti-spam > software, into PostgreSQL. After a few hours of inserting and updating the > existing data, the backend crashes with a signal 10 (bus error). Judging by the trace, this doesn't depend on your data at all, just on the specific query and the declarations of the tables it references. Please see if you can extract a self-contained test case that way. regards, tom lane
Neil Conway <neilc@samurai.com> writes:
> It seems what's happening here is that dspam is submitting a query with
> many thousands of elements in the IN clause. In the parser, we transform
> "foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then
> recurse for each element of the OR expression and eventually run out of
> stack space.

There is a check_stack_depth call in there, so this could only be the explanation if max_stack_depth is set too high for the actual stack depth limit. What's the platform, and what ulimit values is the postmaster started under?

> Perhaps it would be worth considering representing IN lists as a
> distinct expression type, at least in the parser. Then the
> transformExpr() code would look like:
>     foreach (element of IN list)
>         transformExpr(element);
>     ... do whatever else ...
> so we wouldn't need to recurse. We could then transform the new
> expression type into a list of OR clauses at this point.

Waste of time unless you were to propagate this representation all the way through; as described above you'd merely be postponing the stack depth problem to a later phase.

			regards, tom lane
Hello,

On Thu, Feb 10, 2005 at 10:08:14AM -0500, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > It seems what's happening here is that dspam is submitting a query with
> > many thousands of elements in the IN clause. In the parser, we transform
> > "foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then
> > recurse for each element of the OR expression and eventually run out of
> > stack space.
>
> There is a check_stack_depth call in there, so this could only be the
> explanation if max_stack_depth is set too high for the actual
> stack depth limit. What's the platform, and what ulimit values is the
> postmaster started under?

FreeBSD 4.11 on x86 using PostgreSQL 8.0.1

% limits
Resource limits (current):
  cputime           infinity secs
  filesize            512000 kb
  datasize            524288 kb
  stacksize            65536 kb
  coredumpsize        307200 kb
  memoryuse-cur       458752 kb
  memorylocked-cur    458752 kb
  maxprocesses           512
  openfiles             7351
  sbsize            infinity bytes
  vmemoryuse        infinity kb

max_stack_depth is at the default commented value of 2048.

I can bump this value up to test it if desired; it just takes a while to get to the point of the backend crash scenario. I feed about 2300 good pieces of email using the dspam corpus feed program. With around 200 email messages left to feed, the backend will want to crash around that point. I will try to narrow down what query is being run at that point. Perhaps I can extract it from the core dump.

Thanks,

Brian B.
"Brian B." <brian-pgsql@bbdab.org> writes: > On Thu, Feb 10, 2005 at 10:08:14AM -0500, Tom Lane wrote: >> There is a check_stack_depth call in there, so this could only be the >> explanation if max_stack_depth is set too high for the actual >> stack depth limit. What's the platform, and what ulimit values is the >> postmaster started under? > FreeBSD 4.11 on x86 using PostgreSQL 8.0.1 > % limits > Resource limits (current): > stacksize 65536 kb Hmm, 64meg should certainly be far past where check_stack_depth will start to complain, so there's something else going on here. (BUT: the ulimits you see interactively aren't necessarily what the postmaster was started under. You might want to confirm that the results are the same if you put the limits call into the boot script that launches the postmaster...) > the backend will want to crash around that point. I will try > to narrow down what query is being ran at that point. Perhaps > I can extract it from the core dump. See debug_query_string. regards, tom lane
Tom Lane wrote:
> Hmm, 64meg should certainly be far past where check_stack_depth will
> start to complain, so there's something else going on here.

Right; if those really are the stack size limits for the crashing backend, I guess my initial analysis must have been mistaken. But I'm mystified as to what else the problem could possibly be...

-Neil
On Thu, Feb 10, 2005 at 05:24:12PM -0500, Tom Lane wrote:
> Hmm, 64meg should certainly be far past where check_stack_depth will
> start to complain, so there's something else going on here. (BUT:
> the ulimits you see interactively aren't necessarily what the postmaster
> was started under. You might want to confirm that the results are the
> same if you put the limits call into the boot script that launches
> the postmaster...)

I put limits in the startup script for PostgreSQL and came up with the same 64MB stacksize value.

> See debug_query_string.

I've got the gdb output of printing debug_query_string and the text saved from script(1) is about 267KB in size. Should I send this to you off-list?

Thanks,

Brian B.
"Brian B." <brian-pgsql@bbdab.org> writes: >> See debug_query_string. > I've got the gdb output of printing debug_query_string and the text saved > from script(1) is about 267KB in size. Should I send this to you off-list? Please. Also I'll need the schemas for the tables the query uses. regards, tom lane
On Fri, Feb 11, 2005 at 04:56:16PM -0500, Tom Lane wrote:
> regression=# create function infinite_recurse() returns int as '
> regression'# select infinite_recurse()' language sql;
> regression=# \set VERBOSITY terse
> regression=# select infinite_recurse();
> ERROR: stack depth limit exceeded
>
> and see if you get the proper error or a core dump?

This makes the backend core dump in about 5-10 seconds. It looks like your analysis is correct. Excellent work.

I am not sure if it's my particular FreeBSD installation that is screwing up or if it's FreeBSD in general. Did you happen to test with FreeBSD on one of your test machines?

> If it dumps core, then the thing to look at is check_stack_depth() in
> src/backend/tcop/postgres.c. Maybe your compiler is bogusly optimizing
> the address arithmetic there?

It is possible. Using gcc 2.95.4. I will eliminate all the optimization options when I recompile PostgreSQL and see what happens.

Thanks,

Brian B.
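For readers unfamiliar with the mechanism Tom refers to: the check works by comparing the address of a local variable against a stack base pointer recorded at process start. The following is only a rough, free-standing sketch of that address-arithmetic idea, not the actual check_stack_depth() code from src/backend/tcop/postgres.c; the names and the 2 MB limit are assumptions chosen to mirror the max_stack_depth default mentioned earlier in the thread.

    /*
     * Rough sketch of the address-arithmetic idea, not the actual backend
     * code.  Record the stack base early in the process, then estimate the
     * current stack depth by subtracting the address of a local variable.
     * A compiler that "optimizes away" this pointer arithmetic would defeat
     * the check, which is the failure mode Tom is asking about.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static char *stack_base_ptr;                        /* set once in main() */
    static long  max_stack_depth_bytes = 2048 * 1024;   /* cf. max_stack_depth */

    static void
    check_stack_depth_sketch(void)
    {
        char    stack_top_loc;                  /* any local variable will do */
        long    depth = stack_base_ptr - &stack_top_loc;

        if (depth < 0)
            depth = -depth;                     /* stack may grow either way */

        if (depth > max_stack_depth_bytes)
        {
            fprintf(stderr, "stack depth limit exceeded\n");
            exit(1);
        }
    }

    static long
    recurse(int n)
    {
        char    pad[1024];                      /* consume some stack per call */

        pad[0] = (char) n;
        check_stack_depth_sketch();
        return recurse(n + 1) + pad[0];         /* not a tail call */
    }

    int
    main(void)
    {
        char    base;

        stack_base_ptr = &base;
        return (int) recurse(0);                /* terminates via the check */
    }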
"Brian B." <brian-pgsql@bbdab.org> writes: > Did you happen to test with FreeBSD on one of your test machines? No, I don't have any BSD machines here. However, I've added this test case to the regression tests, so in a few hours we'll have a spectrum of results from the PG build farm. regards, tom lane
>> Did you happen to test with FreeBSD on one of your test machines?

> No, I don't have any BSD machines here. However, I've added this test
> case to the regression tests, so in a few hours we'll have a spectrum
> of results from the PG build farm.

FWIW, I see a "pass" from buildfarm member cockatoo, which claims to be FreeBSD 4.10-STABLE gcc 2.95.4 x86 ... that's at least pretty close to your setup, no?

http://www.pgbuildfarm.org/cgi-bin/show_status.pl

			regards, tom lane
On Fri, Feb 11, 2005 at 06:46:55PM -0500, Tom Lane wrote:
> >> Did you happen to test with FreeBSD on one of your test machines?
>
> > No, I don't have any BSD machines here. However, I've added this test
> > case to the regression tests, so in a few hours we'll have a spectrum
> > of results from the PG build farm.
>
> FWIW, I see a "pass" from buildfarm member cockatoo, which claims to be
> FreeBSD 4.10-STABLE gcc 2.95.4 x86 ... that's at least pretty close to
> your setup, no?

Yup ... that's really close. I'm going to do a few tests to try to figure out what's going on with my environment and let you know if I come up with anything odd.

Thanks for your time!

Brian B.
On Fri, Feb 11, 2005 at 06:54:31PM -0500, I wrote:
> Yup ... that's really close. I'm going to do a few tests to try to figure
> out what's going on with my environment and let you know if I come up with
> anything odd.

I think I have figured out the culprit.

I use the FreeBSD PostgreSQL port and I set the option to use threads so that pl/python will work with PostgreSQL. If I unset this option and rebuild the port, I receive the normal error message but no crash.

Just to make sure it was not the port's fault, I built the source by hand without any port-specific patches. The same behavior occurs.

So I am not sure if, again, it's my libc_r or if something gets messed up in the PostgreSQL code dealing with the stack/recursion when I have pthread libs linked in. I am guessing it is just my particular environment, though.

Thanks,

Brian B.
"Brian B." <brian-pgsql@bbdab.org> writes: > I use the FreeBSD PostgreSQL port and I set the option to use threads so > that pl/python will work with PostgreSQL. What option is that, exactly? It's entirely possible that something has decided that the backend is going to be multithreaded and is only giving the "main" thread a 1MB-or-so stack. If so, I would regard this as a build error. We do not want threading libraries linked into the backend. regards, tom lane
On Fri, Feb 11, 2005 at 09:46:50PM -0500, Tom Lane wrote:
> "Brian B." <brian-pgsql@bbdab.org> writes:
> > I use the FreeBSD PostgreSQL port and I set the option to use threads so
> > that pl/python will work with PostgreSQL.
>
> What option is that, exactly?
>
> It's entirely possible that something has decided that the backend is
> going to be multithreaded and is only giving the "main" thread a
> 1MB-or-so stack. If so, I would regard this as a build error. We do
> not want threading libraries linked into the backend.

Apologies, due to message revising, I forgot to include the explanation of what this port setting entails.

When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to -D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python procedural handler not being able to link with PostgreSQL until PostgreSQL is built pthread-aware. I could be wrong on all of this, but it seems to work as such. Unfortunately, this has some unforeseen broken behavior for PostgreSQL that was sorta hard to debug. :) Perhaps there is another way around it.

I think the situation now is that I should converse with the FreeBSD port developer rather than using up your time.

Thank you,

Brian B.
"Brian B." <brian-pgsql@bbdab.org> writes: > When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to > -D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python > procedural handler not being able to link with PostgreSQL until PostgreSQL > is built pthread-aware. Hmm, is that a FreeBSD-specific restriction? I've not had any such trouble on Linux or Mac OS X. regards, tom lane
On Fri, Feb 11, 2005 at 10:05:36PM -0500, Tom Lane wrote:
> "Brian B." <brian-pgsql@bbdab.org> writes:
> > When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to
> > -D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python
> > procedural handler not being able to link with PostgreSQL until PostgreSQL
> > is built pthread-aware.
>
> Hmm, is that a FreeBSD-specific restriction? I've not had any such
> trouble on Linux or Mac OS X.

I was citing that behavior from memory. It actually builds/installs OK until one decides to add the handler to a database:

$ createlang plpythonu
createlang: language installation failed: ERROR: could not load library
"/usr/local/lib/postgresql/plpython.so": dlopen
'/usr/local/lib/postgresql/plpython.so' failed.
(/usr/local/lib/python2.4/config/libpython2.4.so: Undefined symbol
"pthread_attr_destroy")

If I build the backend with the pthread stuff, this step succeeds.

Thanks,

Brian B.
BTW, something that would be interesting is to figure out what the thread stack size actually is (I assume this is available in the FreeBSD docs) and experiment to find what is the maximum value max_stack_depth can be set to without letting infinite_recurse() dump core.

			regards, tom lane
On Fri, Feb 11, 2005 at 10:33:00PM -0500, Tom Lane wrote:
> BTW, something that would be interesting is to figure out what the
> thread stack size actually is (I assume this is available in the FreeBSD
> docs) and experiment to find what is the maximum value max_stack_depth
> can be set to without letting infinite_recurse() dump core.

You may be onto something there. After doing some searching, I have found that the FreeBSD thread stack size defaults to 64KB!

After Googling for things like "freebsd thread stack size set", it seems other projects are running into this situation with FreeBSD as well, along with FreeBSD mailing list chatter about the philosophy behind proper stack usage and whether to just match Linux's settings for this. Some notable topics on the matter:

(Question about our default pthread stack size)
http://lists.freebsd.org/pipermail/freebsd-threads/2004-November/002699.html

([PATCH] Dynamic thread stack size)
http://lists.freebsd.org/pipermail/freebsd-threads/2005-January/002793.html

Search results showed several other projects having this issue. Some try to work around the problem by calling pthread_attr_setstacksize and using a value anywhere from a default doubling (128KB) up to around 1MB.

Thanks,

Brian B.
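For reference, the workaround those projects use looks roughly like the following. This is a generic illustration of the standard pthreads API, not code from any of the projects mentioned, and the 1 MB figure is only an example value.

    /*
     * Generic illustration of the pthread_attr_setstacksize workaround:
     * a threaded program requests a larger per-thread stack instead of
     * relying on FreeBSD's small default.  The 1 MB value is an example.
     */
    #include <pthread.h>
    #include <stdio.h>

    static void *
    worker(void *arg)
    {
        /* ... work that may need a deep stack ... */
        return NULL;
    }

    int
    main(void)
    {
        pthread_attr_t attr;
        pthread_t      tid;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 1024 * 1024);   /* request 1 MB */

        if (pthread_create(&tid, &attr, worker, NULL) != 0)
        {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }

Note, however, that this only helps code that creates its own threads; it does nothing for the initial ("main") thread stack that a non-threaded program like the PostgreSQL backend runs on.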
"Brian B." <brian-pgsql@bbdab.org> writes: > You may be onto something, there. After doing some searching, I have > found the FreeBSD thread stack size as a default of 64KB! Ugh :-(. That might be reasonable for a program that's actually using multiple threads, but a program that is not thread-aware at all shouldn't be forced into that model IMHO. As of now we are seeing one similar failure in the PG build farm, member osprey: http://www.pgbuildfarm.org/cgi-bin/show_status.pl It would seem that NetBSD 2.0 also has an unreasonably small default stack size. Can anyone check on what NetBSD is using? regards, tom lane
I haven't been following this thread closely, but if I understand correctly then the problems are the result of linking PostgreSQL against libc_r instead of libc on FreeBSD (at least FreeBSD 4.x), which was done in an attempt to make plpythonu work. Otherwise attempting to createlang plpythonu results in the following error:

createlang: language installation failed: ERROR: could not load library
"/usr/local/pgsql80/lib/plpython.so": dlopen
'/usr/local/pgsql80/lib/plpython.so' failed.
(/usr/local/lib/python2.4/config/libpython2.4.so: Undefined symbol
"pthread_attr_destroy")

Is that right?

What I'm about to describe is a hack and it's probably wrong and dangerous, but I did it as an experiment anyway to see what would happen. I'm in no way suggesting that anybody should try it except on a test server that can tolerate failure and data corruption.

I created a .so file with stub versions of the missing functions and used preload_libraries to load it into the PostgreSQL backend. I built PostgreSQL "normally", i.e., without linking against libc_r. It worked, at least in simple tests.

Pthreads functions generally return 0 on success or some errno value on failure. Most functions have EINVAL as a documented return value, so I wrote all of the stub versions to return EINVAL -- I figured that they should report failure instead of success because they don't actually do anything. I also used ereport() to log the fact that a function was called.

CREATE FUNCTION foo(integer) RETURNS integer AS $$
return args[0]
$$ LANGUAGE plpythonu IMMUTABLE STRICT;

SELECT foo(1234);
NOTICE: sem_init() called
NOTICE: sem_wait() called
NOTICE: sem_post() called
NOTICE: pthread_self() called
...
 foo
------
 1234
(1 row)

The stub functions are called only when the language handler is first loaded -- subsequent calls to plpythonu functions don't print any of the notices, at least not that I've seen so far:

SELECT foo(5678);
 foo
------
 5678
(1 row)

It's interesting that although all of the stub functions report failure, the code runs anyway. It makes one wonder how thorough the error checking is.

My pthread_phony.so file contains stub versions of the following functions; all were required to stop the linker from complaining:

pthread_attr_destroy
pthread_attr_init
pthread_attr_setstacksize
pthread_create
pthread_detach
pthread_self
sem_destroy
sem_init
sem_post
sem_trywait
sem_wait

Again, this was nothing more than an experiment, and so far I've done only a few simple tests. It could very well cause the system to crash and burn. Don't try it unless you can afford to have your database trashed.

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/
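As an illustration of what one of those stubs might look like, here is a hypothetical sketch based on Michael's description above; it is not his actual pthread_phony source, which was not posted. It logs the call with ereport() and reports failure with EINVAL, exactly as described.

    /*
     * Hypothetical sketch of one stub, following the description above:
     * log the call via ereport() so it shows up as a NOTICE, and report
     * failure with EINVAL.  Not the actual pthread_phony code.
     */
    #include "postgres.h"

    #include <errno.h>
    #include <pthread.h>

    int
    pthread_attr_destroy(pthread_attr_t *attr)
    {
        ereport(NOTICE, (errmsg("pthread_attr_destroy() called")));
        return EINVAL;
    }

The other stubs in the list would follow the same pattern, each with its own name in the log message.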