Thread: BUG #1473: Backend bus error, possibly due to ANALYZE

BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
The following bug has been logged online:

Bug reference:      1473
Logged by:          Brian B.
Email address:      brian-pgsql@bbdab.org
PostgreSQL version: 8.0.1
Operating system:   FreeBSD 4.11-STABLE
Description:        Backend bus error, possibly due to ANALYZE
Details:

Hello,

I am loading some spam/ham data/tokens, to be used for the dspam anti-spam
software, into PostgreSQL. After a few hours of inserting and updating the
existing data, the backend crashes with a signal 10 (bus error). I am also
running an ANALYZE command to help keep the indexes up to date every 10
minutes. I have rebuilt PostgreSQL with debugging enabled and have a
resulting core dump. I am not sure who to send this to, however. The
backtrace using the core file is as follows:

#0  0x80c4b52 in transformExpr (pstate=0x86fa630, expr=0x85ec148)
    at parse_expr.c:81
#1  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec1d0)
    at parse_expr.c:270
#2  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec258)
    at parse_expr.c:270
#3  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec2e0)
    at parse_expr.c:270

[... repeat above several thousand times ...]

#8139 0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x86fa530)
    at parse_expr.c:270
#8140 0x80c50c2 in transformExpr (pstate=0x86fa630, expr=0x86fa558)
    at parse_expr.c:257
#8141 0x80c40e3 in transformWhereClause (pstate=0x86fa630, clause=0x86fa558,

    constructName=0x81f8e75 "WHERE") at parse_clause.c:1013
#8142 0x80b178d in transformSelectStmt (pstate=0x86fa630, stmt=0x86fa580)
    at analyze.c:1826
#8143 0x80af817 in transformStmt (pstate=0x86fa630, parseTree=0x86fa580,
    extras_before=0xbfbfe6f0, extras_after=0xbfbfe6f4) at analyze.c:371
#8144 0x80af5ba in do_parse_analyze (parseTree=0x86fa580, pstate=0x86fa630)
    at analyze.c:245
#8145 0x80af4ba in parse_analyze (parseTree=0x86fa580, paramTypes=0x0,
    numParams=0) at analyze.c:169
#8146 0x816fc9e in pg_analyze_and_rewrite (parsetree=0x86fa580,
    paramTypes=0x0, numParams=0) at postgres.c:555
#8147 0x8170061 in exec_simple_query (
    query_string=0x83cd018 "SELECT uid, token, spam_hits, innocent_hits FROM
dspam_token_data WHERE uid = '1000' AND token IN
('913609592713134710','6655100721407275609','-7433241509246025544','61795992
61633421689','588386968494"...)
    at postgres.c:872
#8148 0x8172713 in PostgresMain (argc=4, argv=0x82e4258,
    username=0x82e4238 "dspam") at postgres.c:3007
#8149 0x814dc49 in BackendRun (port=0x82e2000) at postmaster.c:2816
#8150 0x814d47e in BackendStartup (port=0x82e2000) at postmaster.c:2452
#8151 0x814b9ba in ServerLoop () at postmaster.c:1199
#8152 0x814b31a in PostmasterMain (argc=3, argv=0xbfbff3d0) at
postmaster.c:918
#8153 0x811f2b9 in main (argc=3, argv=0xbfbff3d0) at main.c:268
(gdb)

I have compiled PostgreSQL with pthread for pl/python capability.


Thanks,

Brian

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Neil Conway
Date:
On Thu, 2005-02-10 at 02:37 +0000, Brian B. wrote:
> I am loading some spam/ham data/tokens, to be used for the dspam anti-spam
> software, into PostgreSQL. After a few hours of inserting and updating the
> existing data, the backend crashes with a signal 10 (bus error). I am also
> running an ANALYZE command to help keep the indexes up to date every 10
> minutes.

As far as I can tell, ANALYZE is not at issue here.

> I have rebuilt PostgreSQL with debugging enabled and have a
> resulting core dump. I am not sure who to send this to, however. The
> backtrace using the core file is as follows:
>
> #0  0x80c4b52 in transformExpr (pstate=0x86fa630, expr=0x85ec148)
>     at parse_expr.c:81
> #1  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec1d0)
>     at parse_expr.c:270
> #2  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec258)
>     at parse_expr.c:270
> #3  0x80c5126 in transformExpr (pstate=0x86fa630, expr=0x85ec2e0)
>     at parse_expr.c:270
>
> [... repeat above several thousand times ...]

It seems what's happening here is that dspam is submitting a query with
many thousands of elements in the IN clause. In the parser, we transform
"foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then
recurse for each element of the OR expression and eventually run out of
stack space. (Note that this will actually be worse in HEAD, since a
refactoring I applied will mean we consume two stack frames for each
expression.)

A workaround would be to increase PostgreSQL's stack size.

Perhaps it would be worth considering representing IN lists as a
distinct expression type, at least in the parser. Then the
transformExpr() code would look like:

foreach (element of IN list)
    transformExpr(element);
    ... do whatever else ...

so we wouldn't need to recurse. We could then transform the new
expression type into a list of OR clauses at this point.

-Neil

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"John Hansen"
Date:
> It seems what's happening here is that dspam is submitting a=20
> query with many thousands of elements in the IN clause. In=20
> the parser, we transform "foo IN (a, b, c)" into "foo =3D a OR=20
> foo =3D b OR foo =3D c", and then recurse for each element of the=20
> OR expression and eventually run out of stack space. (Note=20
> that this will actually be worse in HEAD, since a refactoring=20
> I applied will mean we consume two stack frames for each
> expression.)
>=20
> A workaround would be to increase PostgreSQL's stack size.
>=20
> Perhaps it would be worth considering representing IN lists=20
> as a distinct expression type, at least in the parser. Then the
> transformExpr() code would look like:

Just like I showed earlier on large IN () lists are useless....
Instead I use the UNNEST function I posted earlier (see http://archives.pos=
tgresql.org/pgsql-hackers/2004-11/msg00327.php) like so:

Select id from table inner join unnest(array[1,2,3,4,...]) as d(id) using(i=
d);

Not only does it not crash the backend,. But it also proved to be faster, t=
ho admittedly not much.


... John

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> I am loading some spam/ham data/tokens, to be used for the dspam anti-spam
> software, into PostgreSQL. After a few hours of inserting and updating the
> existing data, the backend crashes with a signal 10 (bus error).

Judging by the trace, this doesn't depend on your data at all, just on
the specific query and the declarations of the tables it references.
Please see if you can extract a self-contained test case that way.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> It seems what's happening here is that dspam is submitting a query with
> many thousands of elements in the IN clause. In the parser, we transform
> "foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then
> recurse for each element of the OR expression and eventually run out of
> stack space.

There is a check_stack_depth call in there, so this could only be the
explanation if max_stack_depth is set too high for the actual
stack depth limit.  What's the platform, and what ulimit values is the
postmaster started under?

> Perhaps it would be worth considering representing IN lists as a
> distinct expression type, at least in the parser. Then the
> transformExpr() code would look like:

> foreach (element of IN list)
>     transformExpr(element);
>     ... do whatever else ...

> so we wouldn't need to recurse. We could then transform the new
> expression type into a list of OR clauses at this point.

Waste of time unless you were to propagate this representation all the
way through; as described above you'd merely be postponing the stack
depth problem to a later phase.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
Hello,

On Thu, Feb 10, 2005 at 10:08:14AM -0500, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > It seems what's happening here is that dspam is submitting a query with
> > many thousands of elements in the IN clause. In the parser, we transform
> > "foo IN (a, b, c)" into "foo = a OR foo = b OR foo = c", and then
> > recurse for each element of the OR expression and eventually run out of
> > stack space.
>
> There is a check_stack_depth call in there, so this could only be the
> explanation if max_stack_depth is set too high for the actual
> stack depth limit.  What's the platform, and what ulimit values is the
> postmaster started under?

FreeBSD 4.11 on x86 using PostgreSQL 8.0.1

% limits
Resource limits (current):
  cputime          infinity secs
  filesize           512000 kb
  datasize           524288 kb
  stacksize           65536 kb
  coredumpsize       307200 kb
  memoryuse-cur      458752 kb
  memorylocked-cur   458752 kb
  maxprocesses          512
  openfiles            7351
  sbsize           infinity bytes
  vmemoryuse       infinity kb

max_stack_depth is at the default commented value of 2048

I can bump this value up to test it if desired, it just takes
a while to get to the point of the backend crash scenario.

I feed about 2300 good pieces of email using the dspam corpus
feed program. Around 200 email messages left to feed,
the backend will want to crash around that point. I will try
to narrow down what query is being ran at that point. Perhaps
I can extract it from the core dump.


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> On Thu, Feb 10, 2005 at 10:08:14AM -0500, Tom Lane wrote:
>> There is a check_stack_depth call in there, so this could only be the
>> explanation if max_stack_depth is set too high for the actual
>> stack depth limit.  What's the platform, and what ulimit values is the
>> postmaster started under?

> FreeBSD 4.11 on x86 using PostgreSQL 8.0.1

> % limits
> Resource limits (current):
>   stacksize           65536 kb

Hmm, 64meg should certainly be far past where check_stack_depth will
start to complain, so there's something else going on here.  (BUT:
the ulimits you see interactively aren't necessarily what the postmaster
was started under.  You might want to confirm that the results are the
same if you put the limits call into the boot script that launches
the postmaster...)

> the backend will want to crash around that point. I will try
> to narrow down what query is being ran at that point. Perhaps
> I can extract it from the core dump.

See debug_query_string.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Neil Conway
Date:
Tom Lane wrote:
> Hmm, 64meg should certainly be far past where check_stack_depth will
> start to complain, so there's something else going on here.

Right; if those really are the stack size limits for the crashing
backend, I guess my initial analysis must have been mistaken. But I'm
mystified as to what else the problem could possibly be...

-Neil

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Thu, Feb 10, 2005 at 05:24:12PM -0500, Tom Lane wrote:
> Hmm, 64meg should certainly be far past where check_stack_depth will
> start to complain, so there's something else going on here.  (BUT:
> the ulimits you see interactively aren't necessarily what the postmaster
> was started under.  You might want to confirm that the results are the
> same if you put the limits call into the boot script that launches
> the postmaster...)
I put limits in the startup script for PostgreSQL and came up with the same
64MB stacksize value.

> See debug_query_string.
I've got the gdb output of printing debug_query_string and the text saved
from script(1) is about 267KB in size. Should I send this to you off-list?


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
>> See debug_query_string.
> I've got the gdb output of printing debug_query_string and the text saved
> from script(1) is about 267KB in size. Should I send this to you off-list?

Please.  Also I'll need the schemas for the tables the query uses.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 04:56:16PM -0500, Tom Lane wrote:
> regression=# create function infinite_recurse() returns int as '
> regression'# select infinite_recurse()' language sql;
> regression=# \set VERBOSITY terse
> regression=# select infinite_recurse();
> ERROR:  stack depth limit exceeded
>
> and see if you get the proper error or a core dump?

This makes the backend core dump in about 5-10 seconds. It looks like your
analysis is correct. Excellent work. I am not sure if it's my particular
FreeBSD installation that is screwing up or if it's FreeBSD in general.
Did you happen to test with FreeBSD on one of your test machines?

> If it dumps core, then the thing to look at is check_stack_depth() in
> src/backend/tcop/postgres.c.  Maybe your compiler is bogusly optimizing
> the address arithmetic there?

It is possible. Using gcc 2.95.4. I will eliminate all the optimization
options when I recompile PostgreSQL and see what happens.


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> Did you happen to test with FreeBSD on one of your test machines?

No, I don't have any BSD machines here.  However, I've added this test
case to the regression tests, so in a few hours we'll have a spectrum
of results from the PG build farm.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
>> Did you happen to test with FreeBSD on one of your test machines?

> No, I don't have any BSD machines here.  However, I've added this test
> case to the regression tests, so in a few hours we'll have a spectrum
> of results from the PG build farm.

FWIW, I see a "pass" from buildfarm member cockatoo, which claims to be
FreeBSD 4.10-STABLE gcc 2.95.4 x86 ... that's at least pretty close to
your setup, no?

http://www.pgbuildfarm.org/cgi-bin/show_status.pl

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 06:46:55PM -0500, Tom Lane wrote:
> >> Did you happen to test with FreeBSD on one of your test machines?
>
> > No, I don't have any BSD machines here.  However, I've added this test
> > case to the regression tests, so in a few hours we'll have a spectrum
> > of results from the PG build farm.
>
> FWIW, I see a "pass" from buildfarm member cockatoo, which claims to be
> FreeBSD 4.10-STABLE gcc 2.95.4 x86 ... that's at least pretty close to
> your setup, no?

Yup ... that's really close. I'm going to do a few tests to try to figure
out what's going on with my environment and let you know if I come up with
anything odd.


Thanks for your time!

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 06:54:31PM -0500, I wrote:
> Yup ... that's really close. I'm going to do a few tests to try to figure
> out what's going on with my environment and let you know if I come up with
> anything odd.

I think I have figured out the culprit.

I use the FreeBSD PostgreSQL port and I set the option to use threads so
that pl/python will work with PostgreSQL. If I unset this option and rebuild
the port, I will receive the normal error message but no crash. Just to
make sure it was not the port's fault, I build the source by hand without
any port-specific patches. The same behavior occurs.

So I am not sure if, again, it's my libc_r or something gets messed up in
the PostgreSQL code dealing with the stack/recursion, when I have pthread
libs linked in. I am guessing it is just my particular environment, though.


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> I use the FreeBSD PostgreSQL port and I set the option to use threads so
> that pl/python will work with PostgreSQL.

What option is that, exactly?

It's entirely possible that something has decided that the backend is
going to be multithreaded and is only giving the "main" thread a
1MB-or-so stack.  If so, I would regard this as a build error.  We do
not want threading libraries linked into the backend.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 09:46:50PM -0500, Tom Lane wrote:
> "Brian B." <brian-pgsql@bbdab.org> writes:
> > I use the FreeBSD PostgreSQL port and I set the option to use threads so
> > that pl/python will work with PostgreSQL.
>
> What option is that, exactly?
>
> It's entirely possible that something has decided that the backend is
> going to be multithreaded and is only giving the "main" thread a
> 1MB-or-so stack.  If so, I would regard this as a build error.  We do
> not want threading libraries linked into the backend.

Apologies, due to message revising, I forgot to include the explanation of what
this port setting entails.

When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to
-D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python
procedural handler not being able to link with PostgreSQL until PostgreSQL
is built pthread-aware. I could be wrong on all of this, but it seems to
work as such. Unfortunately, this has some unforeseen broken behavior for
PostgreSQL that was sorta hard to debug. :)

Perhaps there is another way around it. I think the situation now is that I
should converse with the FreeBSD port developer rather than using up your time.


Thank you,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to
> -D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python
> procedural handler not being able to link with PostgreSQL until PostgreSQL
> is built pthread-aware.

Hmm, is that a FreeBSD-specific restriction?  I've not had any such
trouble on Linux or Mac OS X.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 10:05:36PM -0500, Tom Lane wrote:
> "Brian B." <brian-pgsql@bbdab.org> writes:
> > When setting the "LIBC_R" option, the FreeBSD port essentially sets CFLAGS to
> > -D_THREAD_SAFE and LDFLAGS to -pthread. This is probably due to the Python
> > procedural handler not being able to link with PostgreSQL until PostgreSQL
> > is built pthread-aware.
>
> Hmm, is that a FreeBSD-specific restriction?  I've not had any such
> trouble on Linux or Mac OS X.

I was citing that behavior from memory. It actually builds/installs OK until
one decides to add the handler to a database:

$ createlang plpythonu
createlang: language installation failed: ERROR:  could not load library
"/usr/local/lib/postgresql/plpython.so": dlopen
'/usr/local/lib/postgresql/plpython.so' failed.
(/usr/local/lib/python2.4/config/libpython2.4.so: Undefined symbol
"pthread_attr_destroy")

If I build the backend with the pthread stuff, this step succeeds.


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
BTW, something that would be interesting is to figure out what the
thread stack size actually is (I assume this is available in the FreeBSD
docs) and experiment to find what is the maximum value max_stack_depth
can be set to without letting infinite_recurse() dump core.

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
"Brian B."
Date:
On Fri, Feb 11, 2005 at 10:33:00PM -0500, Tom Lane wrote:
> BTW, something that would be interesting is to figure out what the
> thread stack size actually is (I assume this is available in the FreeBSD
> docs) and experiment to find what is the maximum value max_stack_depth
> can be set to without letting infinite_recurse() dump core.

You may be onto something, there. After doing some searching, I have
found the FreeBSD thread stack size as a default of 64KB! After Googling
for things like "freebsd thread stack size set", it seems other projects are
running into this situation with FreeBSD, as well as FreeBSD mailinglist
chatter about the philosophy behind proper stack usage and whether to just
match Linux's settings for this.

Some notable topics on the matter, being:

(Question about our default pthread stack size)
http://lists.freebsd.org/pipermail/freebsd-threads/2004-November/002699.html

([PATCH] Dynamic thread stack size)
http://lists.freebsd.org/pipermail/freebsd-threads/2005-January/002793.html

Search results showed several other projects having this issue. Some
try to workaround the problem by calling pthread_attr_setstacksize and using
a value anywhere from a default doubling (of 128KB) up to around 1MB.


Thanks,

Brian B.

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Tom Lane
Date:
"Brian B." <brian-pgsql@bbdab.org> writes:
> You may be onto something, there. After doing some searching, I have
> found the FreeBSD thread stack size as a default of 64KB!

Ugh :-(.  That might be reasonable for a program that's actually using
multiple threads, but a program that is not thread-aware at all
shouldn't be forced into that model IMHO.

As of now we are seeing one similar failure in the PG build farm,
member osprey: http://www.pgbuildfarm.org/cgi-bin/show_status.pl
It would seem that NetBSD 2.0 also has an unreasonably small default
stack size.  Can anyone check on what NetBSD is using?

            regards, tom lane

Re: BUG #1473: Backend bus error, possibly due to ANALYZE

From
Michael Fuhr
Date:
I haven't been following this thread closely, but if I understand
correctly then the problems are the result of linking PostgreSQL
against libc_r instead of libc on FreeBSD (at least FreeBSD 4.x),
which was done in an attempt to make plpythonu work.  Otherwise
attempting to createlang plpythonu results in the following error:

createlang: language installation failed:
ERROR:  could not load library "/usr/local/pgsql80/lib/plpython.so":
dlopen '/usr/local/pgsql80/lib/plpython.so' failed.
(/usr/local/lib/python2.4/config/libpython2.4.so: Undefined symbol "pthread_attr_destroy")

Is that right?

What I'm about to describe is a hack and it's probably wrong and
dangerous, but I did it as an experiment anyway to see what would
happen.  I'm in no way suggesting that anybody should try it except
on a test server that can tolerate failure and data corruption.

I created a .so file with stub versions of the missing functions
and used preload_libraries to load it into the PostgreSQL backend.
I built PostgreSQL "normally", i.e., without linking against libc_r.

It worked, at least in simple tests.

Pthreads functions generally return 0 on success or some errno value
on failure.  Most functions have EINVAL as a documented return value,
so I wrote all of the stub versions to return EINVAL -- I figured that
they should report failure instead of success because they don't
actually do anything.  I also used ereport() to log the fact that a
function was called.

CREATE FUNCTION foo(integer) RETURNS integer AS $$
return args[0]
$$ LANGUAGE plpythonu IMMUTABLE STRICT;

SELECT foo(1234);
NOTICE:  sem_init() called
NOTICE:  sem_wait() called
NOTICE:  sem_post() called
NOTICE:  pthread_self() called
...
 foo
------
 1234
(1 row)

The stub functions are called only when the language handler is
first loaded -- subsequent calls to plpythonu functions don't print
any of the notices, at least not that I've seen so far:

SELECT foo(5678);
 foo
------
 5678
(1 row)

It's interesting that although all of the stub functions report
failure, the code runs anyway.  It makes one wonder how thorough
the error checking is.

My pthread_phony.so file contains stub versions of the following
functions; all were required to stop the linker from complaining:

pthread_attr_destroy
pthread_attr_init
pthread_attr_setstacksize
pthread_create
pthread_detach
pthread_self
sem_destroy
sem_init
sem_post
sem_trywait
sem_wait

Again, this was nothing more than an experiment, and so far I've
done only a few simple tests.  It could very well cause the system
to crash and burn.  Don't try it unless you can afford to have your
database trashed.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/