Thread: Backend often crashing

Backend often crashing

From
"Guido Notari"
Date:
I have one of those nasty problems, with Postgres backend often crashing
with signal 11.

I'll do my best to give you the details:

Postgres is 7.2.1, more exactly the Debian package 7.2.1-2 from the Stable
(Woody) distribution -- I'm forwarding a copy of this message to Debian's
package maintainer.

Postgres is running as a backend for a well-known Italian web site, running
on Zope (version 2.6.1 with the psycopg Python adapter, v.1.1).

The problem is recent, i.e. never happened until last month or so, on this
same setup.
I have a few other machines, running the same software setup, but different
Zope sites, never experiencing any problem.

These are the relevant lines from syslog:

Feb 20 14:43:53 speed postgres[13365]: [25] DEBUG:  server process (pid 15906) was terminated by signal 11
Feb 20 14:43:53 speed postgres[13365]: [26] DEBUG:  terminating any other active server processes
Feb 20 14:43:53 speed postgres[15908]: [26-1] NOTICE:  Message from PostgreSQL backend:
Feb 20 14:43:53 speed postgres[15908]: [26-2] ^IThe Postmaster has informed me that some other backend
Feb 20 14:43:53 speed postgres[15908]: [26-3] ^Idied abnormally and possibly corrupted shared memory.
Feb 20 14:43:53 speed postgres[15908]: [26-4] ^II have rolled back the current transaction and am
Feb 20 14:43:53 speed postgres[15908]: [26-5] ^Igoing to terminate your database system connection and exit.
Feb 20 14:43:53 speed postgres[15908]: [26-6] ^IPlease reconnect to the database system and repeat your query.
Feb 20 14:43:53 speed postgres[15904]: [26-1] NOTICE:  Message from PostgreSQL backend:
Feb 20 14:43:53 speed postgres[15904]: [26-2] ^IThe Postmaster has informed me that some other backend
Feb 20 14:43:53 speed postgres[15904]: [26-3] ^Idied abnormally and possibly corrupted shared memory.
Feb 20 14:43:53 speed postgres[15904]: [26-4] ^II have rolled back the current transaction and am
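
To put a number on how often the backend dies, the postmaster's "was terminated by signal" lines can be tallied per hour. This is only a rough sketch: the regular expression is written against the message format in the excerpt above and assumes each syslog entry sits on a single unwrapped line. The function name `crashes_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Matches the postmaster line reporting an abnormal backend exit, in the
# format seen in the excerpt (assumed to be one unwrapped line per entry).
CRASH_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+"
    r"postgres\[\d+\]:.*server process \(pid (?P<pid>\d+)\) "
    r"was terminated by signal (?P<signal>\d+)"
)

def crashes_per_hour(lines):
    """Count crash reports per (month, day, hour) and per signal number."""
    per_hour = Counter()
    per_signal = Counter()
    for line in lines:
        m = CRASH_RE.search(line)
        if m:
            per_hour[(m["month"], int(m["day"]), int(m["hour"]))] += 1
            per_signal[int(m["signal"])] += 1
    return per_hour, per_signal

# Lines copied from the excerpt; only the first one reports a crash.
sample = [
    "Feb 20 14:43:53 speed postgres[13365]: [25] DEBUG:  server process "
    "(pid 15906) was terminated by signal 11",
    "Feb 20 14:43:53 speed postgres[13365]: [26] DEBUG:  terminating any "
    "other active server processes",
    "Feb 20 14:43:53 speed postgres[15908]: [26-1] NOTICE:  Message from "
    "PostgreSQL backend:",
]

per_hour, per_signal = crashes_per_hour(sample)
print(dict(per_hour), dict(per_signal))
```

Run against the whole syslog file (for example `crashes_per_hour(open(path))`), this would also show whether the crashes cluster at particular times of day.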

I immediately thought of a hardware problem but, having an equivalent
machine online, I dumped the db and moved to that.
The problem manifested at once on the other machine, which had previously
(~1 month before) run the site without any error.

The two machines have the same software setup, but different Linux kernels
(2.4.19 vs 2.4.20, reiserfs vs ext3), and different hardware.

I cannot reproduce the problem reliably, though on the production machine
the database crashes many times an hour.

It _seems_ to be related to some mildly convoluted query (a SELECT-only
query). Running that query manually, I managed to crash the backend only
once.
VACUUM FULL never gave any error, nor did pg_dump.

I obtained some (pretty large, ~90MB) core files from the crashes. The
backtrace is consistent between the files, here it is:

#0  0x08157e92 in MemoryContextReset ()
#1  0x08157eb9 in MemoryContextResetChildren ()
#2  0x08157e8b in MemoryContextReset ()
#3  0x08157eb9 in MemoryContextResetChildren ()
#4  0x08157e8b in MemoryContextReset ()
#5  0x080c5c88 in ExecScan ()
#6  0x080cb61a in ExecSeqScan ()
#7  0x080c4139 in ExecProcNode ()
#8  0x080cbe2c in ExecSort ()
#9  0x080c41c9 in ExecProcNode ()
#10 0x080ca630 in ExecMergeJoin ()
#11 0x080c4189 in ExecProcNode ()
#12 0x080cbe2c in ExecSort ()
#13 0x080c41c9 in ExecProcNode ()
#14 0x080cc0ae in ExecUnique ()
#15 0x080c41d9 in ExecProcNode ()
#16 0x080cd5d5 in ExecReScanSetParamPlan ()
#17 0x080c5cac in ExecScan ()
#18 0x080cd5f6 in ExecSubqueryScan ()
#19 0x080c4169 in ExecProcNode ()
#20 0x080c73f8 in ExecProcAppend ()
#21 0x080c4129 in ExecProcNode ()
#22 0x080cbe2c in ExecSort ()
#23 0x080c41c9 in ExecProcNode ()
#24 0x080cb9a6 in ExecSetOp ()
#25 0x080c41e9 in ExecProcNode ()
#26 0x080cbe2c in ExecSort ()
#27 0x080c41c9 in ExecProcNode ()
#28 0x080c30fe in ExecutorEnd ()
#29 0x080c2797 in ExecutorRun ()
#30 0x081104de in ProcessQuery ()
#31 0x0810ed70 in pg_exec_query_string ()
#32 0x0810fd5e in PostgresMain ()
#33 0x080f6d4e in ClosePostmasterPorts ()
#34 0x080f669f in ClosePostmasterPorts ()
#35 0x080f5882 in PostmasterMain ()
#36 0x080f5391 in PostmasterMain ()
#37 0x080d4e18 in main ()
#38 0x401d114f in __libc_start_main () from /lib/libc.so.6
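
One way to check that the backtraces really are consistent across core files is to reduce each one to its sequence of function names and compare. This is a minimal sketch, assuming each backtrace is saved as gdb `bt` text in the format shown above. `frame_functions` and `same_call_path` are hypothetical helper names, and the sample is the first six frames of the trace above.

```python
import re

# One gdb frame line, e.g. "#5  0x080c5c88 in ExecScan ()".
# The "0x... in" part is treated as optional, since gdb omits it for
# some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")

def frame_functions(backtrace_text):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            names.append(m.group(2))
    return names

def same_call_path(backtraces):
    """True if all backtraces share the same sequence of function names."""
    paths = [frame_functions(bt) for bt in backtraces]
    return all(p == paths[0] for p in paths[1:])

trace_a = """\
#0  0x08157e92 in MemoryContextReset ()
#1  0x08157eb9 in MemoryContextResetChildren ()
#2  0x08157e8b in MemoryContextReset ()
#3  0x08157eb9 in MemoryContextResetChildren ()
#4  0x08157e8b in MemoryContextReset ()
#5  0x080c5c88 in ExecScan ()
"""

print(frame_functions(trace_a))
```

Comparing names rather than addresses keeps the check meaningful even if the traces come from differently built binaries. Without debug symbols, though, some of the names gdb reports may themselves be approximate.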

Any hints are welcome.

ciao
Guido




Re: Backend often crashing

From
Jeff Ross
Date:
On Thu, 20 Feb 2003, Guido Notari wrote:

> I have one of those nasty problems, with Postgres backend often crashing
> with signal 11.
>
[snip]

> Any hints are welcome.
>
> ciao
> Guido
>
I think signal 11 is almost always bad ram.  If you do a Google search,
you'll see what I mean.

If this is recent, maybe the ram just went bad.


--
Jeff Ross
Open Vistas Networking, Inc.
http://www.openvistas.net


Re: Backend often crashing

From
Tom Lane
Date:
"Guido Notari" <gnotari@linkgroup.it> writes:
> I have one of those nasty problems, with Postgres backend often crashing
> with signal 11.

My gut feeling after looking at the stack trace is that it's a
memory-stomp kind of error (something writing on memory that doesn't
belong to it --- probably a buffer overrun).

One quick-and-dirty thing to try is updating to 7.2.4.  Neil fixed a few
potential buffer overrun conditions in 7.2.2 and 7.2.4.  I don't have a
lot of hope that this will eliminate the issue, but (a) it's easy to do
and (b) you ought to be on 7.2.4 anyway on general principles.

If that doesn't help then you'll need to either create a reproducible
example so someone else can debug it, or work on debugging it yourself,
or possibly let someone else have access to your machine to try to debug
it for you.  A first step in either the second or third choices is to
rebuild Postgres with --enable-debug so that you can get a more complete
stack trace.
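
Once a build made with --enable-debug is installed and a fresh core file has been collected, the backtrace can be extracted non-interactively. This sketch builds and runs a `gdb` batch command. Both paths are placeholders rather than locations from this thread, and the helper names are hypothetical.

```python
import shlex
import subprocess

def gdb_backtrace_command(binary, core):
    """Build a gdb command that prints a full backtrace and then exits."""
    return ["gdb", "-batch", "-ex", "bt full", binary, core]

def run_backtrace(binary, core):
    """Run gdb on the core file and return its combined text output."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout

# Placeholder paths: adjust to wherever the rebuilt postgres binary and
# the core file actually live on the machine.
cmd = gdb_backtrace_command("/path/to/postgres", "/path/to/core")
print(" ".join(shlex.quote(part) for part in cmd))
```

The binary passed to gdb has to be the same one that produced the core file; otherwise the reported frames will not be trustworthy.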

            regards, tom lane

Re: Backend often crashing

From
Tony Grant
Date:
On Thu, 2003-02-20 at 15:49, Jeff Ross wrote:
> On Thu, 20 Feb 2003, Guido Notari wrote:
>
> > I have one of those nasty problems, with Postgres backend often crashing
> > with signal 11.
> >
> [snip]
>
> > Any hints are welcome.
> >
> > ciao
> > Guido
> >
> I think signal 11 is almost always bad ram.  If you do a Google search,
> you'll see what I mean.

Yes! This happened to me. Check your server hardware ASAP.

Cheers

Tony Grant

--
www.tgds.net Library management software toolkit,
redhat linux on Sony Vaio C1XD,
Dreamweaver MX with Tomcat and PostgreSQL


Re: Backend often crashing

From
"Nigel J. Andrews"
Date:
On Thu, 20 Feb 2003, Jeff Ross wrote:

> On Thu, 20 Feb 2003, Guido Notari wrote:
>
> > I have one of those nasty problems, with Postgres backend often crashing
> > with signal 11.
> >
> [snip]
>
> > Any hints are welcome.
> >
> > ciao
> > Guido
> >
> I think signal 11 is almost always bad ram.  If you do a Google search,
> you'll see what I mean.
>
> If this is recent, maybe the ram just went bad.

My first thought until the statements about the second machine doing it as
well.

Have you considered the possibility that you are running into some sort of
resource limit? It could be your machines have a hard memory usage limit (they
are production machines after all). I don't think PostgreSQL would just die on
a Sig11 for that. It should log messages about memory allocation
failure. However, if your kernel is configured to do 'lazy' allocation it could
be that it's only when the memory comes to be used that the fault
happens. (Should that be a bus fault not a segmentation one though?)
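
A quick way to see which per-process limits are in force is to query them from the account (and ideally the same startup environment) that runs the postmaster. This is a minimal sketch using Python's standard `resource` module, which is Unix only. The choice of limits to inspect is an assumption, and the overcommit setting is read only if the kernel exposes that file.

```python
import os
import resource

# Limits that could plausibly affect a backend; this selection is an assumption.
LIMITS = {
    "address space": resource.RLIMIT_AS,
    "data segment": resource.RLIMIT_DATA,
    "stack": resource.RLIMIT_STACK,
    "core file size": resource.RLIMIT_CORE,
}

def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}

def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the kernel's overcommit setting as text, or None if unavailable."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
print("overcommit_memory:", overcommit_mode())
```

Limits are inherited from the parent process, so the numbers only reflect what a backend sees if the script is started the same way the postmaster is.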


--
Nigel J. Andrews


Re: Backend often crashing

From
"Guido Notari"
Date:
On 21/02/2003 09.11.14 pgsql-general-owner wrote:

>  > > I have one of those nasty problems, with Postgres backend often crashing
>  > > with signal 11.
>  > >
>  > [snip]
>  >
>  > I think signal 11 is almost always bad ram.  If you do a Google search,
>  > you'll see what I mean.
>  >
>  > If this is recent, maybe the ram just went bad.
>
>  My first thought until the statements about the second machine doing it as
>  well.
>
>  Have you considered the possibility that you are running into some sort of
>  resource limit? It could be your machines have a hard memory usage limit (they
>  are production machines after all). I don't think PostgreSQL would just die on
>  a Sig11 for that. It should log messages about memory allocation
>  failure. However, if your kernel is configured to do 'lazy' allocation it could
>  be that it's only when the memory comes to be used that the fault
>  happens. (Should that be a bus fault not a segmentation one though?)

I don't believe at all in bad RAM on two machines at the same time.

Resource limits? I don't think so, though we can't rule anything out...

The two machines have similar configurations, but different amounts of RAM
(640MB vs 1GB).

And I still have a strange feeling about the sequence of events: machine A
ran the site without any problem manifesting itself (AFAIK); we then
switched the site to machine B for one month, where the problem manifested;
we then moved back to machine A, and the problem manifested on that machine too.

Memory limits, Postgres configuration, etc. never changed between switches.

Machine load wouldn't seem to be an issue: one crash occurred today at 07:34
am, and I don't think there was any significant load at the time.

ciao
Guido