Thread: Backend often crashing

From: "Guido Notari"

My apologies for my previous post. I wasn't aware of the webmail program
producing such a mess. Never again.



I'm resending the message I sent to the list a few weeks ago, as a reminder.

I'm adding my findings at the bottom.



>I have one of those nasty problems, with Postgres backend often crashing
>with signal 11.
>
>I'll do my best to give you the details:
>
>Postgres is 7.2.1, more precisely the Debian package 7.2.1-2 from the Stable
>(Woody) distribution -- I'm forwarding a copy of this message to the Debian
>package maintainer.
>
>Postgres is running as the backend for a well-known Italian web site, running
>on Zope (version 2.6.1 with the psycopg Python adapter, v.1.1).
>
>The problem is recent, i.e. it never happened until last month or so, on this
>same setup.
>I have a few other machines, running the same software setup but different
>Zope sites, that have never experienced any problem.
>
>These are the relevant lines from syslog
>
>Feb 20 14:43:53 speed postgres[13365]: [25] DEBUG:  server process (pid 15906) was terminated by signal 11
>Feb 20 14:43:53 speed postgres[13365]: [26] DEBUG:  terminating any other active server processes
>Feb 20 14:43:53 speed postgres[15908]: [26-1] NOTICE:  Message from PostgreSQL backend:
>Feb 20 14:43:53 speed postgres[15908]: [26-2] ^IThe Postmaster has informed me that some other backend
>Feb 20 14:43:53 speed postgres[15908]: [26-3] ^Idied abnormally and possibly corrupted shared memory.
>Feb 20 14:43:53 speed postgres[15908]: [26-4] ^II have rolled back the current transaction and am
>Feb 20 14:43:53 speed postgres[15908]: [26-5] ^Igoing to terminate your database system connection and exit.
>Feb 20 14:43:53 speed postgres[15908]: [26-6] ^IPlease reconnect to the database system and repeat your query.
>Feb 20 14:43:53 speed postgres[15904]: [26-1] NOTICE:  Message from PostgreSQL backend:
>Feb 20 14:43:53 speed postgres[15904]: [26-2] ^IThe Postmaster has informed me that some other backend
>Feb 20 14:43:53 speed postgres[15904]: [26-3] ^Idied abnormally and possibly corrupted shared memory.
>Feb 20 14:43:53 speed postgres[15904]: [26-4] ^II have rolled back the current transaction and am
>
>I immediately suspected a hardware problem but, having an equivalent
>machine online, I dumped the db and moved it to that machine.
>The problem manifested at once on the other machine, which had previously
>(~1 month before) run the site without any error.
>
>The two machines have the same software setup, but different Linux kernels
>(2.4.19 vs 2.4.20, reiserfs vs ext3), and different hardware.
>
>I cannot reproduce the problem reliably, though on the production machine
>the database crashes many times an hour.
>
>It _seems_ to be related to some mildly convoluted query (a SELECT only
>query). Running that query manually, I managed to crash the backend only
>once.
>VACUUM FULL never gave any error, nor did pg_dump.
>
>I obtained some (pretty large, ~90MB) core files from the crashes. The
>backtrace is consistent between the files, here it is:
>
>#0  0x08157e92 in MemoryContextReset ()
>#1  0x08157eb9 in MemoryContextResetChildren ()
>#2  0x08157e8b in MemoryContextReset ()
>#3  0x08157eb9 in MemoryContextResetChildren ()
>#4  0x08157e8b in MemoryContextReset ()
>#5  0x080c5c88 in ExecScan ()
>#6  0x080cb61a in ExecSeqScan ()
>#7  0x080c4139 in ExecProcNode ()
>#8  0x080cbe2c in ExecSort ()
>#9  0x080c41c9 in ExecProcNode ()
>#10 0x080ca630 in ExecMergeJoin ()
>#11 0x080c4189 in ExecProcNode ()
>#12 0x080cbe2c in ExecSort ()
>#13 0x080c41c9 in ExecProcNode ()
>#14 0x080cc0ae in ExecUnique ()
>#15 0x080c41d9 in ExecProcNode ()
>#16 0x080cd5d5 in ExecReScanSetParamPlan ()
>#17 0x080c5cac in ExecScan ()
>#18 0x080cd5f6 in ExecSubqueryScan ()
>#19 0x080c4169 in ExecProcNode ()
>#20 0x080c73f8 in ExecProcAppend ()
>#21 0x080c4129 in ExecProcNode ()
>#22 0x080cbe2c in ExecSort ()
>#23 0x080c41c9 in ExecProcNode ()
>#24 0x080cb9a6 in ExecSetOp ()
>#25 0x080c41e9 in ExecProcNode ()
>#26 0x080cbe2c in ExecSort ()
>#27 0x080c41c9 in ExecProcNode ()
>#28 0x080c30fe in ExecutorEnd ()
>#29 0x080c2797 in ExecutorRun ()
>#30 0x081104de in ProcessQuery ()
>#31 0x0810ed70 in pg_exec_query_string ()
>#32 0x0810fd5e in PostgresMain ()
>#33 0x080f6d4e in ClosePostmasterPorts ()
>#34 0x080f669f in ClosePostmasterPorts ()
>#35 0x080f5882 in PostmasterMain ()
>#36 0x080f5391 in PostmasterMain ()
>#37 0x080d4e18 in main ()
>#38 0x401d114f in __libc_start_main () from /lib/libc.so.6



As it turned out, switching to version 7.2.4 did not help. The errors are
still there.

But now, at least, I have a clue. The error seems to be triggered
almost exclusively by a search function on the web site.

The code turned out to make extensive use of the Postgres to_ascii() function.
I have reason to suspect that the database contains, in text fields,
characters that do not belong to the selected encoding (LATIN1).

So, I fancied, one possible culprit was the to_ascii function choking on
some strange character.

I replaced the occurrences of to_ascii with a custom function that calls
to_ascii only on the result of a translate, which in turn converts some
strange (Russian?) characters to plain ASCII.

The error rate dropped; the few errors that remain don't seem to be related to
that search function.

Of course, this is not conclusive. I have yet to reproduce the error reliably
on a single, selected data row, but I think what I found is worth
reporting.



Thanks to the development team!



ciao

Guido


Re: Backend often crashing

From: Tom Lane

"Guido Notari" <gnotari@linkgroup.it> writes:
> I replaced the occurrences of to_ascii with a custom function that calls
> to_ascii only on the result of a translate, which in turn converts some
> strange (Russian?) characters to plain ASCII.

Ooohhh ... looking at the source code shows that to_ascii processes one
byte too many, which would lead to clobbering the next byte of memory,
which would quite possibly cause your problem.

In 7.2.4, the bug is at line 114 of src/backend/utils/adt/ascii.c:

    for (x = src; x <= src_end; x++)

should be

    for (x = src; x < src_end; x++)

            regards, tom lane
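
To illustrate the off-by-one described above, here is a minimal sketch in C.
It is not the PostgreSQL source: filter_fixed, filter_buggy and guard_survives
are hypothetical names, and the per-byte work (replacing bytes >= 128 with a
space) is only an assumed stand-in for what to_ascii does. The guard byte sits
inside the same array, so the overrun can be observed here without undefined
behaviour.

```c
#include <assert.h>

/* Hypothetical in-place byte filter; NOT the PostgreSQL code.
 * Replaces every byte >= 128 in the half-open range [src, src_end). */
static void filter_fixed(unsigned char *src, unsigned char *src_end)
{
    unsigned char *x;

    assert(src <= src_end);
    for (x = src; x < src_end; x++)     /* stops before src_end */
        if (*x >= 128)
            *x = ' ';
}

/* Same loop with the inclusive bound from the report: it also visits
 * *src_end, i.e. one byte past the data it was asked to process. */
static void filter_buggy(unsigned char *src, unsigned char *src_end)
{
    unsigned char *x;

    assert(src <= src_end);
    for (x = src; x <= src_end; x++)    /* one byte too many */
        if (*x >= 128)
            *x = ' ';
}

/* Runs a filter over a 4-byte payload that is followed by a guard byte
 * (0xFF) and returns 1 if the guard byte is still intact afterwards. */
static int guard_survives(void (*filter)(unsigned char *, unsigned char *))
{
    unsigned char buf[5] = { 'a', 0xE9, 'b', 'c', 0xFF };

    filter(buf, buf + 4);               /* payload is buf[0..3] */
    return buf[4] == 0xFF;
}
```

In this sketch guard_survives(filter_fixed) returns 1, while
guard_survives(filter_buggy) returns 0 because the inclusive loop overwrites
the guard byte. In a real backend the byte after the buffer could belong to
some other allocation, which fits the "clobbering the next byte of memory"
explanation.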


Re: Backend often crashing

From: Dennis Gearon

WOW,
    Open source at its best. A guy has a problem, goes through all the functions,
delivers the suspects, and another open source worker gets it fixed, in less
than 24 hours.
    What a way of life :-)

Tom Lane wrote:
> "Guido Notari" <gnotari@linkgroup.it> writes:
>
>>I replaced the occurrences of to_ascii with a custom function that calls
>>to_ascii only on the result of a translate, which in turn converts some
>>strange (Russian?) characters to plain ASCII.
>
>
> Ooohhh ... looking at the source code shows that to_ascii processes one
> byte too many, which would lead to clobbering the next byte of memory,
> which would quite possibly cause your problem.
>
> In 7.2.4, the bug is at line 114 of src/backend/utils/adt/ascii.c:
>
>     for (x = src; x <= src_end; x++)
>
> should be
>
>     for (x = src; x < src_end; x++)
>
>             regards, tom lane


Re: Backend often crashing

From: Tom Lane

Dennis Gearon <gearond@cvc.net> writes:
> WOW,
>     Open source at its best. A guy has a problem, goes through all the functions,
> delivers the suspects, and another open source worker gets it fixed, in less
> than 24 hours.
>     What a way of life :-)

IMHO, the *real* advantage of open source is that Guido can patch it for
himself, without having to wait for us to put out a new release.

This is something that I think RPM distribution largely loses.
Certainly if you only know how to install binary RPMs, you're dependent
on the upstream folks to propagate fixes.  It might be okay if you build
from a source RPM --- can anyone comment on how hard it is to merge
locally-supplied diffs into a source RPM?  I've never tried to ...

            regards, tom lane


Re: Backend often crashing

From: "scott.marlowe"

On Wed, 2 Apr 2003, Tom Lane wrote:

> Dennis Gearon <gearond@cvc.net> writes:
> > WOW,
> >     Open source at its best. A guy has a problem, goes through all the functions,
> > delivers the suspects, and another open source worker gets it fixed, in less
> > than 24 hours.
> >     What a way of life :-)
>
> IMHO, the *real* advantage of open source is that Guido can patch it for
> himself, without having to wait for us to put out a new release.
>
> This is something that I think RPM distribution largely loses.
> Certainly if you only know how to install binary RPMs, you're dependent
> on the upstream folks to propagate fixes.  It might be okay if you build
> from a source RPM --- can anyone comment on how hard it is to merge
> locally-supplied diffs into a source RPM?  I've never tried to ...

It's probably the easiest part when it comes to making your own RPMs.


Re: Backend often crashing

From: Alvaro Herrera

On Wed, Apr 02, 2003 at 04:59:06PM -0500, Tom Lane wrote:

> This is something that I think RPM distribution largely loses.
> Certainly if you only know how to install binary RPMs, you're dependent
> on the upstream folks to propagate fixes.  It might be okay if you build
> from a source RPM --- can anyone comment on how hard it is to merge
> locally-supplied diffs into a source RPM?  I've never tried to ...

It's not that hard.  With the upstream guys' supplied source RPM and the
open source worker's patch, it's a matter of adding a pointer to the
patch to the SPEC file and rebuilding.  But surely most people used to binary
RPMs have never built one from an SRPM, let alone from plain source.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"In a specialized industrial society, it would be a disaster
to have kids running around loose." (Paul Graham)