Thread: A couple serious errors

A couple serious errors

From: Mike Richards
Over the past couple days I started seeing errors like this in my server logs:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.

Then tonight I started getting one like this:

FATAL:  semctl(0, 0, SETVAL, 0) failed: Identifier removed

PostgreSQL is primarily being used for a web application here, and the
first of these errors starts popping up after things have been running
for 3-4 hours (I don't know about the second, because tonight is the
first time I've seen it). Once it starts, about 1 in every 3-4 web
requests that hit the database fails.

It may just be coincidence, but this all seemed to start a few nights
ago right after I ran a full vacuum. Since then I've dumped and
reloaded the database and have upgraded from 7.3 -> 7.4.6, but the
problem persists.

I've found that most of the time restarting Apache will cure the
problem, but obviously this is less-than-ideal as a long-term
solution.

Searching Google reveals that several other people have also seen
error messages like the first, and that it is not the root of the
problem but a symptom. I'm not sure which logging options to use to
get the necessary details for solving this, but if this isn't enough
to go on, just tell me what to uncomment and/or change and I'll get
you more info.

Thanks,
Mike

Re: A couple serious errors

From: Tom Lane
Mike Richards <mrmikerich@gmail.com> writes:
> Over the past couple days I started seeing errors like this in my server logs:
> WARNING:  terminating connection because of crash of another server process

This is a consequence of an earlier failure --- tell us about what
happened just before that.

> Then tonight I started getting one like this:
> FATAL:  semctl(0, 0, SETVAL, 0) failed: Identifier removed

I'm thinking that you've got hardware problems (bad RAM).  There isn't
any way that Postgres would delete its semaphores during normal
operation.
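
The "Identifier removed" text in that message is the system's standard
description of errno EIDRM, which semctl() reports when the semaphore
set it was asked about has been removed. A minimal Python sketch for
confirming which errno a message corresponds to on a given machine
(the helper name describe_errno is hypothetical, and EIDRM is assumed
to be defined on the platform, as it is on Linux):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on this platform."""
    # errno.errorcode maps numeric values back to symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the same text the C library would print.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # EIDRM is the errno behind the semctl() failure quoted above.
    print(describe_errno(errno.EIDRM))
```

This only identifies the error code; it does not say which process
removed the semaphore set.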

            regards, tom lane

Re: A couple serious errors

From: Mike Richards
On Thu, 18 Nov 2004 10:17:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mike Richards <mrmikerich@gmail.com> writes:
> > Over the past couple days I started seeing errors like this in my server logs:
> > WARNING:  terminating connection because of crash of another server process
>
> This is a consequence of an earlier failure --- tell us about what
> happened just before that.

Ok, it just happened again, and this is what showed up in the log just before:

PANIC:  XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
LOCATION:  s_lock_stuck, s_lock.c:36
LOG:  00000: server process (PID 13804) was terminated by signal 6
LOCATION:  LogChildExit, postmaster.c:2087
LOG:  00000: terminating any other active server processes
LOCATION:  CleanupProc, postmaster.c:2008

Earlier in the night, however, it also crashed and this is what preceded it:

LOG:  00000: server process (PID 20195) was terminated by signal 11
LOCATION:  LogChildExit, postmaster.c:2087
LOG:  00000: terminating any other active server processes
LOCATION:  CleanupProc, postmaster.c:2008

I don't know if it's relevant, but Postgres does bring itself back up
after the crash:

LOG:  00000: all server processes terminated; reinitializing
LOCATION:  reaper, postmaster.c:1920
LOG:  00000: database system was interrupted at 2004-11-18 15:06:58 GMT

It'll probably crash again in 3-4 hours; if I get any more info I'll pass it on.
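
Finding what was logged just before each backend crash can be
automated. The following is a rough Python sketch, not a tested tool:
it assumes the log format shown above, and the names find_crashes and
signal_name are made up for illustration. It reports each crashed PID,
the signal's symbolic name, and the few lines logged before it.

```python
import re
import signal
from collections import deque

# Matches the postmaster's report of a crashed backend, as seen in the log above.
CRASH_RE = re.compile(
    r"server process \(PID (\d+)\) was terminated by signal (\d+)"
)


def signal_name(num: int) -> str:
    """Translate a signal number to its symbolic name on this platform."""
    try:
        return signal.Signals(num).name
    except ValueError:
        # Number not known to this platform; fall back to the raw value.
        return f"signal {num}"


def find_crashes(lines, context=3):
    """Yield (pid, signal_name, preceding_lines) for each crashed backend."""
    recent = deque(maxlen=context)  # rolling window of earlier log lines
    for raw in lines:
        line = raw.rstrip("\n")
        match = CRASH_RE.search(line)
        if match:
            pid = int(match.group(1))
            yield pid, signal_name(int(match.group(2))), list(recent)
        recent.append(line)


if __name__ == "__main__":
    sample = [
        "PANIC:  XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242",
        "LOCATION:  s_lock_stuck, s_lock.c:36",
        "LOG:  00000: server process (PID 13804) was terminated by signal 6",
        "LOCATION:  LogChildExit, postmaster.c:2087",
    ]
    for pid, name, before in find_crashes(sample):
        print(pid, name)
        for entry in before:
            print("    " + entry)
```

On Linux, signal 6 is the abort signal raised after a PANIC and signal
11 is a segmentation fault, so the two crashes above are different
kinds of failure.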

>
> > Then tonight I started getting one like this:
> > FATAL:  semctl(0, 0, SETVAL, 0) failed: Identifier removed
>
> I'm thinking that you've got hardware problems (bad RAM).  There isn't
> any way that Postgres would delete its semaphores during normal
> operation.

Re: A couple serious errors

From: Tom Lane
Mike Richards <mrmikerich@gmail.com> writes:
> PANIC:  XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
> ...
> LOG:  00000: server process (PID 20195) was terminated by signal 11
> ...
> FATAL:  semctl(0, 0, SETVAL, 0) failed: Identifier removed

If you were getting just one of these then I might think you'd come
across a previously unknown PG bug.  Given the variety of failure modes,
though, I'm strongly inclined to suspect that the common root cause is
flaky RAM.  Time to get out memtest86 or some such tool.

            regards, tom lane

Re: A couple serious errors

From: Mike Richards
On Thu, 18 Nov 2004 10:56:59 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mike Richards <mrmikerich@gmail.com> writes:
> > PANIC:  XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
> > ...
> > LOG:  00000: server process (PID 20195) was terminated by signal 11
> > ...
> > FATAL:  semctl(0, 0, SETVAL, 0) failed: Identifier removed
>
> If you were getting just one of these then I might think you'd come
> across a previously unknown PG bug.  Given the variety of failure modes,
> though, I'm strongly inclined to suspect that the common root cause is
> flaky RAM.  Time to get out memtest86 or some such tool.

Here's one more data point. This happens consistently when I try to
run pg_dumpall (always at the same location):

ERROR:  XX000: cache lookup failed for attribute 8 of relation 16390
LOCATION:  get_rte_attribute_type, parse_relation.c:1573
STATEMENT:  SELECT i.indexrelid as indexreloid, coalesce(c.conname,
t.relname) as indexrelname, pg_catalog.pg_get_indexdef(i.indexrelid)
as indexdef, i.indkey, i.indisclustered, t.relnatts as indnkeys,
coalesce(c.contype, '0') as contype, coalesce(c.oid, '0') as conoid
FROM pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid =
i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid =
t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN
pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND
d.refobjid = c.oid) WHERE i.indrelid = '95585'::pg_catalog.oid ORDER
BY indexrelname
LOG:  08P01: unexpected EOF on client connection
LOCATION:  SocketBackend, postgres.c:281
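
One possible follow-up is to look at what the system catalogs say about
the relation and attribute named in the error. The Python sketch below
only builds the query text from the error message. It assumes the
standard pg_class and pg_attribute catalogs; the function name
catalog_queries is hypothetical. If the catalogs themselves are
damaged, these queries may fail too.

```python
import re

# Matches the error text shown above and captures the two numbers in it.
ERR_RE = re.compile(
    r"cache lookup failed for attribute (\d+) of relation (\d+)"
)


def catalog_queries(message: str) -> list:
    """Build catalog lookups for the relation/attribute in the error text."""
    match = ERR_RE.search(message)
    if match is None:
        return []
    # Converting to int guarantees only digits end up in the SQL text.
    attnum = int(match.group(1))
    relid = int(match.group(2))
    return [
        # Which relation has this OID, and how many columns should it have?
        "SELECT relname, relnatts FROM pg_catalog.pg_class "
        f"WHERE oid = {relid};",
        # Which columns does the catalog actually list for it?
        "SELECT attnum, attname FROM pg_catalog.pg_attribute "
        f"WHERE attrelid = {relid} ORDER BY attnum;",
        # Is there a row for the specific attribute the lookup wanted?
        "SELECT attnum, attname FROM pg_catalog.pg_attribute "
        f"WHERE attrelid = {relid} AND attnum = {attnum};",
    ]


if __name__ == "__main__":
    error = "ERROR:  XX000: cache lookup failed for attribute 8 of relation 16390"
    for query in catalog_queries(error):
        print(query)
```

The printed statements can be pasted into psql; comparing relnatts with
the rows actually present in pg_attribute shows whether an attribute
entry is missing.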