Thread: A couple serious errors
Over the past couple days I started seeing errors like this in my server logs:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

Then tonight I started getting one like this:

FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed

PostgreSQL is primarily being used for a web application here, and the first of these errors starts popping up after things have been running for 3-4 hours (I don't have any idea about the second because tonight is the first time I've seen it). Once it starts, about 1 in every 3-4 web requests that hit the database fails.

It may just be coincidence, but this all seemed to start a few nights ago right after I ran a full vacuum. Since then I've dumped and reloaded the database and have upgraded from 7.3 -> 7.4.6, but the problem persists. I've found that most of the time restarting Apache will cure the problem, but obviously this is less than ideal as a long-term solution.

Searching Google reveals that several other people have also seen error messages like the first, and that it is not the root of the problem but a symptom. I'm not sure which logging options to use to get the necessary details for solving this, but if this isn't enough to go on, just tell me what to uncomment and/or change and I'll get you more info.

Thanks,
Mike
Mike Richards <mrmikerich@gmail.com> writes:
> Over the past couple days I started seeing errors like this in my server logs:
> WARNING: terminating connection because of crash of another server process

This is a consequence of an earlier failure --- tell us about what happened just before that.

> Then tonight I started getting one like this:
> FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed

I'm thinking that you've got hardware problems (bad RAM). There isn't any way that Postgres would delete its semaphores during normal operation.

regards, tom lane
On Thu, 18 Nov 2004 10:17:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mike Richards <mrmikerich@gmail.com> writes:
> > Over the past couple days I started seeing errors like this in my server logs:
> > WARNING: terminating connection because of crash of another server process
>
> This is a consequence of an earlier failure --- tell us about what
> happened just before that.

Ok, it just happened again, and this is what showed up in the log just before:

PANIC: XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
LOCATION: s_lock_stuck, s_lock.c:36
LOG: 00000: server process (PID 13804) was terminated by signal 6
LOCATION: LogChildExit, postmaster.c:2087
LOG: 00000: terminating any other active server processes
LOCATION: CleanupProc, postmaster.c:2008

Earlier in the night, however, it also crashed and this is what preceded it:

LOG: 00000: server process (PID 20195) was terminated by signal 11
LOCATION: LogChildExit, postmaster.c:2087
LOG: 00000: terminating any other active server processes
LOCATION: CleanupProc, postmaster.c:2008

I don't know if it's relevant, but Postgres does bring itself back up after the crash:

LOG: 00000: all server processes terminated; reinitializing
LOCATION: reaper, postmaster.c:1920
LOG: 00000: database system was interrupted at 2004-11-18 15:06:58 GMT

It'll probably crash again in 3-4 hours; if I get any more info I'll pass it on.

>
> > Then tonight I started getting one like this:
> > FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed
>
> I'm thinking that you've got hardware problems (bad RAM). There isn't
> any way that Postgres would delete its semaphores during normal
> operation.
Mike Richards <mrmikerich@gmail.com> writes:
> PANIC: XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
> ...
> LOG: 00000: server process (PID 20195) was terminated by signal 11
> ...
> FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed

If you were getting just one of these then I might think you'd come across a previously unknown PG bug. Given the variety of failure modes, though, I'm strongly inclined to suspect that the common root cause is flaky RAM. Time to get out memtest86 or some such tool.

regards, tom lane
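To make the "variety of failure modes" point concrete, here is a minimal Python sketch that tallies distinct failure signatures in a server log. It is only an illustration: the function name `tally_failures`, the `SIGNATURES` table, and the regular expressions are assumptions modelled on the messages quoted in this thread, not part of PostgreSQL, and a real log may have different prefixes or wording.

```python
import re
from collections import Counter

# Hypothetical signature table: patterns are modelled on the log lines
# quoted in this thread and would need adjusting for other log formats.
SIGNATURES = {
    "stuck spinlock": re.compile(r"PANIC:.*stuck spinlock"),
    "killed by signal": re.compile(
        r"server process \(PID \d+\) was terminated by signal (\d+)"
    ),
    "semaphore removed": re.compile(
        r"FATAL:.*semctl\(.*\) failed: Identifier removed"
    ),
    "cache lookup failed": re.compile(r"ERROR:.*cache lookup failed"),
}


def tally_failures(lines):
    """Count how often each known failure signature appears in the log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                # Keep the signal number so signal 6 and signal 11 stay distinct.
                label = f"{name} {match.group(1)}" if match.groups() else name
                counts[label] += 1
                break
    return counts


if __name__ == "__main__":
    # Sample lines copied from the messages in this thread.
    sample = [
        "PANIC: XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242",
        "LOG: 00000: server process (PID 13804) was terminated by signal 6",
        "LOG: 00000: server process (PID 20195) was terminated by signal 11",
        "FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed",
        "WARNING: terminating connection because of crash of another server process",
    ]
    for label, n in tally_failures(sample).most_common():
        print(f"{n:3d}  {label}")
```

The "terminating connection because of crash of another server process" warning is deliberately left out of the table, since it is a consequence of an earlier failure rather than a cause. Several unrelated signatures showing up on one machine fits the suspicion of a hardware problem more than a single PostgreSQL bug.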
On Thu, 18 Nov 2004 10:56:59 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mike Richards <mrmikerich@gmail.com> writes:
> > PANIC: XX000: stuck spinlock (0x4035a0a0) detected at lwlock.c:242
> > ...
> > LOG: 00000: server process (PID 20195) was terminated by signal 11
> > ...
> > FATAL: semctl(0, 0, SETVAL, 0) failed: Identifier removed
>
> If you were getting just one of these then I might think you'd come
> across a previously unknown PG bug. Given the variety of failure modes,
> though, I'm strongly inclined to suspect that the common root cause is
> flaky RAM. Time to get out memtest86 or some such tool.

Here's one more data point. This happens consistently when I try to run pg_dumpall (always at the same location):

ERROR: XX000: cache lookup failed for attribute 8 of relation 16390
LOCATION: get_rte_attribute_type, parse_relation.c:1573
STATEMENT: SELECT i.indexrelid as indexreloid, coalesce(c.conname, t.relname) as indexrelname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef, i.indkey, i.indisclustered, t.relnatts as indnkeys, coalesce(c.contype, '0') as contype, coalesce(c.oid, '0') as conoid
    FROM pg_catalog.pg_index i
    JOIN pg_catalog.pg_class t ON (t.oid = i.indexrelid)
    LEFT JOIN pg_catalog.pg_depend d ON (d.classid = t.tableoid AND d.objid = t.oid AND d.deptype = 'i')
    LEFT JOIN pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid = c.oid)
    WHERE i.indrelid = '95585'::pg_catalog.oid
    ORDER BY indexrelname
LOG: 08P01: unexpected EOF on client connection
LOCATION: SocketBackend, postgres.c:281