Thread: process exited with status 11 after XLogFlush: request is not satisfied
process exited with status 11 after XLogFlush: request is not satisfied
From: "Bjoern Metzdorf"
Date:
Hello,

I'm running postgres 7.1.2 / 7.1.3 on Linux 2.2.17 (Debian Potato). Tonight we were suffering from

FATAL 2: XLogFlush: request is not satisfied

in the logfile, and crashes afterwards:

Server process (pid 26888) exited with status 512 at Wed Jan 30 09:51:37 2002

This was with 7.1.2. As stated in the archives, I ran pg_resetxlog on the data directory (after a backup). After that, the "XLogFlush: request is not satisfied" messages vanished, but now we are suffering from several "Server process (pid 5060) exited with status 11". At first I suspected hard drive or memory problems, but the disks (ICP Vortex RAID 5) seem to be fine, as does the memory (I tested with memtester: http://www.qcc.sk.ca/~charlesc/software/memtester/). Then I upgraded to 7.1.3, but the problem persists. Here are the exact logfile entries with wal_debug and debug_level set to 2, after the upgrade to 7.1.3:

---------------------------------snip---------------------------------------
INSERT @ 30/1818723404: prev 30/1818723368; xprev 30/0; xid 162572613: Heap - insert: node 1057183/1256672; tid 16173/4
INSERT @ 30/1818725464: prev 30/1818723404; xprev 30/1818723404; xid 162572613: Btree - insert: node 1057183/1256684; tid 7812/421
INSERT @ 30/1818725524: prev 30/1818725464; xprev 30/1818725464; xid 162572613: Heap - insert: node 1057183/1256672; tid 16174/1
INSERT @ 30/1818727500: prev 30/1818725524; xprev 30/1818725524; xid 162572613: Btree - insert: node 1057183/1256684; tid 7812/421
INSERT @ 30/1818727560: prev 30/1818727500; xprev 30/1818727500; xid 162572613: Heap - insert: node 1057183/1256657; tid 435/55
INSERT @ 30/1818727692: prev 30/1818727560; xprev 30/1818727560; xid 162572613: Btree - insert: node 1057183/1256687; tid 3239/462
INSERT @ 30/1818727752: prev 30/1818727692; xprev 30/1818727692; xid 162572613: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1818727816: prev 30/1818727752; xprev 30/1818727752; xid 162572613: Btree - insert: node 1057183/1256693; tid 530/8
INSERT @ 30/1818727896: prev 30/1818727816; xprev 30/1818727816; xid 162572613: Btree - insert: node 1057183/1256696; tid 4034/382
2002-01-30 12:41:01 DEBUG: CommitTransactionCommand
INSERT @ 30/1818727960: prev 30/1818727896; xprev 30/1818727896; xid 162572613: Transaction - commit: 2002-01-30 12:41:01
XLogFlush: rqst 30/1818727996; wrt 30/1818723404; flsh 30/1818723404
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5060 exited with status 11
Server process (pid 5060) exited with status 11 at Wed Jan 30 12:41:01 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5186
---------------------------------snip---------------------------------------
INSERT @ 30/1819347080: prev 30/1819347044; xprev 30/0; xid 162573078: Heap - insert: node 1057183/1256672; tid 16184/3
INSERT @ 30/1819349140: prev 30/1819347080; xprev 30/1819347080; xid 162573078: Btree - insert: node 1057183/1256684; tid 7812/466
INSERT @ 30/1819349200: prev 30/1819349140; xprev 30/1819349140; xid 162573078: Heap - insert: node 1057183/1256672; tid 16184/4
INSERT @ 30/1819350832: prev 30/1819349200; xprev 30/1819349200; xid 162573078: Btree - insert: node 1057183/1256684; tid 7812/466
INSERT @ 30/1819350892: prev 30/1819350832; xprev 30/1819350832; xid 162573078: Heap - insert: node 1057183/1256657; tid 436/10
INSERT @ 30/1819351032: prev 30/1819350892; xprev 30/1819350892; xid 162573078: Btree - insert: node 1057183/1256687; tid 3239/482
INSERT @ 30/1819351092: prev 30/1819351032; xprev 30/1819351032; xid 162573078: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1819351156: prev 30/1819351092; xprev 30/1819351092; xid 162573078; bkpb 1: Btree - insert: node 1057183/1256693; tid 752/3
INSERT @ 30/1819359428: prev 30/1819351156; xprev 30/1819351156; xid 162573078: Btree - insert: node 1057183/1256696; tid 4034/402
2002-01-30 12:41:44 DEBUG: CommitTransactionCommand
INSERT @ 30/1819359492: prev 30/1819359428; xprev 30/1819359428; xid 162573078: Transaction - commit: 2002-01-30 12:41:44
XLogFlush: rqst 30/1819359528; wrt 30/1819347080; flsh 30/1819347080
2002-01-30 12:41:45 DEBUG: StartTransactionCommand
2002-01-30 12:41:45 DEBUG: query: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG: ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG: CommitTransactionCommand
2002-01-30 12:41:45 DEBUG: StartTransactionCommand
2002-01-30 12:41:45 DEBUG: ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG: CommitTransactionCommand
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5203 exited with status 11
Server process (pid 5203) exited with status 11 at Wed Jan 30 12:41:45 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5211
postmaster: CleanupProc: sending SIGQUIT to process 5210
---------------------------------snip---------------------------------------
INSERT @ 30/1821302728: prev 30/1821302692; xprev 30/0; xid 162575289: Heap - insert: node 1057183/1256672; tid 16222/1
INSERT @ 30/1821304800: prev 30/1821302728; xprev 30/1821302728; xid 162575289: Btree - insert: node 1057183/1256684; tid 7813/382
INSERT @ 30/1821304860: prev 30/1821304800; xprev 30/1821304800; xid 162575289: Heap - insert: node 1057183/1256672; tid 16222/2
INSERT @ 30/1821306724: prev 30/1821304860; xprev 30/1821304860; xid 162575289: Btree - insert: node 1057183/1256684; tid 7813/382
INSERT @ 30/1821306784: prev 30/1821306724; xprev 30/1821306724; xid 162575289: Heap - insert: node 1057183/1256657; tid 437/5
INSERT @ 30/1821306932: prev 30/1821306784; xprev 30/1821306784; xid 162575289: Btree - insert: node 1057183/1256687; tid 3240/305
INSERT @ 30/1821306992: prev 30/1821306932; xprev 30/1821306932; xid 162575289: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1821307056: prev 30/1821306992; xprev 30/1821306992; xid 162575289; bkpb 1: Btree - insert: node 1057183/1256693; tid 1352/7
INSERT @ 30/1821315328: prev 30/1821307056; xprev 30/1821307056; xid 162575289: Btree - insert: node 1057183/1256696; tid 4035/273
2002-01-30 12:43:15 DEBUG: CommitTransactionCommand
INSERT @ 30/1821315392: prev 30/1821315328; xprev 30/1821315328; xid 162575289: Transaction - commit: 2002-01-30 12:43:15
XLogFlush: rqst 30/1821315428; wrt 30/1821302728; flsh 30/1821302728
2002-01-30 12:43:16 DEBUG: StartTransactionCommand
2002-01-30 12:43:16 DEBUG: query: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG: ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG: CommitTransactionCommand
2002-01-30 12:43:16 DEBUG: StartTransactionCommand
2002-01-30 12:43:16 DEBUG: ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG: CommitTransactionCommand
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5220 exited with status 11
Server process (pid 5220) exited with status 11 at Wed Jan 30 12:43:16 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5279
postmaster: CleanupProc: sending SIGQUIT to process 5278
---------------------------------snip---------------------------------------
2002-01-30 12:45:06 DEBUG: ProcessQuery
2002-01-30 12:45:06 DEBUG: CommitTransactionCommand
XLogFlush: rqst 30/1821796188; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821776284; wrt 30/1824031200; flsh 30/1824031200
XLogFlush: rqst 30/1821809412; wrt 30/1823786416; flsh 30/1823786416
2002-01-30 12:45:06 DEBUG: CommitTransactionCommand
XLogFlush: rqst 30/1821811520; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821822152; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821848796; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821861820; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821884244; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821896940; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821900460; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821910400; wrt 30/1823786416; flsh 30/1823786416
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5375 exited with status 11
Server process (pid 5375) exited with status 11 at Wed Jan 30 12:45:06 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5398
postmaster: CleanupProc: sending SIGQUIT to process 5397
---------------------------------snip---------------------------------------

Is this due to a corrupted record? If yes, how can I safely delete that one?

Hope you can help me. Thanks in advance.

greetings,
Bjoern
"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes: > Is this due to a corrupted record? If yes, how can I safely delete that one? Hard to tell, though it seems like a good bet. A stack backtrace from the core file produced by one of the crashed backends would tell us more. If you don't see any core file in the $PGDATA/base/yourdb/ directory, restart the postmaster with "ulimit -c unlimited" environment to allow core dumps to occur. regards, tom lane
Re: process exited with status 11 after XLogFlush: request is not satisfied
From: "Bjoern Metzdorf"
Date:
Hi Tom,

> Hard to tell, though it seems like a good bet. A stack backtrace from
> the core file produced by one of the crashed backends would tell us
> more.

Ok, I have a core file now; running gdb 5.1.1, this is what comes out:

-----------------------snip--------------------------
Core was generated by `postgres: espl espl 213.148.1'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x08066109 in nocachegetattr ()
(gdb) bt
#0  0x08066109 in nocachegetattr ()
#1  0x080c0bfc in ExecEvalVar ()
#2  0x080c16bb in ExecEvalExpr ()
#3  0x080c179d in ExecEvalExpr ()
#4  0x080c10c8 in ExecEvalFuncArgs ()
#5  0x080c1144 in ExecMakeFunctionResult ()
#6  0x080c1346 in ExecEvalFunc ()
#7  0x080c1740 in ExecEvalExpr ()
#8  0x080c10c8 in ExecEvalFuncArgs ()
#9  0x080c1144 in ExecMakeFunctionResult ()
#10 0x080c1346 in ExecEvalFunc ()
#11 0x080c1740 in ExecEvalExpr ()
#12 0x080c10c8 in ExecEvalFuncArgs ()
#13 0x080c1144 in ExecMakeFunctionResult ()
#14 0x080c12ee in ExecEvalOper ()
#15 0x080c1730 in ExecEvalExpr ()
#16 0x080c1856 in ExecQual ()
#17 0x080c1d26 in ExecScan ()
#18 0x080c742a in ExecSeqScan ()
#19 0x080c03b9 in ExecProcNode ()
#20 0x080c7bba in ExecSort ()
#21 0x080c0449 in ExecProcNode ()
#22 0x080c6406 in ExecMergeJoin ()
#23 0x080c0409 in ExecProcNode ()
#24 0x080bf37d in ExecutePlan ()
#25 0x080bea57 in ExecutorRun ()
#26 0x0810579b in ProcessQuery ()
#27 0x081041c1 in pg_exec_query_string ()
#28 0x08105254 in PostgresMain ()
#29 0x080efe34 in DoBackend ()
#30 0x080ef9ec in BackendStartup ()
#31 0x080eec19 in ServerLoop ()
#32 0x080ee603 in PostmasterMain ()
#33 0x080cebfd in main ()
#34 0x400a2a42 in __libc_start_main () from /lib/libc.so.6
-----------------------snip--------------------------

System: Linux 2.2.17-SMP, PG 7.1.3

Right now the symptoms have changed: we don't get any "status 11" anymore, but LOTS of "Server process (pid 2644) exited with status 139 at Wed Jan 30 17:26:50 2002". When we disable access to the espl database, the server runs fine, so I assume a corrupted record in one of the espl tables. Any hints on how to find it?

greetings,
Bjoern
"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes: > (gdb) bt > #0 0x08066109 in nocachegetattr () > #1 0x080c0bfc in ExecEvalVar () > #2 0x080c16bb in ExecEvalExpr () > #3 0x080c179d in ExecEvalExpr () > #4 0x080c10c8 in ExecEvalFuncArgs () Yeah, that does look like a corrupted-data problem. If you wanted to rebuild with debugging symbols turned on, it might be possible to extract the location of the bad tuple directly from the corefile. Short of that, what I'd do is find out what query the backend is crashing on (turn on debug query logging), and then investigate the tables involved using queries like "select ctid,* from foo limit N". By varying the limit you can home in on just where the bad tuple is. regards, tom lane
Re: process exited with status 11 after XLogFlush: request is not satisfied
From: "Bjoern Metzdorf"
Date:
Hi,

> Yeah, that does look like a corrupted-data problem. If you wanted to
> rebuild with debugging symbols turned on, it might be possible to
> extract the location of the bad tuple directly from the corefile.
> Short of that, what I'd do is find out what query the backend is
> crashing on (turn on debug query logging), and then investigate the
> tables involved using queries like "select ctid,* from foo limit N".
> By varying the limit you can home in on just where the bad tuple is.

I tried the "select ctid,* from table limit N" way and found a possibly corrupted ctid. I then deleted all tuples at that ctid manually. It all looked good, but no, the problem persists.

5 minutes ago I did a
"select ctid,* from table order by id limit 82000"
and it worked flawlessly.

Then I did a
"select ctid,* from table order by id limit 200000"
and it crashed again.

I again tried
"select ctid,* from table order by id limit 82000"
and it crashed here also.

What do you suspect here? Hardware failure? I ran a badblocks read-only test and it was fine. I tested memory with memtester and it was fine...

How can this kind of corruption happen at all? And what can I do next?

greetings,
Bjoern
"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes: > It all looked good, but no, the problem persists. > 5 minutes ago I did a > "select ctid,* from table order by id limit 82000" > and it worked flawlessly. > Then I did a > "select ctid,* from table order by id limit 200000" > and it crashed again. > I again tried > "select ctid,* from table order by id limit 82000" > and it crashed here also. > What do you suspect here? Hardware failure? Given the above facts, that's what I suspect. You may have something like a disk controller that sometimes passes bad data. Or a bad RAM chip that sometimes drops bits. If it weren't a flaky-hardware kind of thing I'd expect the results to be reproducible. > I ran a badblocks read-only test > and it was fine. I tested memory with memtester and it was fine... Can anyone suggest other diagnostic tools to try? I don't think either of the above tests would be likely to notice a disk controller that sometimes returns wrong data, for example. regards, tom lane
"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes: > If we now do a > "select ctid,oid,* from table order by id limit 1,81274;" (which works > flawlessly) > we see, that this tuple is in ctid (5864,12). One thing to watch out for is that I believe "select ... limit" fetches one more tuple than it actually needs to return. Normally this doesn't matter but with a crashing issue it might. I would recommend using select ctid,* from table where ctid = '(page,item)' to make absolutely sure that you know which ctid is the bad tuple (if no crash, you got the wrong one), and then delete by ctid to get rid of it. It's entirely possible that several tuples on the same page are all corrupted, which is the main reason why ctid is so helpful in this situation --- you can see directly whether the tuples are in the same page or not. But I wouldn't delete stuff at random. If you can select it without a crash then it's OK. I would only consider using dd if it looked like the page header is broken (symptom: page appears to contain many more tuples than could possibly fit, and all but the first few are obviously completely garbage). regards, tom lane
Re: process exited with status 11 after XLogFlush: request is not satisfied
From: "Bjoern Metzdorf"
Date:
Hi Tom,

> Given the above facts, that's what I suspect. You may have something
> like a disk controller that sometimes passes bad data. Or a bad RAM
> chip that sometimes drops bits. If it weren't a flaky-hardware kind
> of thing I'd expect the results to be reproducible.

We are indeed having some abnormal phenomena with this controller, especially during the boot process. So we stopped the postmaster, tarred everything up and moved it to another machine, which has proven to be stable as hell. Then we did the select tests again:

"select ctid,* from table order by id limit 81273;" works fine.
"select ctid,* from table order by id limit 81274;" crashes.

So far so good; we copied the inconsistency over to the other machine, ok. Up to now we have not encountered irreproducible problems: with 81273 it works, with +1 it does not.

(sidenote: "select ctid,oid,* from table order by id limit 81274;" works flawlessly, dunno why..)

On the old machine, the one with the probably corrupt controller, we deleted all entries in ctids 5858 to 5862 (not with "dd" but with "delete from table where ctid in ('(5858,0)','(5858,1)', ...)").

If we now do a
"select ctid,oid,* from table order by id limit 1,81274;" (which works flawlessly)
we see that this tuple is in ctid (5864,12).

We could of course now delete everything from ctid 5864 and perhaps 5865 and 5866 and so on, but is this the right way? Is deletion with delete statements ok? Or do we have to use dd (in which case I would be very grateful if someone gave me some hints) to zero out these blocks? What else can we do?

greetings,
Bjoern
I believe that you can return CURSORs from plpgsql functions, however I have not been able to figure out exactly how.

Anybody have any ideas?

Darren Ferguson
On Thu, 31 Jan 2002, Darren Ferguson wrote:
> I believe that you can return CURSORs from plpgsql functions, however I
> have not been able to figure out exactly how.
>
> Anybody have any ideas?

create table aa(a int, b int, c int);

create function f() returns refcursor as '
declare
    r refcursor;
begin
    open r for select * from aa;
    return r;
end;' language 'plpgsql';
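A possible usage sketch, not from the original thread: since an unnamed refcursor gets an automatically generated portal name, a variant that names the cursor explicitly is easier to call from psql. The names f_named and mycursor below are hypothetical; the caller must fetch within the same transaction, while the portal is still open.

create function f_named() returns refcursor as '
declare
    r refcursor;
begin
    r := ''mycursor'';               -- fix the portal name so the caller knows it
    open r for select * from aa;     -- aa is the table created in the reply above
    return r;
end;' language 'plpgsql';

begin;
select f_named();        -- opens the cursor and returns the name mycursor
fetch all in mycursor;   -- read the rows from the still-open portal
commit;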
Re: process exited with status 11 after XLogFlush: request is not satisfied
From: "Bjoern Metzdorf"
Date:
Hi Tom,

> One thing to watch out for is that I believe "select ... limit" fetches
> one more tuple than it actually needs to return. Normally this doesn't
> matter, but with a crashing issue it might. I would recommend using

Good to know. In this case tuple N+1 is in the very same page as N.

> select ctid,* from table where ctid = '(page,item)'
>
> to make absolutely sure that you know which ctid is the bad tuple (if no
> crash, you got the wrong one), and then delete by ctid to get rid of it.

This is something I don't understand:

A "select ctid,* from table order by id limit 81273;" works,
a "select ctid,* from players order by id limit 81274;" crashes,
a "select ctid,* from players order by id limit 1,81274;" crashes,
a "select ctid,oid,* from players order by id limit 1,81274;" works (gives back 5864,12 for ctid),
and a "select ctid,* from players where ctid = '(5864,12)';" works also.

If I do the same "select where ctid" test on all ctids of neighbouring tuples ("select ctid from players order by id limit 20,81270;"), it is the same (works).

(btw, if I do "select ctid from players order by id limit 20,81270;", I get the desired output BUT the postmaster crashes (status 139)).

The only way to reproduce the crash is by using limit. I'm doing these tests now again without the "order by id".

> It's entirely possible that several tuples on the same page are all
> corrupted, which is the main reason why ctid is so helpful in this
> situation --- you can see directly whether the tuples are in the same
> page or not. But I wouldn't delete stuff at random. If you can select
> it without a crash then it's OK.
>
> I would only consider using dd if it looked like the page header is
> broken (symptom: page appears to contain many more tuples than could
> possibly fit, and all but the first few are obviously complete
> garbage).

Greetings,
Bjoern
"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes: > If I do the same "select where ctid"-test on all ctid's of neighboured > tuples ("select ctid from players order by id limit 20,81270;") it is the > same (works). Um, the "order by id" is not accomplishing much here except to confuse the issue. Tuples adjacent in id order are not necessarily anywhere near each other physically. Why don't you just do select ctid,* from players limit N Since this is a heapscan it will return the results in ctid order. An offset isn't going to do much except confuse the issue either, because the tuples skipped by the offset are still fetched and might still provoke a crash. Don't forget also that you *must* select the columns. "select ctid from..." alone will probably not crash no matter how badly the body of the tuple is mangled, 'cause it ain't gonna look at the body. What I'd do to locate the bad tuple(s) is 1. Home in on an approximate value of N that separates where select * from players limit N crashes vs. where it doesn't crash. 2. Get the ctid's of the tuples around this point, eg select ctid from players offset N-100 limit 200 As indicated, you can probably retrieve the ctids even if the tuples are busted. 3. Probe these tuples individually using select * from players where ctid = '(m,n)'; Or you could just set up a script that does select ctid from players and then tries to retrieve each one individually by ctid. A tad tedious but easily automated. If you suspect there might be corruption in several different areas of the table then you'll end up doing this anyway. regards, tom lane