Thread: process exited with status 11 after XLogFlush: request is not satisfied


From: "Bjoern Metzdorf"
Hello,

I'm running postgres 7.1.2 / 7.1.3 on Linux 2.2.17 (Debian Potato).

Tonight we were suffering from

FATAL 2:  XLogFlush: request is not satisfied

in the logfile, followed by crashes:

Server process (pid 26888) exited with status 512 at Wed Jan 30 09:51:37 2002

This was with 7.1.2. As suggested in the archives, I ran pg_resetxlog on the
data directory (after taking a backup). After that, the "XLogFlush: request
is not satisfied" errors vanished, but now we are suffering from several
"Server process (pid 5060) exited with status 11" crashes.

At first I suspected hard drive or memory problems, but the disks (ICP Vortex
RAID 5) seem to be fine, as does the memory (I tested it with memtester:
http://www.qcc.sk.ca/~charlesc/software/memtester/).

Then I upgraded to 7.1.3, but the problem persists. Here are the exact
logfile entries, with wal_debug and debug_level set to 2, after the upgrade
to 7.1.3:

---------------------------------snip---------------------------------------

INSERT @ 30/1818723404: prev 30/1818723368; xprev 30/0; xid 162572613: Heap - insert: node 1057183/1256672; tid 16173/4
INSERT @ 30/1818725464: prev 30/1818723404; xprev 30/1818723404; xid 162572613: Btree - insert: node 1057183/1256684; tid 7812/421
INSERT @ 30/1818725524: prev 30/1818725464; xprev 30/1818725464; xid 162572613: Heap - insert: node 1057183/1256672; tid 16174/1
INSERT @ 30/1818727500: prev 30/1818725524; xprev 30/1818725524; xid 162572613: Btree - insert: node 1057183/1256684; tid 7812/421
INSERT @ 30/1818727560: prev 30/1818727500; xprev 30/1818727500; xid 162572613: Heap - insert: node 1057183/1256657; tid 435/55
INSERT @ 30/1818727692: prev 30/1818727560; xprev 30/1818727560; xid 162572613: Btree - insert: node 1057183/1256687; tid 3239/462
INSERT @ 30/1818727752: prev 30/1818727692; xprev 30/1818727692; xid 162572613: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1818727816: prev 30/1818727752; xprev 30/1818727752; xid 162572613: Btree - insert: node 1057183/1256693; tid 530/8
INSERT @ 30/1818727896: prev 30/1818727816; xprev 30/1818727816; xid 162572613: Btree - insert: node 1057183/1256696; tid 4034/382
2002-01-30 12:41:01 DEBUG:  CommitTransactionCommand
INSERT @ 30/1818727960: prev 30/1818727896; xprev 30/1818727896; xid 162572613: Transaction - commit: 2002-01-30 12:41:01
XLogFlush: rqst 30/1818727996; wrt 30/1818723404; flsh 30/1818723404
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5060 exited with status 11
Server process (pid 5060) exited with status 11 at Wed Jan 30 12:41:01 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5186

---------------------------------snip---------------------------------------

INSERT @ 30/1819347080: prev 30/1819347044; xprev 30/0; xid 162573078: Heap - insert: node 1057183/1256672; tid 16184/3
INSERT @ 30/1819349140: prev 30/1819347080; xprev 30/1819347080; xid 162573078: Btree - insert: node 1057183/1256684; tid 7812/466
INSERT @ 30/1819349200: prev 30/1819349140; xprev 30/1819349140; xid 162573078: Heap - insert: node 1057183/1256672; tid 16184/4
INSERT @ 30/1819350832: prev 30/1819349200; xprev 30/1819349200; xid 162573078: Btree - insert: node 1057183/1256684; tid 7812/466
INSERT @ 30/1819350892: prev 30/1819350832; xprev 30/1819350832; xid 162573078: Heap - insert: node 1057183/1256657; tid 436/10
INSERT @ 30/1819351032: prev 30/1819350892; xprev 30/1819350892; xid 162573078: Btree - insert: node 1057183/1256687; tid 3239/482
INSERT @ 30/1819351092: prev 30/1819351032; xprev 30/1819351032; xid 162573078: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1819351156: prev 30/1819351092; xprev 30/1819351092; xid 162573078; bkpb 1: Btree - insert: node 1057183/1256693; tid 752/3
INSERT @ 30/1819359428: prev 30/1819351156; xprev 30/1819351156; xid 162573078: Btree - insert: node 1057183/1256696; tid 4034/402
2002-01-30 12:41:44 DEBUG:  CommitTransactionCommand
INSERT @ 30/1819359492: prev 30/1819359428; xprev 30/1819359428; xid 162573078: Transaction - commit: 2002-01-30 12:41:44
XLogFlush: rqst 30/1819359528; wrt 30/1819347080; flsh 30/1819347080
2002-01-30 12:41:45 DEBUG:  StartTransactionCommand
2002-01-30 12:41:45 DEBUG:  query: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG:  ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG:  CommitTransactionCommand
2002-01-30 12:41:45 DEBUG:  StartTransactionCommand
2002-01-30 12:41:45 DEBUG:  ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:41:45 DEBUG:  CommitTransactionCommand
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5203 exited with status 11
Server process (pid 5203) exited with status 11 at Wed Jan 30 12:41:45 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5211
postmaster: CleanupProc: sending SIGQUIT to process 5210

---------------------------------snip---------------------------------------

INSERT @ 30/1821302728: prev 30/1821302692; xprev 30/0; xid 162575289: Heap - insert: node 1057183/1256672; tid 16222/1
INSERT @ 30/1821304800: prev 30/1821302728; xprev 30/1821302728; xid 162575289: Btree - insert: node 1057183/1256684; tid 7813/382
INSERT @ 30/1821304860: prev 30/1821304800; xprev 30/1821304800; xid 162575289: Heap - insert: node 1057183/1256672; tid 16222/2
INSERT @ 30/1821306724: prev 30/1821304860; xprev 30/1821304860; xid 162575289: Btree - insert: node 1057183/1256684; tid 7813/382
INSERT @ 30/1821306784: prev 30/1821306724; xprev 30/1821306724; xid 162575289: Heap - insert: node 1057183/1256657; tid 437/5
INSERT @ 30/1821306932: prev 30/1821306784; xprev 30/1821306784; xid 162575289: Btree - insert: node 1057183/1256687; tid 3240/305
INSERT @ 30/1821306992: prev 30/1821306932; xprev 30/1821306932; xid 162575289: Btree - insert: node 1057183/1256690; tid 1/2
INSERT @ 30/1821307056: prev 30/1821306992; xprev 30/1821306992; xid 162575289; bkpb 1: Btree - insert: node 1057183/1256693; tid 1352/7
INSERT @ 30/1821315328: prev 30/1821307056; xprev 30/1821307056; xid 162575289: Btree - insert: node 1057183/1256696; tid 4035/273
2002-01-30 12:43:15 DEBUG:  CommitTransactionCommand
INSERT @ 30/1821315392: prev 30/1821315328; xprev 30/1821315328; xid 162575289: Transaction - commit: 2002-01-30 12:43:15
XLogFlush: rqst 30/1821315428; wrt 30/1821302728; flsh 30/1821302728
2002-01-30 12:43:16 DEBUG:  StartTransactionCommand
2002-01-30 12:43:16 DEBUG:  query: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG:  ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG:  CommitTransactionCommand
2002-01-30 12:43:16 DEBUG:  StartTransactionCommand
2002-01-30 12:43:16 DEBUG:  ProcessUtility: BEGIN;ROLLBACK;
2002-01-30 12:43:16 DEBUG:  CommitTransactionCommand
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5220 exited with status 11
Server process (pid 5220) exited with status 11 at Wed Jan 30 12:43:16 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5279
postmaster: CleanupProc: sending SIGQUIT to process 5278

---------------------------------snip---------------------------------------

2002-01-30 12:45:06 DEBUG:  ProcessQuery
2002-01-30 12:45:06 DEBUG:  CommitTransactionCommand
XLogFlush: rqst 30/1821796188; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821776284; wrt 30/1824031200; flsh 30/1824031200
XLogFlush: rqst 30/1821809412; wrt 30/1823786416; flsh 30/1823786416
2002-01-30 12:45:06 DEBUG:  CommitTransactionCommand
XLogFlush: rqst 30/1821811520; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821822152; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821848796; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821861820; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821884244; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821896940; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821900460; wrt 30/1823786416; flsh 30/1823786416
XLogFlush: rqst 30/1821910400; wrt 30/1823786416; flsh 30/1823786416
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 5375 exited with status 11
Server process (pid 5375) exited with status 11 at Wed Jan 30 12:45:06 2002
Terminating any active server processes...
postmaster: CleanupProc: sending SIGQUIT to process 5398
postmaster: CleanupProc: sending SIGQUIT to process 5397

---------------------------------snip---------------------------------------

Is this due to a corrupted record? If so, how can I safely delete it?

Hope you can help me.

Thanks in advance.

greetings,

Bjoern




"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes:
> Is this due to a corrupted record? If so, how can I safely delete it?

Hard to tell, though it seems like a good bet.  A stack backtrace from
the core file produced by one of the crashed backends would tell us
more.

If you don't see any core file in the $PGDATA/base/yourdb/ directory,
restart the postmaster in an environment with "ulimit -c unlimited" set,
so that core dumps are allowed to occur.

            regards, tom lane

Re: process exited with status 11 after XLogFlush: request is not satisfied

From: "Bjoern Metzdorf"
Hi Tom,

> Hard to tell, though it seems like a good bet.  A stack backtrace from
> the core file produced by one of the crashed backends would tell us
> more.

OK, I have a core file now. Running it through gdb 5.1.1, this is what comes
out:

-----------------------snip--------------------------

Core was generated by `postgres: espl espl 213.148.1'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x08066109 in nocachegetattr ()
(gdb) bt
#0  0x08066109 in nocachegetattr ()
#1  0x080c0bfc in ExecEvalVar ()
#2  0x080c16bb in ExecEvalExpr ()
#3  0x080c179d in ExecEvalExpr ()
#4  0x080c10c8 in ExecEvalFuncArgs ()
#5  0x080c1144 in ExecMakeFunctionResult ()
#6  0x080c1346 in ExecEvalFunc ()
#7  0x080c1740 in ExecEvalExpr ()
#8  0x080c10c8 in ExecEvalFuncArgs ()
#9  0x080c1144 in ExecMakeFunctionResult ()
#10 0x080c1346 in ExecEvalFunc ()
#11 0x080c1740 in ExecEvalExpr ()
#12 0x080c10c8 in ExecEvalFuncArgs ()
#13 0x080c1144 in ExecMakeFunctionResult ()
#14 0x080c12ee in ExecEvalOper ()
#15 0x080c1730 in ExecEvalExpr ()
#16 0x080c1856 in ExecQual ()
#17 0x080c1d26 in ExecScan ()
#18 0x080c742a in ExecSeqScan ()
#19 0x080c03b9 in ExecProcNode ()
#20 0x080c7bba in ExecSort ()
#21 0x080c0449 in ExecProcNode ()
#22 0x080c6406 in ExecMergeJoin ()
#23 0x080c0409 in ExecProcNode ()
#24 0x080bf37d in ExecutePlan ()
#25 0x080bea57 in ExecutorRun ()
#26 0x0810579b in ProcessQuery ()
#27 0x081041c1 in pg_exec_query_string ()
#28 0x08105254 in PostgresMain ()
#29 0x080efe34 in DoBackend ()
#30 0x080ef9ec in BackendStartup ()
#31 0x080eec19 in ServerLoop ()
#32 0x080ee603 in PostmasterMain ()
#33 0x080cebfd in main ()
#34 0x400a2a42 in __libc_start_main () from /lib/libc.so.6

-----------------------snip--------------------------

System: Linux 2.2.17-SMP, PG 7.1.3

Right now the symptoms have changed: we don't get any "status 11" anymore,
but LOTS of

"Server process (pid 2644) exited with status 139 at Wed Jan 30 17:26:50 2002".

When we disable access to the db espl, the server runs fine, so I assume a
corrupted record in one of the espl tables.

Any hints on how to find it?

greetings

Bjoern



"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes:
> (gdb) bt
> #0  0x08066109 in nocachegetattr ()
> #1  0x080c0bfc in ExecEvalVar ()
> #2  0x080c16bb in ExecEvalExpr ()
> #3  0x080c179d in ExecEvalExpr ()
> #4  0x080c10c8 in ExecEvalFuncArgs ()

Yeah, that does look like a corrupted-data problem.  If you wanted to
rebuild with debugging symbols turned on, it might be possible to
extract the location of the bad tuple directly from the corefile.
Short of that, what I'd do is find out what query the backend is
crashing on (turn on debug query logging), and then investigate the
tables involved using queries like "select ctid,* from foo limit N".
By varying the limit you can home in on just where the bad tuple is.
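
For example, the bisection might run like this (table foo and the row
counts here are hypothetical):

    select ctid,* from foo limit 100000;  -- crashes
    select ctid,* from foo limit 50000;   -- works: bad tuple is past row 50000
    select ctid,* from foo limit 75000;   -- crashes: it is in rows 50001-75000
    select ctid,* from foo limit 62500;   -- works: it is in rows 62501-75000
    -- ...continue halving; once the crash boundary is narrow, list the
    -- ctids just beyond the last good row (this usually survives, since
    -- the tuple bodies are not examined):
    select ctid from foo offset 62500 limit 100;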

            regards, tom lane

Re: process exited with status 11 after XLogFlush: request is not satisfied

From: "Bjoern Metzdorf"
Hi,

> Yeah, that does look like a corrupted-data problem.  If you wanted to
> rebuild with debugging symbols turned on, it might be possible to
> extract the location of the bad tuple directly from the corefile.
> Short of that, what I'd do is find out what query the backend is
> crashing on (turn on debug query logging), and then investigate the
> tables involved using queries like "select ctid,* from foo limit N".
> By varying the limit you can home in on just where the bad tuple is.

I tried the "select ctid,* from table limit N" approach and found a
possibly corrupted page. I then deleted all tuples on that page manually.

It all looked good, but no, the problem persists.

5 minutes ago I did a

"select ctid,* from table order by id limit 82000"

and it worked flawlessly.

Then I did a

"select ctid,* from table order by id limit 200000"

and it crashed again.

I again tried

"select ctid,* from table order by id limit 82000"

and it crashed here also.

What do you suspect here? Hardware failure? I ran a badblocks read-only test
and it was fine. I tested memory with memtester and it was fine...

How can this kind of corruption happen at all? And what can I do next?

greetings,

Bjoern






"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes:
> It all looked good, but no, the problem persists.

> 5 minutes ago I did a

> "select ctid,* from table order by id limit 82000"

> and it worked flawlessly.

> Then I did a

> "select ctid,* from table order by id limit 200000"

> and it crashed again.

> I again tried

> "select ctid,* from table order by id limit 82000"

> and it crashed here also.

> What do you suspect here? Hardware failure?

Given the above facts, that's what I suspect.  You may have something
like a disk controller that sometimes passes bad data.  Or a bad RAM
chip that sometimes drops bits.  If it weren't a flaky-hardware kind
of thing I'd expect the results to be reproducible.

> I ran a badblocks read-only test
> and it was fine. I tested memory with memtester and it was fine...

Can anyone suggest other diagnostic tools to try?  I don't think either
of the above tests would be likely to notice a disk controller that
sometimes returns wrong data, for example.

            regards, tom lane

"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes:
> If we now do a

> "select ctid,oid,* from table order by id limit 1,81274;" (which works
> flawlessly)

> we see that this tuple is in ctid (5864,12).

One thing to watch out for is that I believe "select ... limit" fetches
one more tuple than it actually needs to return.  Normally this doesn't
matter but with a crashing issue it might.  I would recommend using

    select ctid,* from table where ctid = '(page,item)'

to make absolutely sure that you know which ctid is the bad tuple (if no
crash, you got the wrong one), and then delete by ctid to get rid of it.
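
Spelled out (foo and the page/item numbers are placeholders), the sequence
would be:

    -- Probe the suspect tuple; a crash here confirms this is the bad one.
    select ctid,* from foo where ctid = '(page,item)';
    -- After reconnecting (the probe killed that session), delete exactly
    -- that tuple.  This normally works even when the SELECT dies, since
    -- the DELETE does not need to decode the tuple body.
    delete from foo where ctid = '(page,item)';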

It's entirely possible that several tuples on the same page are all
corrupted, which is the main reason why ctid is so helpful in this
situation --- you can see directly whether the tuples are in the same
page or not.  But I wouldn't delete stuff at random.  If you can select
it without a crash then it's OK.

I would only consider using dd if it looked like the page header is
broken (symptom: page appears to contain many more tuples than could
possibly fit, and all but the first few are obviously completely
garbage).

            regards, tom lane

Re: process exited with status 11 after XLogFlush: request is not satisfied

From: "Bjoern Metzdorf"
Hi Tom,

> Given the above facts, that's what I suspect.  You may have something
> like a disk controller that sometimes passes bad data.  Or a bad RAM
> chip that sometimes drops bits.  If it weren't a flaky-hardware kind
> of thing I'd expect the results to be reproducible.

We are indeed having some abnormal phenomena with this controller,
especially during the boot process. So we stopped the postmaster, tarred
everything up and moved it to another machine, which has proven to be
stable as hell.

Then we did the select-tests again:

"select ctid,* from table order by id limit 81273;"

works fine.

"select ctid,* from table order by id limit 81274;"

crashes. So far so good: the inconsistency was copied over to the other
machine along with the data. Up to now we have not encountered any
irreproducible behaviour: with 81273 it works, with one more it does not.

(Side note: "select ctid,oid,* from table order by id limit 81274;" works
flawlessly; we don't know why.)

On the old machine, the one with the probably corrupt controller, we
deleted all entries in pages 5858 to 5862 (not with "dd" but with "delete
from table where ctid in ('(5858,0)','(5858,1)', ...)").

If we now do a

"select ctid,oid,* from table order by id limit 1,81274;" (which works
flawlessly)

we see that this tuple is in ctid (5864,12).

We could of course now delete everything from page 5864 and perhaps 5865
and 5866 and so on, but is this the right way? Is deletion with DELETE
statements OK? Or do we have to use dd (in which case I would be very
grateful for some hints) to zero out these blocks? What else can we do?

greetings,

Bjoern







Returning a CURSOR from plpgsql functions

From: Darren Ferguson
I believe that you can return cursors from plpgsql functions; however, I
have not been able to figure out exactly how.

Anybody have any ideas?

Darren Ferguson


Re: Returning a CURSOR from plpgsql functions

From: Stephan Szabo
On Thu, 31 Jan 2002, Darren Ferguson wrote:

> I believe that you can return cursors from plpgsql functions; however, I
> have not been able to figure out exactly how.
>
> Anybody have any ideas?

create table aa(a int, b int, c int);
create function f() returns refcursor as '
declare
 r refcursor;
begin
 open r for select * from aa;
 return r;
end;' language 'plpgsql';
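
A caller then has to fetch from the returned cursor inside the same
transaction. A minimal sketch of the calling side (f2 and mycursor are
made-up names; pre-setting the refcursor variable chooses the portal name,
so the FETCH can refer to it):

create function f2() returns refcursor as '
declare
 r refcursor;
begin
 r := ''mycursor'';  -- choose the portal name up front
 open r for select * from aa;
 return r;           -- the function returns the name ''mycursor''
end;' language 'plpgsql';

begin;               -- cursors only live inside a transaction block
select f2();
fetch all from mycursor;
commit;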




Re: process exited with status 11 after XLogFlush: request is not satisfied

From: "Bjoern Metzdorf"
Hi Tom,

> One thing to watch out for is that I believe "select ... limit" fetches
> one more tuple than it actually needs to return.  Normally this doesn't
> matter but with a crashing issue it might.  I would recommend using

Good to know. In this case tuple N+1 is on the very same page as N.

> select ctid,* from table where ctid = '(page,item)'
>
> to make absolutely sure that you know which ctid is the bad tuple (if no
> crash, you got the wrong one), and then delete by ctid to get rid of it.

This is something I don't understand:

A "select ctid,* from table order by id limit 81273;" works,
a "select ctid,* from players order by id limit 81274;" crashes,
a "select ctid,* from players order by id limit 1,81274;" crashes,
a "select ctid,oid,* from players order by id limit 1,81274;" works (gives
back 5864,12 for ctid),
and a "select ctid,* from players where ctid = '(5864,12)';" works also.

If I do the same "select where ctid" test on all ctids of neighbouring
tuples ("select ctid from players order by id limit 20,81270;") it is the
same (works). (By the way, if I do "select ctid from players order by id
limit 20,81270;", I get the desired output BUT the postmaster crashes
afterwards (status 139).)

The only way to reproduce the crash is by using LIMIT. I'm now redoing
these tests without the "order by id".

> It's entirely possible that several tuples on the same page are all
> corrupted, which is the main reason why ctid is so helpful in this
> situation --- you can see directly whether the tuples are in the same
> page or not.  But I wouldn't delete stuff at random.  If you can select
> it without a crash then it's OK.
>
> I would only consider using dd if it looked like the page header is
> broken (symptom: page appears to contain many more tuples than could
> possibly fit, and all but the first few are obviously completely
> garbage).

Greetings,

Bjoern



"Bjoern Metzdorf" <bm@turtle-entertainment.de> writes:
> If I do the same "select where ctid" test on all ctids of neighbouring
> tuples ("select ctid from players order by id limit 20,81270;") it is the
> same (works).

Um, the "order by id" is not accomplishing much here except to confuse
the issue.  Tuples adjacent in id order are not necessarily anywhere
near each other physically.  Why don't you just do

        select ctid,* from players limit N

Since this is a heapscan it will return the results in ctid order.

An offset isn't going to do much except confuse the issue either,
because the tuples skipped by the offset are still fetched and might
still provoke a crash.

Don't forget also that you *must* select the columns.  "select ctid from..."
alone will probably not crash no matter how badly the body of the tuple
is mangled, 'cause it ain't gonna look at the body.

What I'd do to locate the bad tuple(s) is

1. Home in on an approximate value of N that separates where

        select * from players limit N

crashes vs. where it doesn't crash.

2. Get the ctids of the tuples around this point, e.g.

        select ctid from players offset N-100 limit 200

As indicated, you can probably retrieve the ctids even if the tuples
are busted.

3. Probe these tuples individually using

        select * from players where ctid = '(m,n)';


Or you could just set up a script that does
        select ctid from players
and then tries to retrieve each one individually by ctid.  A tad tedious
but easily automated.  If you suspect there might be corruption in
several different areas of the table then you'll end up doing this anyway.
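
That probe list can even be generated with a query. A sketch (the
tid-to-text concatenation below works on later releases; on a server this
old you may have to glue the ctids into statements with an external script
instead):

        -- Emit one probe statement per tuple.
        select 'select * from players where ctid = ''' || ctid || ''';'
        from players;

Feed the output back through psql one statement at a time and note which
probe kills the backend.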

            regards, tom lane