Thread: right sibling is not next child

right sibling is not next child

From
"Kevin Grittner"
Date:
I'm reporting this as a PostgreSQL bug because it involves an index
corruption.  I can't see any other way our application should be able to
corrupt an index.  I will attach the tail of the log when the corruption
was detected (and the postmaster shut itself down), as well as the
subsequent attempt to start.  Fortunately we run our web site off a
farm of four database servers, so we are taking one of the others out of
the mix, stopping the postmaster, and copying its data directory over to
this machine for recovery, so we don't need advice on that aspect of
things; but we'd like to do what we can to help track down the cause to
prevent a recurrence.  We have renamed the data directory to make room
for recovery at the normal location, but otherwise the failing data
directory structure is unmodified.

For context, this is running on Windows 2003 Server.  Eight-way Xeon box,
no HT, 6 GB RAM, 13-drive 15,000 RPM RAID5 array through a battery-backed
controller for everything.  This database is about 180 GB with about 300
tables.  We have autovacuum running every ten seconds because of a few
very small tables with very high update rates, and we have a scheduled
VACUUM ANALYZE VERBOSE every night.  It appears that last night's vacuum
found the problem, which the previous night's vacuum didn't.  We had
some event which started at 14:25 yesterday and persisted until we
restarted the middle tier at 15:04.  The symptom was that a fraction of
the queries which normally run in a few ms were timing out on a 20
second limit.  pg_locks showed no blocking.  We've been getting episodes
with these symptoms occasionally, but they have only lasted a minute or
two; this duration was unusual.  We haven't identified a cause.  One odd
thing is that, given the number of queries per second that we run, the
number of timeouts during an episode is too small to support the notion
that _all_ similar queries are failing.

How best to proceed?

-Kevin



Attachment

Re: right sibling is not next child

From
"Peter Brant"
Date:
Sorry about the delay in responding.  We had a bit of difficulty with
the test machine.  Kevin is also on vacation this week.

The problem is repeatable with a VACUUM.  I've found the offending
block.  A (partial) pg_filedump of that block is pasted in below.  I'm a
little lost as to what the next step is though.  Any pointers would be
appreciated.

Pete

"Kevin Grittner" <Kevin ( dot ) Grittner ( at ) wicourts ( dot ) gov>
writes:
> [2006-04-06 02:19:57.460 ] 3848 <postgres bigbird 127.0.0.1(3944)>
> PANIC:  right sibling is not next child in "Panel_pkey"

This should be repeatable by re-attempting a VACUUM, right?  Please find
out which page exactly it's unhappy about (either gdb the crash or add a
printout of the "parent" variable to the elog call in nbtpage.c), then
pg_filedump the index and look to see what the index contains.

            regards, tom lane

PANIC:  right sibling is not next child in "Panel_pkey", parent is 271

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -R 271 -i
*
* Dump created on: Tue Apr 11 13:38:44 2006
*******************************************************************

Block  271 ********************************************************
<Header> -----
 Block Offset: 0x0021e000         Offsets: Lower     468 (0x01d4)
 Block: Size 8192  Version    3            Upper    1952 (0x07a0)
 LSN:  logid    246 recoff 0x265d47c0      Special  8176 (0x1ff0)
 Items:  112                   Free Space: 1484
 Length (including item array): 472

<Data> ------
 Item   1 -- Length:   56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 132  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item   2 -- Length:    8  Offset: 8112 (0x1fb0)  Flags: USED
  Block Id: 201  linp Index: 1  Size: 8
  Has Nulls: 0  Has Varwidths: 0

 Item   3 -- Length:   56  Offset: 8056 (0x1f78)  Flags: USED
  Block Id: 257  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

... snip ...

 Item 109 -- Length:   56  Offset: 4920 (0x1338)  Flags: USED
  Block Id: 121  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item 110 -- Length:   56  Offset: 4864 (0x1300)  Flags: USED
  Block Id: 133  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item 111 -- Length:   56  Offset: 4808 (0x12c8)  Flags: USED
  Block Id: 134  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item 112 -- Length:   56  Offset: 4752 (0x1290)  Flags: USED
  Block Id: 137  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384


<Special Section> -----
 BTree Index Section:
  Flags: 0x0000 ()
  Blocks: Previous (167)  Next (455)  Level (1)


*** End of Requested Range Encountered. Last Block Read: 271 ***

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> PANIC:  right sibling is not next child in "Panel_pkey", parent is 271

Hmm ... that's not actually enough info to tell us where to look,
is it :-(.  Please add the following variables to the elog message, or
gdb for them if you can:
    target
    rightsib
    nextoffset

            regards, tom lane
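
For reference, the change being asked for amounts to widening the existing
PANIC in _bt_pagedel(); a rough sketch, not a verbatim 8.1 patch (the exact
surrounding check and variable declarations live in
src/backend/access/nbtree/nbtpage.c, and may differ from what is shown here):

    if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)   /* existing check; exact form may differ */
        elog(PANIC, "right sibling is not next child in \"%s\", "
                    "parent is %u, target is %u, rightsib is %u, nextoffset is %u",
             RelationGetRelationName(rel),
             parent, target, rightsib, (unsigned int) nextoffset);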

Re: right sibling is not next child

From
"Peter Brant"
Date:
Try as I might, I wasn't able to get a JIT debugger to give me a memory
dump.  It seems like postgres.exe is not really crashing in the
"unhandled exception" sense (see gdb log below)?  Am I missing a
configure option?

(As an aside, what's the best way to get a core dump on Windows?  Can
gdb read memory dumps produced by NTSD?)

I was able to attach to the running backend using gdb.  That worked
(kind of, see log below).

I ended up modifying the elog again with the following results:

PANIC:  right sibling is not next child in "Panel_pkey", parent is 271,
target is 635, rightsib is 629, nextoffset is 87

C:\WINDOWS\system32>C:\mingw\bin\gdb C:\pgsql\bin\postgres.exe
... snip ...
(gdb) attach 2084
Attaching to program `C:\pgsql\bin\postgres.exe', process 2084
[Switching to thread 2084.0x930]
(gdb) break nbtpage.c:983
Breakpoint 1 at 0x41f042: file nbtpage.c, line 983.
(gdb) continue
Continuing.
[Switching to thread 2084.0x81c]

Breakpoint 1, _bt_pagedel (rel=0x196bbb0, buf=1089, vacuum_full=0
'\0')
    at nbtpage.c:983
983     nbtpage.c: No such file or directory.
        in nbtpage.c
(gdb) print parent
$1 = 271
(gdb) print target
$2 = 635
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) print nextoffset
$3 = 87
(gdb) print leftsib
$4 = 636
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) continue
Continuing.

Program exited with code 03.
(gdb)

Pete




>>> Tom Lane <tgl@sss.pgh.pa.us> 04/11/06 9:19 pm >>>
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> PANIC:  right sibling is not next child in "Panel_pkey", parent is 271

Hmm ... that's not actually enough info to tell us where to look,
is it :-(.  Please add the following variables to the elog message, or
gdb for them if you can:
    target
    rightsib
    nextoffset

            regards, tom lane


Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> I ended up modifying the elog again with the following results:
> PANIC:  right sibling is not next child in "Panel_pkey", parent is 271,
> target is 635, rightsib is 629, nextoffset is 87

OK, so the part of the pg_filedump info we need to see is items 86/87
in block 271, plus a few around them for good measure.

Are the values of the keys in this index sensitive data, or are they
just serial numbers of some kind?  If we could see pg_filedump -i -f
rather than just -i, it might help.  I'd like to see the dumps of pages
635, 636, 629 as well (mostly their "special space" contents).

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
The index data isn't sensitive, but I should ask for permission
nonetheless.  I'll send over the '-f' output tomorrow morning.

Pete


*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 271
*
* Dump created on: Tue Apr 11 18:22:24 2006
*******************************************************************

Block  271 ********************************************************
<Header> -----
 Block Offset: 0x0021e000         Offsets: Lower     468 (0x01d4)
 Block: Size 8192  Version    3            Upper    1952 (0x07a0)
 LSN:  logid    246 recoff 0x265d47c0      Special  8176 (0x1ff0)
 Items:  112                   Free Space: 1484
 Length (including item array): 472

<Data> ------

... snip ...

 Item  80 -- Length:   56  Offset: 3352 (0x0d18)  Flags: USED
  Block Id: 581  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  81 -- Length:   56  Offset: 3128 (0x0c38)  Flags: USED
  Block Id: 580  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  82 -- Length:   56  Offset: 2848 (0x0b20)  Flags: USED
  Block Id: 606  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  83 -- Length:   56  Offset: 2736 (0x0ab0)  Flags: USED
  Block Id: 608  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  84 -- Length:   56  Offset: 2792 (0x0ae8)  Flags: USED
  Block Id: 601  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  85 -- Length:   56  Offset: 2120 (0x0848)  Flags: USED
  Block Id: 640  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  86 -- Length:   56  Offset: 2176 (0x0880)  Flags: USED
  Block Id: 635  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  87 -- Length:   56  Offset: 2232 (0x08b8)  Flags: USED
  Block Id: 636  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  88 -- Length:   56  Offset: 2288 (0x08f0)  Flags: USED
  Block Id: 635  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  89 -- Length:   56  Offset: 2400 (0x0960)  Flags: USED
  Block Id: 629  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  90 -- Length:   56  Offset: 5704 (0x1648)  Flags: USED
  Block Id: 166  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  91 -- Length:   56  Offset: 5648 (0x1610)  Flags: USED
  Block Id: 339  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  92 -- Length:   56  Offset: 5592 (0x15d8)  Flags: USED
  Block Id: 372  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  93 -- Length:   56  Offset: 4416 (0x1140)  Flags: USED
  Block Id: 480  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

 Item  94 -- Length:   56  Offset: 5536 (0x15a0)  Flags: USED
  Block Id: 208  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384

... snip ...

<Special Section> -----
 BTree Index Section:
  Flags: 0x0000 ()
  Blocks: Previous (167)  Next (455)  Level (1)


*** End of Requested Range Encountered. Last Block Read: 271 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 635
*
* Dump created on: Tue Apr 11 18:22:43 2006
*******************************************************************

Block  635 ********************************************************
<Header> -----
 Block Offset: 0x004f6000         Offsets: Lower      24 (0x0018)
 Block: Size 8192  Version    3            Upper    8120 (0x1fb8)
 LSN:  logid    239 recoff 0xcabae788      Special  8176 (0x1ff0)
 Items:    1                   Free Space: 8096
 Length (including item array): 28

<Data> ------
 Item   1 -- Length:   56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 452  linp Index: 1  Size: 56
  Has Nulls: 0  Has Varwidths: 16384


<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (636)  Next (629)  Level (0)


*** End of Requested Range Encountered. Last Block Read: 635 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 636
*
* Dump created on: Tue Apr 11 18:22:50 2006
*******************************************************************

Block  636 ********************************************************
<Header> -----
 Block Offset: 0x004f8000         Offsets: Lower      24 (0x0018)
 Block: Size 8192  Version    3            Upper    8120 (0x1fb8)
 LSN:  logid    245 recoff 0x9bfa3b60      Special  8176 (0x1ff0)
 Items:    1                   Free Space: 8096
 Length (including item array): 28

<Data> ------
 Item   1 -- Length:   56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 454  linp Index: 27  Size: 56
  Has Nulls: 0  Has Varwidths: 16384


<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (640)  Next (635)  Level (0)


*** End of Requested Range Encountered. Last Block Read: 636 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 629
*
* Dump created on: Tue Apr 11 18:23:01 2006
*******************************************************************

Block  629 ********************************************************
<Header> -----
 Block Offset: 0x004ea000         Offsets: Lower     244 (0x00f4)
 Block: Size 8192  Version    3            Upper    5040 (0x13b0)
 LSN:  logid    239 recoff 0xcabae788      Special  8176 (0x1ff0)
 Items:   56                   Free Space: 4796
 Length (including item array): 248

<Data> ------
... snip ...

<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (635)  Next (166)  Level (0)


*** End of Requested Range Encountered. Last Block Read: 629 ***



>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 1:06 am >>>
OK, so the part of the pg_filedump info we need to see is items 86/87
in block 271, plus a few around them for good measure.

Are the values of the keys in this index sensitive data, or are they
just serial numbers of some kind?  If we could see pg_filedump -i -f
rather than just -i, it might help.  I'd like to see the dumps of pages
635, 636, 629 as well (mostly their "special space" contents).

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE it
actually doesn't even get to Panel and errors out with:

INFO:  analyzing "public.MaintCode"
INFO:  "MaintCode": scanned 1 of 1 pages, containing 19 live rows and 0
dead rows; 19 rows in sample, 19 estimated total rows
ERROR:  duplicate key violates unique constraint
"pg_statistic_relid_att_index"

MaintCode is a tiny, static table.  Does that indicate further
corruption or is there a more benign explanation?  Could these errors
not be Postgres' fault? (e.g. hardware-related)

Pete

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE it
> actually doesn't even get to Panel and errors out with:

> ERROR:  duplicate key violates unique constraint
> "pg_statistic_relid_att_index"

Hm, my eyebrows just disappeared over the back of my head somewhere.
Is that repeatable?  Does a REINDEX on pg_statistic make it go away?

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
It is repeatable.  A reindex doesn't work.

Pete

bigbird=# vacuum analyze "MaintCode";
ERROR:  duplicate key violates unique constraint
"pg_statistic_relid_att_index"


bigbird=# vacuum analyze verbose "MaintCode";
INFO:  vacuuming "public.MaintCode"
INFO:  index "MaintCode_pkey" now contains 19 row versions in 2 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "MaintCode": found 0 removable, 19 nonremovable row versions in
1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_90472"
INFO:  index "pg_toast_90472_index" now contains 0 row versions in 1
pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_90472": found 0 removable, 0 nonremovable row versions
in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.MaintCode"
INFO:  "MaintCode": scanned 1 of 1 pages, containing 19 live rows and 0
dead rows; 19 rows in sample, 19 estimated total rows
ERROR:  duplicate key violates unique constraint
"pg_statistic_relid_att_index"


bigbird=# reindex table pg_statistic;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.


bigbird=# select * from "MaintCode";
 maintCode | colorAttrb |           descr           | isActive | maintPriority | stopsTheClock | videoAttrb | wcisSpecStatus
-----------+------------+---------------------------+----------+---------------+---------------+------------+----------------
 AA        | R          | Annual accounts           | t        |            75 | t             | N          |
 AP        | R          | Appeal                    | t        |            30 | f             | N          |
 AW        | R          | Awaiting Papers           | t        |            80 | f             | N          |
 BA        | R          | Bankruptcy                | t        |            20 | t             | N          | S05
 CI        | R          | Case Inactive             | t        |            60 | f             | N          | S03
 CS        | R          | Consent decree            | t        |            90 | t             | N          |
 DP        | R          | Deferred judgt./prosecute | t        |            40 | t             | N          | S07
 EF        | R          | Electronic filing         | t        |            33 | f             | N          |
 ES        | R          | Extension                 | t        |            55 | t             | N          |
 EX        | R          | Expungement               | t        |             5 | f             | N          |
 FX        | R          | Future expungement        | t        |            93 | f             | N          |
 IN        | R          | Interpreter               | t        |            53 | f             | N          |
 JR        | R          | Judicial review           | t        |            75 | t             | N          |
 RO        | R          | Reopen                    | t        |            35 | f             | N          |
 SL        | R          | Sealed Record             | t        |            85 | f             | N          |
 SM        | R          | Sentence modification     | t        |            43 | f             | N          |
 SU        | R          | Case suspended            | t        |            45 | t             | N          |
 UA        | R          | Under Advisement          | t        |            50 | f             | N          | S02
 WA        | R          | Warrant issued            | t        |            10 | t             | F          | S04
(19 rows)

bigbird=#



>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 5:00 am >>>
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE it
> actually doesn't even get to Panel and errors out with:

> ERROR:  duplicate key violates unique constraint
> "pg_statistic_relid_att_index"

Hm, my eyebrows just disappeared over the back of my head somewhere.
Is that repeatable?  Does a REINDEX on pg_statistic make it go away?

            regards, tom lane


Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> bigbird=# vacuum analyze "MaintCode";
> ERROR:  duplicate key violates unique constraint
> "pg_statistic_relid_att_index"

Hm, can you see any rows in pg_statistic with duplicate values of
(starelid, staattnum)?  If so it'd be useful to look at their
ctid/xmin/xmax/cmin/cmax values.

            regards, tom lane

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
>  Item  85 -- Length:   56  Offset: 2120 (0x0848)  Flags: USED
>   Block Id: 640  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384

>  Item  86 -- Length:   56  Offset: 2176 (0x0880)  Flags: USED
>   Block Id: 635  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384

>  Item  87 -- Length:   56  Offset: 2232 (0x08b8)  Flags: USED
>   Block Id: 636  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384

>  Item  88 -- Length:   56  Offset: 2288 (0x08f0)  Flags: USED
>   Block Id: 635  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384

>  Item  89 -- Length:   56  Offset: 2400 (0x0960)  Flags: USED
>   Block Id: 629  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384

>  Item  90 -- Length:   56  Offset: 5704 (0x1648)  Flags: USED
>   Block Id: 166  linp Index: 1  Size: 56
>   Has Nulls: 0  Has Varwidths: 16384


Well, that's pretty dang interesting.  How did block 635 get to be
listed twice?  The sibling links say that the correct sequence is
640, 636, 635, 629, 166 ... so something screwed up the parent level's
keys.

What would be most useful at this point is to look at the keys in
these entries, and compare them to the "high keys" (item 1) of the
individual leaf pages.  I'm wondering what key is in that extra
entry for 635 ...  Did you get permission to show us the keys?

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
I can't find any duplicates?!?

The query

select starelid, staattnum, ctid, xmin, xmax, cmin, cmax
from pg_statistic p1
where (select count(*) from pg_statistic p2 where
  p1.starelid = p2.starelid and p1.staattnum = p2.staattnum) > 1

doesn't turn up anything.  Nor does dumping

select starelid, staattnum from pg_statistic

to a file and using sort | uniq -c

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 4:37 pm >>>
Hm, can you see any rows in pg_statistic with duplicate values of
(starelid, staattnum)?  If so it'd be useful to look at their
ctid/xmin/xmax/cmin/cmax values.

            regards, tom lane

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> I can't find any duplicates?!?

Weirder and weirder.  Maybe the table is OK but the index is corrupt?

Could it be another symptom of the same problem we're seeing in the
Panel_pkey index?  I'm currently theorizing that that index might've
been corrupted during WAL replay (specifically, if the replayer somehow
failed to recognize a split completion, btree_xlog_cleanup would've
inserted an extra entry) and one could certainly imagine multiple
indexes getting corrupted in the same fashion at the same time.  You
did say that this problem showed up shortly after a database crash,
right?

Could you send me a copy of pg_statistic_relid_att_index, off-list
(the actual disk file, not a pg_filedump)?  There's nothing but table
OIDs and attribute numbers in it, so I can't see any reason anyone
would consider it sensitive data.  I've got some really crude code
laying about for scanning an index and looking for inconsistencies ...
it's too ugly to give out, but I'd like to see if it can find anything
wrong with that index.  (I'll probably ask for a copy of Panel_pkey
for the same purpose, if you get permission to show us its keys.)

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
Per the DBAs, there hadn't been any recent crashes before last Thursday.
 A "vacuum analyze verbose" discovered the problem early Thursday
morning.  After the PANIC, the database never came back up (the
heap_clean_redo: no block / full_page_writes = off problem).

One thing that seems strange to me is that the original crash on
Thursday failed on Panel_pkey, but my "vacuum analyze verbose" on a copy
of the crashed database failed on MaintCode /
pg_statistic_relid_att_index.

I'll send over pg_statistic_relid_att_index and Panel_pkey.  Showing
the keys to Panel_pkey is no problem.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 6:03 pm >>>
You did say that this problem showed up shortly after a database crash,
right?

Could you send me a copy of pg_statistic_relid_att_index, off-list
(the actual disk file, not a pg_filedump)?  There's nothing but table
OIDs and attribute numbers in it, so I can't see any reason anyone
would consider it sensitive data.  I've got some really crude code
laying about for scanning an index and looking for inconsistencies ...
it's too ugly to give out, but I'd like to see if it can find anything
wrong with that index.  (I'll probably ask for a copy of Panel_pkey
for the same purpose, if you get permission to show us its keys.)

            regards, tom lane

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> One thing that seems strange to me is that the original crash on
> Thursday failed on Panel_pkey, but my "vacuum analyze verbose" on a copy
> of the crashed database failed on MaintCode /
> pg_statistic_relid_att_index.

I can't find anything particularly wrong with
pg_statistic_relid_att_index.  There are two index entries for many of
the key values, but that's unsurprising if a database-wide vacuum
analyze had been done or was in progress --- just-obsoleted pg_statistic
rows wouldn't have been vacuumed away yet.  I'm inclined to think that
this problem is a side-effect of the fact that you had to use
pg_resetxlog; there was probably an update to pg_statistic that got
lost.  If you want to continue experimenting with this database,
I'd suggest TRUNCATE'ing pg_statistic and rebuilding it via another
VACUUM ANALYZE run.

> I'll send over pg_statistic_relid_att_index and Panel_pkey.  Showing
> the keys to Panel_pkey is no problem.

My little index printing tool shows these entries in Panel_pkey at btree
level zero:

...
40    2006    RES032706     CH0327RES         high key on page 606
40    2006    RES032706     RES032706         383    11
40    2006    RES032706     RES032706         high key on page 608
40    2006    RES040306     CC0403RES         507    14
40    2006    RES040306     CCC0403RES        551    1
40    2006    RES040306     CCC0403RES        high key on page 601
40    2006    RES040306     RES040306         500    1
40    2006    RES040306     RES040306         high key on page 640
40    2006    RES040306     RES040306         high key on page 636
40    2006    RES040306     RES040306         high key on page 635
41    0001    2000POOL      0001              159    3
41    0001    2000POOL      0002              159    4
41    0001    2000POOL      0003              159    5
...

(The first four columns are the key values of this index; the last two
are the pointed-to heap tuple's page/line location.  High keys are
printed after any data keys on the page.)  The down-links at level one
look like:

...
40    2006    RES032706     CCC0327RES        606    1
40    2006    RES032706     CH0327RES         608    1
40    2006    RES032706     RES032706         601    1
40    2006    RES040306     CCC0403RES        640    1
40    2006    RES040306     RES040306         635    1
40    2006    RES040306     RES040306         636    1
40    2006    RES040306     RES040306         635    1
40    2006    RES040306     RES040306         629    1
41    0405    0405          0105              166    1
...

This is fairly interesting because we've got three pages with the same
boundary key.  The bogus entry for page 635 has been inserted where
you'd expect it to get inserted if the insertion were being done on the
basis of key comparison.  (We usually insert a new entry in front of any
ones with the same key.)  But we never do insertions on non-leaf pages
by key comparison!  Upper-level entries are only made during page
splits, by _bt_insert_parent, and that always works by locating the
down-link to the page just split and adding the new entry just after it.
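
To make that distinction concrete, here is a small standalone illustration
(not PostgreSQL source; the keys are simplified stand-ins for the last column
of the real four-column index key, and the helper names are invented):

/*
 * Leaf-level insertions go in front of any equal keys (by key comparison);
 * parent-level downlinks are added just after the downlink of the page that
 * was split (by position, not key).
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
    const char *key;            /* boundary key carried by the downlink */
    int         child;          /* block number the downlink points to */
} DownLink;

/* insertion position chosen by key comparison: before the first equal key */
static int
pos_by_key(const DownLink *links, int n, const char *key)
{
    int         i;

    for (i = 0; i < n; i++)
        if (strcmp(links[i].key, key) >= 0)
            break;
    return i;
}

/* position used for real downlink insertions: just after the split page's entry */
static int
pos_after_child(const DownLink *links, int n, int split_child)
{
    int         i;

    for (i = 0; i < n; i++)
        if (links[i].child == split_child)
            return i + 1;
    return n;
}

int
main(void)
{
    /* the correct parent entries, in sibling order 640, 636, 635, 629 */
    DownLink    links[] = {
        {"CCC0403RES", 640},
        {"RES040306", 636},
        {"RES040306", 635},
        {"RES040306", 629},
    };
    int         n = (int) (sizeof(links) / sizeof(links[0]));

    printf("by key comparison, a new 'RES040306' downlink lands at slot %d\n",
           pos_by_key(links, n, "RES040306"));
    printf("just after the downlink of page 635, it lands at slot %d\n",
           pos_after_child(links, n, 635));
    return 0;
}

With those entries, key comparison puts the new downlink at slot 1, right
after the CCC0403RES entry for 640 (exactly where the bogus entry for 635
sits in the dump above), while position-based insertion after page 635's
own downlink would have put it at slot 3.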

One thing I realize after seeing this is that the index corruption might
be of long standing: the way that the btree search algorithms work, no
search would ever have descended directly to 635 or 636, but instead to
640 and then traverse right from there.  So you'd not have noticed any
malfunction, until the time came to try to delete the page.  That means
we shouldn't assume that the problem was created recently.

Now that we know the problem is associated with keys '40 2006 RES040306
RES040306', can you check your logs and see if anything interesting
happened around the time those keys were being inserted?  (Note: if
Panel_pkey really is a primary key, ie unique, it might seem odd for
there to be several pages worth of identical entries --- this would
imply that the same record had been updated several hundred times
between two vacuums.)

I still kinda like the theory that the extra key got in there because
btree_xlog_cleanup fired inappropriately ... mainly because I can't
think of any other theories ... but I'm still baffled about the details.
Anyone have any ideas?

            regards, tom lane

Re: right sibling is not next child

From
"Peter Brant"
Date:
The middle tier transaction log indicates this record was inserted into
the county database at 2006-03-31 21:00:32.94.  It would have hit the
central databases sometime thereafter (more or less immediately if all
was well).

The Panel table contains some running statistics which are updated
frequently so a bunch of updates to the same record isn't unexpected.

Sure enough, the log on the 31st looks pretty ugly.  A few selections
(this still would have been with full_page_writes = off):

(a bunch of these)
[2006-03-31 13:00:01.720 ] 2636 LOG:  could not fsync segment 0 of
relation 1663/16385/1392439: Permission denied
[2006-03-31 13:00:01.720 ] 2636 ERROR:  storage sync failed on magnetic
disk: Permission denied

[2006-03-31 14:31:05.314 ] 608 LOG:  database system was not properly
shut down; automatic recovery in progress

[2006-03-31 14:50:54.828 ] 4052 WARNING:  relation "DbTranImageStatus"
page 130 is uninitialized --- fixing
[2006-03-31 14:50:55.000 ] 4052 PANIC:  right sibling is not next child
in "DbTranImageStatus_pkey" (as part of an autovacuum)

[2006-03-31 22:13:39.140 ] 2816 <postgres bigbird 127.0.0.1(1658)>
WARNING:  relation "Assess" page 259792 is uninitialized --- fixing
[2006-03-31 22:19:18.484 ] 2816 <postgres bigbird 127.0.0.1(1658)>
WARNING:  relation "Charge" page 338120 is uninitialized --- fixing
[2006-03-31 23:16:55.671 ] 2816 <postgres bigbird 127.0.0.1(1658)>
WARNING:  relation "Case" page 391222 is uninitialized --- fixing

and finally

[2006-03-31 23:54:15.859 ] 2816 <postgres bigbird 127.0.0.1(1658)>
PANIC:  right sibling is not next child in "Panel_pkey"

It looks like this has been happening with the nightly "vacuum analyze"
since then, but nobody noticed until "heap_update_redo: no block" took
things down for the count.

Thanks very much for your efforts.  If nothing else, we've learned a
valuable lesson about keeping a better eye on the logs. :-(

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 10:28 pm >>>
One thing I realize after seeing this is that the index corruption might
be of long standing: the way that the btree search algorithms work, no
search would ever have descended directly to 635 or 636, but instead to
640 and then traverse right from there.  So you'd not have noticed any
malfunction, until the time came to try to delete the page.  That means
we shouldn't assume that the problem was created recently.

Now that we know the problem is associated with keys '40 2006 RES040306
RES040306', can you check your logs and see if anything interesting
happened around the time those keys were being inserted?  (Note: if
Panel_pkey really is a primary key, ie unique, it might seem odd for
there to be several pages worth of identical entries --- this would
imply that the same record had been updated several hundred times
between two vacuums.)

I still kinda like the theory that the extra key got in there because
btree_xlog_cleanup fired inappropriately ... mainly because I can't
think of any other theories ... but I'm still baffled about the details.
Anyone have any ideas?

            regards, tom lane

Re: right sibling is not next child

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> (a bunch of these)
> [2006-03-31 13:00:01.720 ] 2636 LOG:  could not fsync segment 0 of
> relation 1663/16385/1392439: Permission denied
> [2006-03-31 13:00:01.720 ] 2636 ERROR:  storage sync failed on magnetic
> disk: Permission denied

Yoi.  I think we have seen this (ie, bogus "permission denied" write
errors) happen before on Windows, but no one's quite figured out why.
How many is "a bunch", and what time period did they continue over?
(I think you probably would have gotten one per attempted checkpoint,
but want to confirm.)  Did they persist up till that first database
crash at 14:31?  What was the immediate cause of the crash (I'm
expecting a PANIC or possibly an Assert triggered it)?

BTW, what sort of filesystem is the database sitting on, on this machine?

> [2006-03-31 14:50:54.828 ] 4052 WARNING:  relation "DbTranImageStatus"
> page 130 is uninitialized --- fixing

I'd like to think that all of these are accounted for by the "permission
denied" write failures, but it's hard to tell unless you have log
entries for permission failures on all of their OIDs.  (BTW, PG would
probably still have recovered from that, if not for the full_page_writes
= off bug.)

> [2006-03-31 14:50:55.000 ] 4052 PANIC:  right sibling is not next child
> in "DbTranImageStatus_pkey" (as part of an autovacuum)

Hmph.  So there are (at least) two occurrences of this in the DB.
Does that index contain any sensitive data, and if not could I trouble
you for a copy?  I'm still not clear on the mechanism by which the
indexes got corrupted like this.  It's starting to look like there may
have been some system-level misfeasance involved, but I'd still like
to understand it in more detail.

            regards, tom lane

Re: right sibling is not next child

From
Tom Lane
Date:
I wrote:
> Does that index contain any sensitive data, and if not could I trouble
> you for a copy?  I'm still not clear on the mechanism by which the
> indexes got corrupted like this.

Oh, never mind ... I've sussed it.

nbtxlog.c's forget_matching_split() assumes it can look into the page
that was just updated to get the block number associated with a non-leaf
insertion.  This is OK *only if the page has exactly its state at the
time of the WAL record*.  However, btree_xlog_insert() is coded to do
nothing if the page has an LSN larger than the WAL record's LSN --- that
is, if the page reflects a state *later than* this insertion.  So if the
page is newer than that --- say, there were some subsequent insertions at
earlier positions in the page --- forget_matching_split() would pick up
the wrong downlink and hence fail to erase the pending split it should
have erased.

I believe this bug is only latent whenever full_page_writes = on,
because in that situation the first touch of any index page after a
checkpoint will rewrite the whole page, and so we'll never be looking
at an index page state newer than the WAL record.  That explains why
no one has tripped over it before.
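
A self-contained toy model of that ordering hazard (illustrative only: the
structure and function names here are invented, and the real code differs in
every detail):

#include <stdio.h>

#define MAX_ITEMS 16

/* toy "page": downlink block numbers in key order, plus a page LSN */
typedef struct
{
    unsigned long lsn;              /* LSN of the last change applied to the page */
    int         nitems;
    int         items[MAX_ITEMS];   /* child block numbers */
} Page;

/* the interesting fields of a parent-level insertion WAL record */
typedef struct
{
    unsigned long lsn;              /* position of this record in the WAL stream */
    int         offset;             /* slot the downlink was inserted at */
    int         child;              /* block number the new downlink points to */
} InsertRecord;

static void
page_insert(Page *page, int offset, int child, unsigned long lsn)
{
    int         i;

    for (i = page->nitems; i > offset; i--)
        page->items[i] = page->items[i - 1];
    page->items[offset] = child;
    page->nitems++;
    page->lsn = lsn;
}

/* redo: do nothing if the page already reflects a state later than the record */
static void
redo_insert(Page *page, const InsertRecord *rec)
{
    if (page->lsn >= rec->lsn)
        return;
    page_insert(page, rec->offset, rec->child, rec->lsn);
}

/* "cleanup": infer which downlink the record added by re-reading the page */
static int
downlink_seen_by_cleanup(const Page *page, const InsertRecord *rec)
{
    return page->items[rec->offset];
}

int
main(void)
{
    /*
     * The on-disk page already reflects the LSN-100 insertion *and* a later
     * insertion (LSN 200) at an earlier offset, because it was written out
     * after both changes and before the crash.
     */
    Page        page = {0, 0, {0}};
    InsertRecord rec = {100, 1, 635};   /* now replaying the LSN-100 insertion */

    page_insert(&page, 0, 640, 50);     /* pre-existing downlink */
    page_insert(&page, 1, 635, 100);    /* the insertion our record describes */
    page_insert(&page, 1, 700, 200);    /* later insertion at an earlier slot */

    redo_insert(&page, &rec);           /* correctly skipped: page LSN is 200 */
    printf("cleanup thinks the record added a downlink to block %d (it was actually %d)\n",
           downlink_seen_by_cleanup(&page, &rec), rec.child);
    return 0;
}

The cleanup step reads slot 1 and finds the later downlink (700) instead of
the one the record actually added (635), so it would fail to match, and
hence fail to forget, the pending split it should have forgotten.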

The particular case we are looking at in Panel_pkey seems to require
some additional assumptions to explain the state of the index, but
I've got no doubt this is the core of the problem.

Since we're not going to support full_page_writes = off in 8.1.*,
there's no need for a back-patched fix, but I'll see about making it
safer in HEAD.

            regards, tom lane

Permission denied on fsync / Win32 (was right sibling is not next child)

From
"Peter Brant"
Date:
It turns out we've been getting rather huge numbers of "Permission
denied" errors relating to fsync so perhaps it wasn't really a precursor
to the crash as I'd previously thought.

I've pasted in a complete list following this email covering the time
span from 3/20 to 4/6.  The number in the first column is the number of
times the given log message appeared.

The interesting thing is that _none_ of the referenced relfilenode
numbers actually appear in the file system.

On a possibly related note, I've confirmed there is a race condition on
Windows when the temporary stats file is renamed to the working one.
There is apparently a window where a backend opening the stats file will
find it missing.  I'll send more info later.  I haven't had a chance to
come back to it yet.

Regarding your other questions:
  - The file system is NTFS
  - Regarding the initial crash, looking more closely, I don't think it
was a crash at all (or at most it was a human-induced "crash").  In the
log everything looks normal (assuming the Permission denied errors are
"normal"), and then

[2006-03-31 14:26:30.705 ] 2328 LOG:  received fast shutdown request
[2006-03-31 14:35:33.173 ] 4016 <viewer bigbird 127.0.0.1(3413)> FATAL:
 the database system is shutting down
[2006-03-31 14:35:33.189 ] 6504 <viewer bigbird 127.0.0.1(3414)> FATAL:
 the database system is shutting down
....
[2006-03-31 14:39:53.595 ] 7576 <viewer bigbird 165.219.80.39(36220)>
FATAL:  the database system is shutting down

and in the next log file

[2006-03-31 14:31:05.298 ] 608 LOG:  database system was interrupted at
2006-03-31 13:20:06 Central Standard Time
[2006-03-31 14:31:05.314 ] 608 LOG:  checkpoint record is at
EF/B41D7580
[2006-03-31 14:31:05.314 ] 608 LOG:  redo record is at EF/B41A0C08;
undo record is at 0/0; shutdown FALSE
[2006-03-31 14:31:05.314 ] 608 LOG:  next transaction ID: 295492806;
next OID: 1395901
[2006-03-31 14:31:05.314 ] 608 LOG:  next MultiXactId: 1; next
MultiXactOffset: 0
[2006-03-31 14:31:05.314 ] 608 LOG:  database system was not properly
shut down; automatic recovery in progress

Pete

    747 LOG:  could not fsync segment 0 of relation 1663/16385/1361661:
Permission denied
    414 LOG:  could not fsync segment 0 of relation 1663/16385/1363194:
Permission denied
      2 LOG:  could not fsync segment 0 of relation 1663/16385/1363196:
Permission denied
    441 LOG:  could not fsync segment 0 of relation 1663/16385/1369401:
Permission denied
   4520 LOG:  could not fsync segment 0 of relation 1663/16385/1373027:
Permission denied
   1024 LOG:  could not fsync segment 0 of relation 1663/16385/1374375:
Permission denied
   2683 LOG:  could not fsync segment 0 of relation 1663/16385/1375726:
Permission denied
    775 LOG:  could not fsync segment 0 of relation 1663/16385/1375733:
Permission denied
     83 LOG:  could not fsync segment 0 of relation 1663/16385/1377367:
Permission denied
     64 LOG:  could not fsync segment 0 of relation 1663/16385/1377685:
Permission denied
   3334 LOG:  could not fsync segment 0 of relation 1663/16385/1379641:
Permission denied
     16 LOG:  could not fsync segment 0 of relation 1663/16385/1381290:
Permission denied
    819 LOG:  could not fsync segment 0 of relation 1663/16385/1383833:
Permission denied
    347 LOG:  could not fsync segment 0 of relation 1663/16385/1386037:
Permission denied
      1 LOG:  could not fsync segment 0 of relation 1663/16385/1388257:
Permission denied
    135 LOG:  could not fsync segment 0 of relation 1663/16385/1388264:
Permission denied
   6397 LOG:  could not fsync segment 0 of relation 1663/16385/1389813:
Permission denied
   3719 LOG:  could not fsync segment 0 of relation 1663/16385/1391589:
Permission denied
    763 LOG:  could not fsync segment 0 of relation 1663/16385/1391610:
Permission denied
  10784 LOG:  could not fsync segment 0 of relation 1663/16385/1392439:
Permission denied
   5846 LOG:  could not fsync segment 0 of relation 1663/16385/1392444:
Permission denied
      1 LOG:  could not fsync segment 0 of relation 1663/16385/2282587:
Permission denied
    243 LOG:  could not fsync segment 0 of relation 1663/16385/2282602:
Permission denied
    526 LOG:  could not fsync segment 0 of relation 1663/16385/2293690:
Permission denied
   1754 LOG:  could not fsync segment 0 of relation 1663/16385/2293695:
Permission denied
    894 LOG:  could not fsync segment 0 of relation 1663/16385/2304886:
Permission denied
    692 LOG:  could not fsync segment 0 of relation 1663/16385/2304891:
Permission denied
   1556 LOG:  could not fsync segment 0 of relation 1663/16385/2315303:
Permission denied


>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 2:30 am >>>
Did they persist up till that first database crash at 14:31?  What was
the immediate cause of the crash (I'm expecting a PANIC or possibly an
Assert triggered it)?

BTW, what sort of filesystem is the database sitting on, on this machine?

Re: right sibling is not next child

From
"Peter Brant"
Date:
Sounds good.

There is nothing sensitive in DbTranImageStatus_pkey so if you decide
you want it after all, it's there for the asking.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 3:30 am >>>
Oh, never mind ... I've sussed it.

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> It turns out we've been getting rather huge numbers of "Permission
> denied" errors relating to fsync so perhaps it wasn't really a precursor
> to the crash as I'd previously thought.

> I've pasted in a complete list following this email covering the time
> span from 3/20 to 4/6.  The number in the first column is the number of
> times the given log message appeared.

Wow.  What was happening to your pg_xlog directory while this was going
on?  I would expect that the system would plow ahead after this error,
but having failed to complete the checkpoint, it would never be able to
free any back WAL segments.  Were you accumulating lots of gigabytes of
WAL files?  Or maybe the errors came and went, so that sometimes you
could get through a checkpoint?

> The interesting thing is that _none_ of the referenced relfilenode
> numbers actually appear in the file system.

Could they have been temporary tables?  Alternatively, if you routinely
use TRUNCATE, CLUSTER, or REINDEX (all of which assign new relfilenode
numbers), then maybe they were older versions of tables that still
exist.

>   - The file system is NTFS

OK, anyone know anything about permissions on NTFS?

            regards, tom lane

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
"Peter Brant"
Date:
The culprit is CLUSTER.  There is a batch file which runs CLUSTER
against six relatively small (60k rows between them) tables at 7am,
1pm, and 9pm.  Following is the list of dates and hours when the
"Permission denied" errors showed up.  They match up to a tee (although
the error apparently sometimes persists for a while).

The machine is clean (basically just Windows + Postgres [no AV,
firewall, etc. software]).

Pete

2006-03-20 21
2006-03-21 07
2006-03-22 21
2006-03-23 21
2006-03-23 22
2006-03-24 13
2006-03-24 21
2006-03-24 22
2006-03-26 13
2006-03-27 13
2006-03-27 21
2006-03-27 22
2006-03-28 13
2006-03-28 21
2006-03-29 13
2006-03-29 21
2006-03-30 13
2006-03-30 14
2006-03-30 15
2006-03-30 21
2006-03-30 22
2006-03-31 07
2006-03-31 08
2006-03-31 09
2006-03-31 10
2006-03-31 11
2006-03-31 12
2006-03-31 13
2006-04-03 21
2006-04-04 07
2006-04-05 07
2006-04-05 21


>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 8:30 pm >>>
> The interesting thing is that _none_ of the referenced relfilenode
> numbers actually appear in the file system.

Could they have been temporary tables?  Alternatively, if you routinely
use TRUNCATE, CLUSTER, or REINDEX (all of which assign new relfilenode
numbers), then maybe they were older versions of tables that still
exist.

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> The culprit is CLUSTER.  There is a batch file which runs CLUSTER
> against six, relatively small (60k rows between them) tables at 7am,
> 1pm, and 9pm.  Follows is the list of dates and hours when the
> "Permission denied" errors showed up.  They match up to a tee (although
> the error apparently sometimes persists for a while).

OK ... but what's still unclear is whether the failures are occurring
against the old relfilenode (the one just removed by the CLUSTER) or the
new one just added by CLUSTER.  If you note the relfilenodes assigned to
these tables just before and just after the next cycle of CLUSTERs, it
should be easy to tell what the complaints refer to.

            regards, tom lane

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
"Peter Brant"
Date:
Apparently we got lucky on all four servers with the latest cycle, so
nothing to report.  Load (both reading and writing) is quite light today
so perhaps the bug is only triggered under a higher load.  It seems the
problem typically doesn't show up on weekends either (when load is also
much lighter for us).

In any case, we're logging the relfilenodes before and after now, so
I'll post again when the problem crops up next week.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/14/06 2:41 am >>>
OK ... but what's still unclear is whether the failures are occurring
against the old relfilenode (the one just removed by the CLUSTER) or the
new one just added by CLUSTER.  If you note the relfilenodes assigned to
these tables just before and just after the next cycle of CLUSTERs, it
should be easy to tell what the complaints refer to.

            regards, tom lane

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
"Peter Brant"
Date:
The error messages refer to the old relfilenode (in 3 out of 3
occurrences today).

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/14/06 2:41 am >>>
OK ... but what's still unclear is whether the failures are occurring
against the old relfilenode (the one just removed by the CLUSTER) or the
new one just added by CLUSTER.  If you note the relfilenodes assigned to
these tables just before and just after the next cycle of CLUSTERs, it
should be easy to tell what the complaints refer to.

            regards, tom lane

Re: Permission denied on fsync / Win32 (was right sibling is not next child)

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> The error messages refer to the old relfilenode (in 3 out of 3
> occurrences today).

So it'd seem the problem is with fsync on recently-deleted files.
Is it possible that we are getting EACCES (ERROR_SHARING_VIOLATION
maybe) in the situation where we try to fsync a file that's been
unlinked but isn't fully gone yet due to open handles?

            regards, tom lane
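
One way to check that empirically would be a small Win32 probe along these
lines (a hypothetical test program, not part of PostgreSQL; it simply reports
whatever error code a reopen-by-name gets while a delete is pending, which is
roughly the situation a checkpoint faces when it must open a just-removed
segment in order to fsync it):

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    const char *name = "fsync_probe.tmp";
    HANDLE      h1;
    HANDLE      h2;

    /* create the file, keeping a handle open, and allow delete sharing */
    h1 = CreateFileA(name, GENERIC_READ | GENERIC_WRITE,
                     FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                     NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h1 == INVALID_HANDLE_VALUE)
    {
        printf("initial CreateFile failed: error %lu\n", GetLastError());
        return 1;
    }

    /* unlink it while the handle is still open; the delete is now pending */
    if (!DeleteFileA(name))
        printf("DeleteFile failed: error %lu\n", GetLastError());

    /* try to open it again by name and report the raw error code */
    h2 = CreateFileA(name, GENERIC_READ | GENERIC_WRITE,
                     FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                     NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h2 == INVALID_HANDLE_VALUE)
        printf("reopen while delete pending failed: error %lu\n", GetLastError());
    else
    {
        printf("reopen unexpectedly succeeded\n");
        CloseHandle(h2);
    }

    CloseHandle(h1);
    return 0;
}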