Thread: right sibling is not next child
I'm reporting this as a PostgreSQL bug because it involves index corruption, and I can't see any other way our application should be able to corrupt an index. I will attach the tail of the log from when the corruption was detected (and the postmaster shut itself down), as well as the subsequent attempt to start.

Fortunately we run our web site off of a farm of four database servers, so we are taking one of the others out of the mix, stopping its postmaster, and copying its data directory over to this machine for recovery; we don't need advice on that aspect of things, but we'd like to do what we can to help track down the cause and prevent a recurrence. We have renamed the data directory to make room for recovery at the normal location, but otherwise the failing data directory structure is unmodified.

For context, this is running on Windows 2003 Server: an eight-Xeon box, no HT, 6 GB RAM, with a 13-drive 15,000 RPM RAID 5 array behind a battery-backed controller for everything. The database is about 180 GB with about 300 tables. We have autovacuum running every ten seconds because of a few very small tables with very high update rates, and we have a scheduled VACUUM ANALYZE VERBOSE every night. It appears that last night's vacuum found the problem, which the previous night's vacuum didn't.

We had some event which started at 14:25 yesterday and persisted until we restarted the middle tier at 15:04. The symptom was that a fraction of the queries which normally run in a few ms were timing out on a 20 second limit. pg_locks showed no blocking. We've been getting episodes with these symptoms occasionally, but they have only lasted a minute or two; this duration was unusual. We haven't identified a cause. One odd thing is that, given the number of queries per second that we run, the number of timeouts during an episode is too small to support the notion that _all_ similar queries are failing.

How best to proceed?

-Kevin
Sorry about the delay in responding. We had a bit of difficulty with the test machine. Kevin is also on vacation this week.

The problem is repeatable with a VACUUM. I've found the offending block; a (partial) pg_filedump of that block is pasted in below. I'm a little lost as to what the next step is, though. Any pointers would be appreciated.

Pete

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> [2006-04-06 02:19:57.460 ] 3848 <postgres bigbird 127.0.0.1(3944)> PANIC:
> right sibling is not next child in "Panel_pkey"

This should be repeatable by re-attempting a VACUUM, right? Please find out which page exactly it's unhappy about (either gdb the crash or add a printout of the "parent" variable to the elog call in nbtpage.c), then pg_filedump the index and look to see what the index contains.

			regards, tom lane

PANIC: right sibling is not next child in "Panel_pkey", parent is 271

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -R 271 -i
*
* Dump created on: Tue Apr 11 13:38:44 2006
*******************************************************************
Block  271 ********************************************************
<Header> -----
 Block Offset: 0x0021e000     Offsets: Lower     468 (0x01d4)
 Block: Size 8192  Version 3           Upper    1952 (0x07a0)
 LSN:  logid 246 recoff 0x265d47c0     Special  8176 (0x1ff0)
 Items: 112                Free Space: 1484
 Length (including item array): 472

<Data> ------
 Item   1 -- Length:  56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 132  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item   2 -- Length:   8  Offset: 8112 (0x1fb0)  Flags: USED
  Block Id: 201  linp Index: 1  Size: 8  Has Nulls: 0  Has Varwidths: 0
 Item   3 -- Length:  56  Offset: 8056 (0x1f78)  Flags: USED
  Block Id: 257  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384

 ... snip ...

 Item 109 -- Length:  56  Offset: 4920 (0x1338)  Flags: USED
  Block Id: 121  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item 110 -- Length:  56  Offset: 4864 (0x1300)  Flags: USED
  Block Id: 133  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item 111 -- Length:  56  Offset: 4808 (0x12c8)  Flags: USED
  Block Id: 134  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item 112 -- Length:  56  Offset: 4752 (0x1290)  Flags: USED
  Block Id: 137  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384

<Special Section> -----
 BTree Index Section:
  Flags: 0x0000 ()
  Blocks: Previous (167)  Next (455)  Level (1)

*** End of Requested Range Encountered. Last Block Read: 271 ***
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > PANIC: right sibling is not next child in "Panel_pkey", parent is 271 Hmm ... that's not actually enough info to tell us where to look, is it :-(. Please add the following variables to the elog message, or gdb for them if you can: target rightsib nextoffset regards, tom lane
Try as I might, I wasn't able to get a JIT debugger to give me a memory dump. It seems like postgres.exe is not really crashing in the "unhandled exception" sense (see gdb log below)? Am I missing a configure option? (As an aside, what's the best way to get a core dump on Windows? Can gdb read memory dumps produced by NTSD?)

I was able to attach to the running backend using gdb. That worked (kind of, see log below). I ended up modifying the elog again, with the following result:

PANIC: right sibling is not next child in "Panel_pkey", parent is 271, target is 635, rightsib is 629, nextoffset is 87

C:\WINDOWS\system32>C:\mingw\bin\gdb C:\pgsql\bin\postgres.exe
... snip ...
(gdb) attach 2084
Attaching to program `C:\pgsql\bin\postgres.exe', process 2084
[Switching to thread 2084.0x930]
(gdb) break nbtpage.c:983
Breakpoint 1 at 0x41f042: file nbtpage.c, line 983.
(gdb) continue
Continuing.
[Switching to thread 2084.0x81c]

Breakpoint 1, _bt_pagedel (rel=0x196bbb0, buf=1089, vacuum_full=0 '\0') at nbtpage.c:983
983     nbtpage.c: No such file or directory.
        in nbtpage.c
(gdb) print parent
$1 = 271
(gdb) print target
$2 = 635
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) print nextoffset
$3 = 87
(gdb) print leftsib
$4 = 636
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) continue
Continuing.

Program exited with code 03.
(gdb)

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/11/06 9:19 pm >>>
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> PANIC: right sibling is not next child in "Panel_pkey", parent is 271

Hmm ... that's not actually enough info to tell us where to look, is it :-(. Please add the following variables to the elog message, or gdb for them if you can:

	target
	rightsib
	nextoffset

			regards, tom lane
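For readers following along: the elog change being discussed amounts to adding the extra variables to the existing PANIC call in _bt_pagedel(). The fragment below is only a reconstruction from the PANIC line quoted above, not the actual patch; the surrounding condition is paraphrased and may not match the 8.1 source of nbtpage.c exactly.

/*
 * Sketch (not the actual patch) of the extended consistency-check message
 * in _bt_pagedel(), src/backend/access/nbtree/nbtpage.c.  The format string
 * is reconstructed from the log output above.
 */
elog(PANIC, "right sibling is not next child in \"%s\", "
            "parent is %u, target is %u, rightsib is %u, nextoffset is %u",
     RelationGetRelationName(rel),
     parent, target, rightsib, (unsigned int) nextoffset);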
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > I ended up modifying the elog again with the following results: > PANIC: right sibling is not next child in "Panel_pkey", parent is 271, > target is 635, rightsib is 629, nextoffset is 87 OK, so the part of the pg_filedump info we need to see is items 86/87 in block 271, plus a few around them for good measure. Are the values of the keys in this index sensitive data, or are they just serial numbers of some kind? If we could see pg_dump -i -f rather than just -i, it might help. I'd like to see the dumps of pages 635, 636, 629 as well (mostly their "special space" contents). regards, tom lane
The index data isn't sensitive, but I should ask for permission nonetheless. I'll send over the '-f' output tomorrow morning.

Pete

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 271
*
* Dump created on: Tue Apr 11 18:22:24 2006
*******************************************************************
Block  271 ********************************************************
<Header> -----
 Block Offset: 0x0021e000     Offsets: Lower     468 (0x01d4)
 Block: Size 8192  Version 3           Upper    1952 (0x07a0)
 LSN:  logid 246 recoff 0x265d47c0     Special  8176 (0x1ff0)
 Items: 112                Free Space: 1484
 Length (including item array): 472

<Data> ------
 ... snip ...
 Item  80 -- Length:  56  Offset: 3352 (0x0d18)  Flags: USED
  Block Id: 581  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  81 -- Length:  56  Offset: 3128 (0x0c38)  Flags: USED
  Block Id: 580  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  82 -- Length:  56  Offset: 2848 (0x0b20)  Flags: USED
  Block Id: 606  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  83 -- Length:  56  Offset: 2736 (0x0ab0)  Flags: USED
  Block Id: 608  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  84 -- Length:  56  Offset: 2792 (0x0ae8)  Flags: USED
  Block Id: 601  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  85 -- Length:  56  Offset: 2120 (0x0848)  Flags: USED
  Block Id: 640  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  86 -- Length:  56  Offset: 2176 (0x0880)  Flags: USED
  Block Id: 635  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  87 -- Length:  56  Offset: 2232 (0x08b8)  Flags: USED
  Block Id: 636  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  88 -- Length:  56  Offset: 2288 (0x08f0)  Flags: USED
  Block Id: 635  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  89 -- Length:  56  Offset: 2400 (0x0960)  Flags: USED
  Block Id: 629  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  90 -- Length:  56  Offset: 5704 (0x1648)  Flags: USED
  Block Id: 166  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  91 -- Length:  56  Offset: 5648 (0x1610)  Flags: USED
  Block Id: 339  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  92 -- Length:  56  Offset: 5592 (0x15d8)  Flags: USED
  Block Id: 372  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  93 -- Length:  56  Offset: 4416 (0x1140)  Flags: USED
  Block Id: 480  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 Item  94 -- Length:  56  Offset: 5536 (0x15a0)  Flags: USED
  Block Id: 208  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384
 ... snip ...

<Special Section> -----
 BTree Index Section:
  Flags: 0x0000 ()
  Blocks: Previous (167)  Next (455)  Level (1)

*** End of Requested Range Encountered. Last Block Read: 271 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 635
*
* Dump created on: Tue Apr 11 18:22:43 2006
*******************************************************************
Block  635 ********************************************************
<Header> -----
 Block Offset: 0x004f6000     Offsets: Lower      24 (0x0018)
 Block: Size 8192  Version 3           Upper    8120 (0x1fb8)
 LSN:  logid 239 recoff 0xcabae788     Special  8176 (0x1ff0)
 Items: 1                  Free Space: 8096
 Length (including item array): 28

<Data> ------
 Item   1 -- Length:  56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 452  linp Index: 1  Size: 56  Has Nulls: 0  Has Varwidths: 16384

<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (636)  Next (629)  Level (0)

*** End of Requested Range Encountered. Last Block Read: 635 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 636
*
* Dump created on: Tue Apr 11 18:22:50 2006
*******************************************************************
Block  636 ********************************************************
<Header> -----
 Block Offset: 0x004f8000     Offsets: Lower      24 (0x0018)
 Block: Size 8192  Version 3           Upper    8120 (0x1fb8)
 LSN:  logid 245 recoff 0x9bfa3b60     Special  8176 (0x1ff0)
 Items: 1                  Free Space: 8096
 Length (including item array): 28

<Data> ------
 Item   1 -- Length:  56  Offset: 8120 (0x1fb8)  Flags: USED
  Block Id: 454  linp Index: 27  Size: 56  Has Nulls: 0  Has Varwidths: 16384

<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (640)  Next (635)  Level (0)

*** End of Requested Range Encountered. Last Block Read: 636 ***

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used: -i -R 629
*
* Dump created on: Tue Apr 11 18:23:01 2006
*******************************************************************
Block  629 ********************************************************
<Header> -----
 Block Offset: 0x004ea000     Offsets: Lower     244 (0x00f4)
 Block: Size 8192  Version 3           Upper    5040 (0x13b0)
 LSN:  logid 239 recoff 0xcabae788     Special  8176 (0x1ff0)
 Items: 56                 Free Space: 4796
 Length (including item array): 248

<Data> ------
 ... snip ...

<Special Section> -----
 BTree Index Section:
  Flags: 0x0001 (LEAF)
  Blocks: Previous (635)  Next (166)  Level (0)

*** End of Requested Range Encountered. Last Block Read: 629 ***

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 1:06 am >>>
OK, so the part of the pg_filedump info we need to see is items 86/87 in block 271, plus a few around them for good measure. Are the values of the keys in this index sensitive data, or are they just serial numbers of some kind? If we could see pg_filedump -i -f rather than just -i, it might help. I'd like to see the dumps of pages 635, 636, 629 as well (mostly their "special space" contents).

			regards, tom lane
Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE, it doesn't even get to Panel and errors out with:

INFO:  analyzing "public.MaintCode"
INFO:  "MaintCode": scanned 1 of 1 pages, containing 19 live rows and 0 dead rows; 19 rows in sample, 19 estimated total rows
ERROR:  duplicate key violates unique constraint "pg_statistic_relid_att_index"

MaintCode is a tiny, static table. Does that indicate further corruption, or is there a more benign explanation? Could these errors not be Postgres' fault (e.g. hardware-related)?

Pete
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE it > actually doesn't even get to Panel and errors out with: > ERROR: duplicate key violates unique constraint > "pg_statistic_relid_att_index" Hm, my eyebrows just disappeared over the back of my head somewhere. Is that repeatable? Does a REINDEX on pg_statistic make it go away? regards, tom lane
It is repeatable. A reindex doesn't work.

Pete

bigbird=# vacuum analyze "MaintCode";
ERROR:  duplicate key violates unique constraint "pg_statistic_relid_att_index"
bigbird=# vacuum analyze verbose "MaintCode";
INFO:  vacuuming "public.MaintCode"
INFO:  index "MaintCode_pkey" now contains 19 row versions in 2 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "MaintCode": found 0 removable, 19 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_90472"
INFO:  index "pg_toast_90472_index" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_90472": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.MaintCode"
INFO:  "MaintCode": scanned 1 of 1 pages, containing 19 live rows and 0 dead rows; 19 rows in sample, 19 estimated total rows
ERROR:  duplicate key violates unique constraint "pg_statistic_relid_att_index"
bigbird=# reindex table pg_statistic;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.
bigbird=# select * from "MaintCode";
 maintCode | colorAttrb |           descr           | isActive | maintPriority | stopsTheClock | videoAttrb | wcisSpecStatus
-----------+------------+---------------------------+----------+---------------+---------------+------------+----------------
 AA        | R          | Annual accounts           | t        |            75 | t             | N          |
 AP        | R          | Appeal                    | t        |            30 | f             | N          |
 AW        | R          | Awaiting Papers           | t        |            80 | f             | N          |
 BA        | R          | Bankruptcy                | t        |            20 | t             | N          | S05
 CI        | R          | Case Inactive             | t        |            60 | f             | N          | S03
 CS        | R          | Consent decree            | t        |            90 | t             | N          |
 DP        | R          | Deferred judgt./prosecute | t        |            40 | t             | N          | S07
 EF        | R          | Electronic filing         | t        |            33 | f             | N          |
 ES        | R          | Extension                 | t        |            55 | t             | N          |
 EX        | R          | Expungement               | t        |             5 | f             | N          |
 FX        | R          | Future expungement        | t        |            93 | f             | N          |
 IN        | R          | Interpreter               | t        |            53 | f             | N          |
 JR        | R          | Judicial review           | t        |            75 | t             | N          |
 RO        | R          | Reopen                    | t        |            35 | f             | N          |
 SL        | R          | Sealed Record             | t        |            85 | f             | N          |
 SM        | R          | Sentence modification     | t        |            43 | f             | N          |
 SU        | R          | Case suspended            | t        |            45 | t             | N          |
 UA        | R          | Under Advisement          | t        |            50 | f             | N          | S02
 WA        | R          | Warrant issued            | t        |            10 | t             | F          | S04
(19 rows)

bigbird=#

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 5:00 am >>>
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE, it
> doesn't even get to Panel and errors out with:
> ERROR: duplicate key violates unique constraint
> "pg_statistic_relid_att_index"

Hm, my eyebrows just disappeared over the back of my head somewhere. Is that repeatable? Does a REINDEX on pg_statistic make it go away?

			regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > bigbird=# vacuum analyze "MaintCode"; > ERROR: duplicate key violates unique constraint > "pg_statistic_relid_att_index" Hm, can you see any rows in pg_statistic with duplicate values of (starelid, staattnum)? If so it'd be useful to look at their ctid/xmin/xmax/cmin/cmax values. regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > Item 85 -- Length: 56 Offset: 2120 (0x0848) Flags: USED > Block Id: 640 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 > Item 86 -- Length: 56 Offset: 2176 (0x0880) Flags: USED > Block Id: 635 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 > Item 87 -- Length: 56 Offset: 2232 (0x08b8) Flags: USED > Block Id: 636 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 > Item 88 -- Length: 56 Offset: 2288 (0x08f0) Flags: USED > Block Id: 635 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 > Item 89 -- Length: 56 Offset: 2400 (0x0960) Flags: USED > Block Id: 629 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 > Item 90 -- Length: 56 Offset: 5704 (0x1648) Flags: USED > Block Id: 166 linp Index: 1 Size: 56 > Has Nulls: 0 Has Varwidths: 16384 Well, that's pretty dang interesting. How did block 635 get to be listed twice? The sibling links say that the correct sequence is 640, 636, 635, 629, 166 ... so something screwed up the parent level's keys. What would be most useful at this point is to look at the keys in these entries, and compare them to the "high keys" (item 1) of the individual leaf pages. I'm wondering what key is in that extra entry for 635 ... Did you get permission to show us the keys? regards, tom lane
I can't find any duplicates?!? The query

select starelid, staattnum, ctid, xmin, xmax, cmin, cmax
  from pg_statistic p1
 where (select count(*)
          from pg_statistic p2
         where p1.starelid = p2.starelid
           and p1.staattnum = p2.staattnum) > 1

doesn't turn up anything. Nor does dumping "select starelid, staattnum from pg_statistic" to a file and using sort | uniq -c.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 4:37 pm >>>
Hm, can you see any rows in pg_statistic with duplicate values of (starelid, staattnum)? If so, it'd be useful to look at their ctid/xmin/xmax/cmin/cmax values.

			regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > I can't find any duplicates?!? Weirder and weirder. Maybe the table is OK but the index is corrupt? Could it be another symptom of the same problem we're seeing in the Panel_pkey index? I'm currently theorizing that that index might've been corrupted during WAL replay (specifically, if the replayer somehow failed to recognize a split completion, btree_xlog_cleanup would've inserted an extra entry) and one could certainly imagine multiple indexes getting corrupted in the same fashion at the same time. You did say that this problem showed up shortly after a database crash, right? Could you send me a copy of pg_statistic_relid_att_index, off-list (the actual disk file, not a pg_filedump)? There's nothing but table OIDs and attribute numbers in it, so I can't see any reason anyone would consider it sensitive data. I've got some really crude code laying about for scanning an index and looking for inconsistencies ... it's too ugly to give out, but I'd like to see if it can find anything wrong with that index. (I'll probably ask for a copy of Panel_pkey for the same purpose, if you get permission to show us its keys.) regards, tom lane
Per the DBAs, there hadn't been any recent crashes before last Thursday. A "vacuum analyze verbose" discovered the problem early Thursday morning. After the PANIC, the database never came back up (the heap_clean_redo: no block / full_page_writes = off problem).

One thing that seems strange to me is that the original crash on Thursday failed on Panel_pkey, but my "vacuum analyze verbose" on a copy of the crashed database failed on MaintCode / pg_statistic_relid_att_index.

I'll send over pg_statistic_relid_att_index and Panel_pkey. Showing the keys of Panel_pkey is no problem.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 6:03 pm >>>
You did say that this problem showed up shortly after a database crash, right?

Could you send me a copy of pg_statistic_relid_att_index, off-list (the actual disk file, not a pg_filedump)? There's nothing but table OIDs and attribute numbers in it, so I can't see any reason anyone would consider it sensitive data. I've got some really crude code laying about for scanning an index and looking for inconsistencies ... it's too ugly to give out, but I'd like to see if it can find anything wrong with that index. (I'll probably ask for a copy of Panel_pkey for the same purpose, if you get permission to show us its keys.)

			regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > One thing that seems strange to me is that the original crash on > Thursday failed on Panel_pkey, but my "vacuum analyze verbose" on a copy > of the crashed database failed on MaintCode / > pg_statistic_relid_att_index. I can't find anything particularly wrong with pg_statistic_relid_att_index. There are two index entries for many of the key values, but that's unsurprising if a database-wide vacuum analyze had been done or in progress --- just-obsoleted pg_statistic rows wouldn't have been vacuumed away yet. I'm inclined to think that this problem is a side-effect of the fact that you had to use pg_resetxlog; there was probably an update to pg_statistic that got lost. If you want to continue experimenting with this database, I'd suggest TRUNCATE'ing pg_statistic and rebuilding it via another VACUUM ANALYZE run. > I'll send over pg_statistic_relid_att_index and Panel_pkey. Showing > the keys to Panel_pkey is no problem. My little index printing tool shows these entries in Panel_pkey at btree level zero: ... 40 2006 RES032706 CH0327RES high key on page 606 40 2006 RES032706 RES032706 383 11 40 2006 RES032706 RES032706 high key on page 608 40 2006 RES040306 CC0403RES 507 14 40 2006 RES040306 CCC0403RES 551 1 40 2006 RES040306 CCC0403RES high key on page 601 40 2006 RES040306 RES040306 500 1 40 2006 RES040306 RES040306 high key on page 640 40 2006 RES040306 RES040306 high key on page 636 40 2006 RES040306 RES040306 high key on page 635 41 0001 2000POOL 0001 159 3 41 0001 2000POOL 0002 159 4 41 0001 2000POOL 0003 159 5 ... (The first four columns are the key values of this index; the last two are the pointed-to heap tuple's page/line location. High keys are printed after any data keys on the page.) The down-links at level one look like: ... 40 2006 RES032706 CCC0327RES 606 1 40 2006 RES032706 CH0327RES 608 1 40 2006 RES032706 RES032706 601 1 40 2006 RES040306 CCC0403RES 640 1 40 2006 RES040306 RES040306 635 1 40 2006 RES040306 RES040306 636 1 40 2006 RES040306 RES040306 635 1 40 2006 RES040306 RES040306 629 1 41 0405 0405 0105 166 1 ... This is fairly interesting because we've got three pages with the same boundary key. The bogus entry for page 635 has been inserted where you'd expect it to get inserted if the insertion were being done on the basis of key comparison. (We usually insert a new entry in front of any ones with the same key.) But we never do insertions on non-leaf pages by key comparison! Upper-level entries are only made during page splits, by _bt_insert_parent, and that always works by locating the down-link to the page just split and adding the new entry just after it. One thing I realize after seeing this is that the index corruption might be of long standing: the way that the btree search algorithms work, no search would ever have descended directly to 635 or 636, but instead to 640 and then traverse right from there. So you'd not have noticed any malfunction, until the time came to try to delete the page. That means we shouldn't assume that the problem was created recently. Now that we know the problem is associated with keys '40 2006 RES040306 RES040306', can you check your logs and see if anything interesting happened around the time those keys were being inserted? (Note: if Panel_pkey really is a primary key, ie unique, it might seem odd for there to be several pages worth of identical entries --- this would imply that the same record had been updated several hundred times between two vacuums.) 
I still kinda like the theory that the extra key got in there because btree_xlog_cleanup fired inappropriately ... mainly because I can't think of any other theories ... but I'm still baffled about the details. Anyone have any ideas? regards, tom lane
The middle tier transaction log indicates this record was inserted into the county database at 2006-03-31 21:00:32.94. It would have hit the central databases sometime thereafter (more or less immediately if all was well). The Panel table contains some running statistics which are updated frequently, so a bunch of updates to the same record isn't unexpected.

Sure enough, the log on the 31st looks pretty ugly. A few selections (this still would have been with full_page_writes = off):

(a bunch of these)
[2006-03-31 13:00:01.720 ] 2636 LOG: could not fsync segment 0 of relation 1663/16385/1392439: Permission denied
[2006-03-31 13:00:01.720 ] 2636 ERROR: storage sync failed on magnetic disk: Permission denied

[2006-03-31 14:31:05.314 ] 608 LOG: database system was not properly shut down; automatic recovery in progress

[2006-03-31 14:50:54.828 ] 4052 WARNING: relation "DbTranImageStatus" page 130 is uninitialized --- fixing
[2006-03-31 14:50:55.000 ] 4052 PANIC: right sibling is not next child in "DbTranImageStatus_pkey"
(as part of an autovacuum)

[2006-03-31 22:13:39.140 ] 2816 <postgres bigbird 127.0.0.1(1658)> WARNING: relation "Assess" page 259792 is uninitialized --- fixing
[2006-03-31 22:19:18.484 ] 2816 <postgres bigbird 127.0.0.1(1658)> WARNING: relation "Charge" page 338120 is uninitialized --- fixing
[2006-03-31 23:16:55.671 ] 2816 <postgres bigbird 127.0.0.1(1658)> WARNING: relation "Case" page 391222 is uninitialized --- fixing

and finally

[2006-03-31 23:54:15.859 ] 2816 <postgres bigbird 127.0.0.1(1658)> PANIC: right sibling is not next child in "Panel_pkey"

It looks like this has been happening with the nightly "vacuum analyze" since then, but nobody noticed until "heap_update_redo: no block" took things down for the count.

Thanks very much for your efforts. If nothing else, we've learned a valuable lesson about keeping a better eye on the logs. :-(

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/12/06 10:28 pm >>>
One thing I realize after seeing this is that the index corruption might be of long standing: the way the btree search algorithms work, no search would ever have descended directly to 635 or 636, but instead to 640, traversing right from there. So you'd not have noticed any malfunction until the time came to try to delete the page. That means we shouldn't assume that the problem was created recently.

Now that we know the problem is associated with keys '40 2006 RES040306 RES040306', can you check your logs and see if anything interesting happened around the time those keys were being inserted? (Note: if Panel_pkey really is a primary key, ie unique, it might seem odd for there to be several pages worth of identical entries --- this would imply that the same record had been updated several hundred times between two vacuums.)

I still kinda like the theory that the extra key got in there because btree_xlog_cleanup fired inappropriately ... mainly because I can't think of any other theories ... but I'm still baffled about the details. Anyone have any ideas?

			regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > (a bunch of these) > [2006-03-31 13:00:01.720 ] 2636 LOG: could not fsync segment 0 of > relation 1663/16385/1392439: Permission denied > [2006-03-31 13:00:01.720 ] 2636 ERROR: storage sync failed on magnetic > disk: Permission denied Yoi. I think we have seen this (ie, bogus "permission denied" write errors) happen before on Windows, but no one's quite figured out why. How many is "a bunch", and what time period did they continue over? (I think you probably would have gotten one per attempted checkpoint, but want to confirm.) Did they persist up till that first database crash at 14:31? What was the immediate cause of the crash (I'm expecting a PANIC or possibly an Assert triggered it)? BTW, what sort of filesystem is the database sitting on, on this machine? > [2006-03-31 14:50:54.828 ] 4052 WARNING: relation "DbTranImageStatus" > page 130 is uninitialized --- fixing I'd like to think that all of these are accounted for by the "permission denied" write failures, but it's hard to tell unless you have log entries for permission failures on all of their OIDs. (BTW, PG would probably still have recovered from that, if not for the full_page_writes = off bug.) > [2006-03-31 14:50:55.000 ] 4052 PANIC: right sibling is not next child > in "DbTranImageStatus_pkey" (as part of an autovacuum) Hmph. So there are (at least) two occurrences of this in the DB. Does that index contain any sensitive data, and if not could I trouble you for a copy? I'm still not clear on the mechanism by which the indexes got corrupted like this. It's starting to look like there may have been some system-level misfeasance involved, but I'd still like to understand it in more detail. regards, tom lane
I wrote:
> Does that index contain any sensitive data, and if not could I trouble
> you for a copy? I'm still not clear on the mechanism by which the
> indexes got corrupted like this.

Oh, never mind ... I've sussed it.

nbtxlog.c's forget_matching_split() assumes it can look into the page that was just updated to get the block number associated with a non-leaf insertion. This is OK *only if the page has exactly its state at the time of the WAL record*. However, btree_xlog_insert() is coded to do nothing if the page has an LSN larger than the WAL record's LSN --- that is, if the page reflects a state *later than* this insertion. So if the page is newer than that --- say, there were some subsequent insertions at earlier positions in the page --- forget_matching_split() would pick up the wrong downlink and hence fail to erase the pending split it should have erased.

I believe this bug is only latent whenever full_page_writes = on, because in that situation the first touch of any index page after a checkpoint will rewrite the whole page, and so we'll never be looking at an index page state newer than the WAL record. That explains why no one has tripped over it before.

The particular case we are looking at in Panel_pkey seems to require some additional assumptions to explain the state of the index, but I've got no doubt this is the core of the problem.

Since we're not going to support full_page_writes = off in 8.1.*, there's no need for a back-patched fix, but I'll see about making it safer in HEAD.

			regards, tom lane
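To make the failure mode concrete, here is a small standalone C toy that mimics the sequence just described. It is not PostgreSQL source --- the types, function names, offsets, and block numbers are all invented for illustration --- but it shows how skipping the redo of an insert (because the page is already newer than the WAL record) and then reading the downlink at the record's offset from the current page contents returns the wrong child block.

/*
 * Toy simulation of the replay hazard described above.  Nothing here is
 * real PostgreSQL code; all names and numbers are made up.
 */
#include <stdio.h>

#define MAX_ITEMS 8

typedef struct {
    unsigned lsn;              /* LSN of the last change applied to this page */
    int      nitems;
    int      child[MAX_ITEMS]; /* downlink child block number per item */
} ToyPage;

/* Insert a downlink at a given offset, shifting later items right. */
static void toy_insert(ToyPage *pg, int off, int childblk, unsigned lsn)
{
    for (int i = pg->nitems; i > off; i--)
        pg->child[i] = pg->child[i - 1];
    pg->child[off] = childblk;
    pg->nitems++;
    pg->lsn = lsn;
}

/* Replay of one "insert downlink" record, modelled on the 8.1 behaviour. */
static int replay_insert(ToyPage *pg, unsigned rec_lsn, int off, int childblk)
{
    if (rec_lsn > pg->lsn)
        toy_insert(pg, off, childblk, rec_lsn);   /* page older: redo it */
    /* else: page already reflects a later state, so do nothing */

    /*
     * forget_matching_split() analogue: read the child block at the
     * record's offset from the *current* page image.
     */
    return pg->child[off];
}

int main(void)
{
    ToyPage parent = { .lsn = 0, .nitems = 0 };

    /* The on-disk page already reflects LSN 20: downlinks 640, 636, 635. */
    toy_insert(&parent, 0, 640, 10);
    toy_insert(&parent, 1, 635, 15);
    toy_insert(&parent, 1, 636, 20);   /* later insertion at an earlier offset */

    /* Replay the LSN-15 record that originally inserted 635 at offset 1. */
    int seen = replay_insert(&parent, 15, 1, 635);
    printf("replay expected to see child 635 at offset 1, but saw %d\n", seen);
    return 0;
}

Run as written, the replay step is skipped and the lookup returns 636 instead of 635, which is the analogue of forget_matching_split() failing to cancel the pending split it should have cancelled.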
It turns out we've been getting rather huge numbers of "Permission denied" errors relating to fsync, so perhaps it wasn't really a precursor to the crash as I'd previously thought. I've pasted in a complete list following this email, covering the time span from 3/20 to 4/6. The number in the first column is the number of times the given log message appeared. The interesting thing is that _none_ of the referenced relfilenode numbers actually appear in the file system.

In a possibly related note, I've confirmed there is a race condition on Windows when the temporary stats file is renamed to the working one. There is apparently a window where a backend opening the stats file will find it missing. I'll send more info later; I haven't had a chance to come back to it yet.

Regarding your other questions:

- The file system is NTFS.

- Regarding the initial crash, looking more closely, I don't think it was a crash at all (or at most it was a human-induced "crash"). In the log everything looks normal (assuming the Permission denied errors are "normal"), and then

[2006-03-31 14:26:30.705 ] 2328 LOG: received fast shutdown request
[2006-03-31 14:35:33.173 ] 4016 <viewer bigbird 127.0.0.1(3413)> FATAL: the database system is shutting down
[2006-03-31 14:35:33.189 ] 6504 <viewer bigbird 127.0.0.1(3414)> FATAL: the database system is shutting down
....
[2006-03-31 14:39:53.595 ] 7576 <viewer bigbird 165.219.80.39(36220)> FATAL: the database system is shutting down

and in the next log file

[2006-03-31 14:31:05.298 ] 608 LOG: database system was interrupted at 2006-03-31 13:20:06 Central Standard Time
[2006-03-31 14:31:05.314 ] 608 LOG: checkpoint record is at EF/B41D7580
[2006-03-31 14:31:05.314 ] 608 LOG: redo record is at EF/B41A0C08; undo record is at 0/0; shutdown FALSE
[2006-03-31 14:31:05.314 ] 608 LOG: next transaction ID: 295492806; next OID: 1395901
[2006-03-31 14:31:05.314 ] 608 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2006-03-31 14:31:05.314 ] 608 LOG: database system was not properly shut down; automatic recovery in progress

Pete

  747 LOG: could not fsync segment 0 of relation 1663/16385/1361661: Permission denied
  414 LOG: could not fsync segment 0 of relation 1663/16385/1363194: Permission denied
    2 LOG: could not fsync segment 0 of relation 1663/16385/1363196: Permission denied
  441 LOG: could not fsync segment 0 of relation 1663/16385/1369401: Permission denied
 4520 LOG: could not fsync segment 0 of relation 1663/16385/1373027: Permission denied
 1024 LOG: could not fsync segment 0 of relation 1663/16385/1374375: Permission denied
 2683 LOG: could not fsync segment 0 of relation 1663/16385/1375726: Permission denied
  775 LOG: could not fsync segment 0 of relation 1663/16385/1375733: Permission denied
   83 LOG: could not fsync segment 0 of relation 1663/16385/1377367: Permission denied
   64 LOG: could not fsync segment 0 of relation 1663/16385/1377685: Permission denied
 3334 LOG: could not fsync segment 0 of relation 1663/16385/1379641: Permission denied
   16 LOG: could not fsync segment 0 of relation 1663/16385/1381290: Permission denied
  819 LOG: could not fsync segment 0 of relation 1663/16385/1383833: Permission denied
  347 LOG: could not fsync segment 0 of relation 1663/16385/1386037: Permission denied
    1 LOG: could not fsync segment 0 of relation 1663/16385/1388257: Permission denied
  135 LOG: could not fsync segment 0 of relation 1663/16385/1388264: Permission denied
 6397 LOG: could not fsync segment 0 of relation 1663/16385/1389813: Permission denied
 3719 LOG: could not fsync segment 0 of relation 1663/16385/1391589: Permission denied
  763 LOG: could not fsync segment 0 of relation 1663/16385/1391610: Permission denied
10784 LOG: could not fsync segment 0 of relation 1663/16385/1392439: Permission denied
 5846 LOG: could not fsync segment 0 of relation 1663/16385/1392444: Permission denied
    1 LOG: could not fsync segment 0 of relation 1663/16385/2282587: Permission denied
  243 LOG: could not fsync segment 0 of relation 1663/16385/2282602: Permission denied
  526 LOG: could not fsync segment 0 of relation 1663/16385/2293690: Permission denied
 1754 LOG: could not fsync segment 0 of relation 1663/16385/2293695: Permission denied
  894 LOG: could not fsync segment 0 of relation 1663/16385/2304886: Permission denied
  692 LOG: could not fsync segment 0 of relation 1663/16385/2304891: Permission denied
 1556 LOG: could not fsync segment 0 of relation 1663/16385/2315303: Permission denied

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 2:30 am >>>
... crash at 14:31? What was the immediate cause of the crash (I'm expecting a PANIC or possibly an Assert triggered it)?

BTW, what sort of filesystem is the database sitting on, on this machine?
Sounds good. There is nothing sensitive in DbTranImageStatus_pkey, so if you decide you want it after all, it's there for the asking.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 3:30 am >>>
Oh, never mind ... I've sussed it.
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > It turns out we've been getting rather huge numbers of "Permission > denied" errors relating to fsync so perhaps it wasn't really a precursor > to the crash as I'd previously thought. > I've pasted in a complete list following this email covering the time > span from 3/20 to 4/6. The number in the first column is the number of > times the given log message appeared. Wow. What was happening to your pg_xlog directory while this was going on? I would expect that the system would plow ahead after this error, but having failed to complete the checkpoint, it would never be able to free any back WAL segments. Were you accumulating lots of gigabytes of WAL files? Or maybe the errors came and went, so that sometimes you could get through a checkpoint? > The interesting thing is that _none_ of the referenced relfilenode > numbers actually appear in the file system. Could they have been temporary tables? Alternatively, if you routinely use TRUNCATE, CLUSTER, or REINDEX (all of which assign new relfilenode numbers), then maybe they were older versions of tables that still exist. > - The file system is NTFS OK, anyone know anything about permissions on NTFS? regards, tom lane
The culprit is CLUSTER. There is a batch file which runs CLUSTER against six relatively small tables (60k rows between them) at 7am, 1pm, and 9pm. Following is the list of dates and hours when the "Permission denied" errors showed up. They match up to a tee (although the error apparently sometimes persists for a while).

The machine is clean (basically just Windows + Postgres; no AV, firewall, etc. software).

Pete

2006-03-20 21
2006-03-21 07
2006-03-22 21
2006-03-23 21
2006-03-23 22
2006-03-24 13
2006-03-24 21
2006-03-24 22
2006-03-26 13
2006-03-27 13
2006-03-27 21
2006-03-27 22
2006-03-28 13
2006-03-28 21
2006-03-29 13
2006-03-29 21
2006-03-30 13
2006-03-30 14
2006-03-30 15
2006-03-30 21
2006-03-30 22
2006-03-31 07
2006-03-31 08
2006-03-31 09
2006-03-31 10
2006-03-31 11
2006-03-31 12
2006-03-31 13
2006-04-03 21
2006-04-04 07
2006-04-05 07
2006-04-05 21

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/13/06 8:30 pm >>>
> The interesting thing is that _none_ of the referenced relfilenode
> numbers actually appear in the file system.

Could they have been temporary tables? Alternatively, if you routinely use TRUNCATE, CLUSTER, or REINDEX (all of which assign new relfilenode numbers), then maybe they were older versions of tables that still exist.
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > The culprit is CLUSTER. There is a batch file which runs CLUSTER > against six, relatively small (60k rows between them) tables at 7am, > 1pm, and 9pm. Follows is the list of dates and hours when the > "Permission denied" errors showed up. They match up to a tee (although > the error apparently sometimes persists for a while). OK ... but what's still unclear is whether the failures are occurring against the old relfilenode (the one just removed by the CLUSTER) or the new one just added by CLUSTER. If you note the relfilenodes assigned to these tables just before and just after the next cycle of CLUSTERs, it should be easy to tell what the complaints refer to. regards, tom lane
Apparently we got lucky on all four servers with the latest cycle, so nothing to report. Load (both reading and writing) is quite light today, so perhaps the bug is only triggered under a higher load. It seems the problem typically doesn't show up on weekends either (when load is also much lighter for us). In any case, we're logging the relfilenodes before and after now, so I'll post again when the problem crops up again next week.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/14/06 2:41 am >>>
OK ... but what's still unclear is whether the failures are occurring against the old relfilenode (the one just removed by the CLUSTER) or the new one just added by CLUSTER. If you note the relfilenodes assigned to these tables just before and just after the next cycle of CLUSTERs, it should be easy to tell what the complaints refer to.

			regards, tom lane
The error messages refer to the old relfilenode (in 3 out of 3 occurrences today).

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/14/06 2:41 am >>>
OK ... but what's still unclear is whether the failures are occurring against the old relfilenode (the one just removed by the CLUSTER) or the new one just added by CLUSTER. If you note the relfilenodes assigned to these tables just before and just after the next cycle of CLUSTERs, it should be easy to tell what the complaints refer to.

			regards, tom lane
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > The error messages refer to the old relfilenode (in 3 out of 3 > occurrences today). So it'd seem the problem is with fsync on recently-deleted files. Is it possible that we are getting EACCES (ERROR_SHARING_VIOLATION maybe) in the situation where we try to fsync a file that's been unlinked but isn't fully gone yet due to open handles? regards, tom lane