Re: B-tree crash recovery error in 8.3 beta 2 - Mailing list pgsql-bugs
From | Heikki Linnakangas |
---|---|
Subject | Re: B-tree crash recovery error in 8.3 beta 2 |
Date | |
Msg-id | 473D7A41.5040507@enterprisedb.com |
In response to | B-tree crash recovery error in 8.3 beta 2 (Koichi Suzuki <suzuki.koichi@oss.ntt.co.jp>) |
Responses | Re: B-tree crash recovery error in 8.3 beta 2 |
List | pgsql-bugs |
Koichi Suzuki wrote:
> I've found that B-tree crash recovery in 8.3 beta2 could make some
> tuples invisible through B-tree. They're visible if we read using but
> Seq-Scan. This happens in 8.3 beta2, but not in 8.2.4. Here's how it
> happens.
>
> 1. Create b-tree for a text type column.
> 2. Make B-tree three-story, that is, root-intermediate-leaf. Insert
> tuples sufficient to construct such B-tree.
> 3. No checkpoint should occur during 2.
> 4. Kill postmaster.
> 5. Restart postmaster. Crash recovery will be done.
> 6. Tuples with column values less than HIKEY becomes invisible through
> Idx-scan, still visible through Seq-scan.
>
> From the dump of B-tree, it seems that HIKEY value is cleared (only
> tuple header is left). No problem was found in the case of integer or
> numeric type columns.
>
> Attached is the shell script, postgresql.conf (almost the default one)
> to reproduce the problem, and the log of the problem reproduction.

Thanks for the excellent reproducer script!

There seems to be a bug in the B-tree split WAL reduction patch from
February. On split, we copy the HIKEY of the left page from the leftmost
item on the right page, but that doesn't work because the leftmost key
is not stored on intermediate levels.

Patch attached that stores the high key explicitly in the WAL record on
intermediate levels.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.161
diff -c -r1.161 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c	15 Nov 2007 21:14:32 -0000	1.161
--- src/backend/access/nbtree/nbtinsert.c	16 Nov 2007 11:03:46 -0000
***************
*** 1004,1010 ****
  		xl_btree_split xlrec;
  		uint8 xlinfo;
  		XLogRecPtr recptr;
! 		XLogRecData rdata[6];
  		XLogRecData *lastrdata;

  		xlrec.node = rel->rd_node;
--- 1004,1010 ----
  		xl_btree_split xlrec;
  		uint8 xlinfo;
  		XLogRecPtr recptr;
! 		XLogRecData rdata[7];
  		XLogRecData *lastrdata;

  		xlrec.node = rel->rd_node;
***************
*** 1020,1034 ****

  		lastrdata = &rdata[0];

- 		/* Log downlink on non-leaf pages. */
  		if (ropaque->btpo.level > 0)
  		{
  			lastrdata->next = lastrdata + 1;
  			lastrdata++;

  			lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
  			lastrdata->len = sizeof(BlockIdData);
  			lastrdata->buffer = InvalidBuffer;
  		}

  		/*
--- 1020,1044 ----

  		lastrdata = &rdata[0];

  		if (ropaque->btpo.level > 0)
  		{
+ 			/* Log downlink */
  			lastrdata->next = lastrdata + 1;
  			lastrdata++;

  			lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
  			lastrdata->len = sizeof(BlockIdData);
  			lastrdata->buffer = InvalidBuffer;
+
+ 			/* Log high key of left page */
+ 			lastrdata->next = lastrdata + 1;
+ 			lastrdata++;
+
+ 			itemid = PageGetItemId(origpage, P_HIKEY);
+ 			item = (IndexTuple) PageGetItem(origpage, itemid);
+ 			lastrdata->data = (char *) item;
+ 			lastrdata->len = IndexTupleSize(item);
+ 			lastrdata->buffer = InvalidBuffer;
  		}

  		/*
Index: src/backend/access/nbtree/nbtxlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtxlog.c,v
retrieving revision 1.47
diff -c -r1.47 nbtxlog.c
*** src/backend/access/nbtree/nbtxlog.c	15 Nov 2007 21:14:32 -0000	1.47
--- src/backend/access/nbtree/nbtxlog.c	16 Nov 2007 10:59:59 -0000
***************
*** 273,278 ****
--- 273,280 ----
  	OffsetNumber newitemoff = 0;
  	Item newitem = NULL;
  	Size newitemsz = 0;
+ 	Item left_hikey = NULL;	/* only 16-bit aligned */
+ 	Size left_hikeysz = 0;

  	reln = XLogOpenRelation(xlrec->node);

***************
*** 289,294 ****
--- 291,303 ----
  		datalen -= sizeof(BlockIdData);

  		forget_matching_split(xlrec->node, downlink, false);
+
+ 		/* Extract left hikey and its size (still assuming 16-bit alignment) */
+ 		left_hikey = (Item) datapos;
+ 		left_hikeysz = IndexTupleSize(left_hikey);
+
+ 		datapos += left_hikeysz;
+ 		datalen -= left_hikeysz;
  	}

  	/* Extract newitem and newitemoff, if present */
***************
*** 333,338 ****
--- 342,357 ----

  			_bt_restore_page(rpage, datapos, datalen);

+ 			/* On leaf level, the high key of the left page is equal to the
+ 			 * first key on the right page.
+ 			 */
+ 			if (xlrec->level == 0)
+ 			{
+ 				ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
+ 				left_hikey = PageGetItem(rpage, hiItemId);
+ 				left_hikeysz = ItemIdGetLength(hiItemId);
+ 			}
+
  			PageSetLSN(rpage, lsn);
  			PageSetTLI(rpage, ThisTimeLineID);
  			MarkBufferDirty(rbuf);
***************
*** 360,367 ****
  			OffsetNumber maxoff = PageGetMaxOffsetNumber(lpage);
  			OffsetNumber deletable[MaxOffsetNumber];
  			int ndeletable = 0;
- 			ItemId hiItemId;
- 			Item hiItem;

  			/*
  			 * Remove the items from the left page that were copied to the
--- 379,384 ----
***************
*** 394,404 ****
  					elog(PANIC, "failed to add new item to left page after split");
  			}

! 			/* Set high key equal to the first key on the right page */
! 			hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
! 			hiItem = PageGetItem(rpage, hiItemId);
!
! 			if (PageAddItem(lpage, hiItem, ItemIdGetLength(hiItemId),
  							P_HIKEY, false, false) == InvalidOffsetNumber)
  				elog(PANIC, "failed to add high key to left page after split");

--- 411,418 ----
  					elog(PANIC, "failed to add new item to left page after split");
  			}

! 			/* Set high key */
! 			if (PageAddItem(lpage, left_hikey, left_hikeysz,
  							P_HIKEY, false, false) == InvalidOffsetNumber)
  				elog(PANIC, "failed to add high key to left page after split");

Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.114
diff -c -r1.114 nbtree.h
*** src/include/access/nbtree.h	15 Nov 2007 21:14:42 -0000	1.114
--- src/include/access/nbtree.h	16 Nov 2007 10:59:59 -0000
***************
*** 289,294 ****
--- 289,298 ----
   * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
   * aligned.)
   *
+  * If level > 0, IndexTuple representing the HIKEY of the left page
+  * follows. We don't need it on leaf pages, because it's the same
+  * as the leftmost key on the new right page.
+  *
   * In the _L variants, next are OffsetNumber newitemoff and the new item.
   * (In the _R variants, the new item is one of the right page's tuples.)
   *
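
For anyone reading along without the nbtree code at hand, here is a rough toy model of why the shortcut is safe on the leaf level but not above it. It is only a sketch: ToyItem, ToyPage and the toy_* functions are made-up names for illustration, not the real page format or redo code. The one thing it assumes from the explanation above is that the leftmost item on an intermediate page keeps its downlink but has no key stored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an index tuple (hypothetical, not the on-disk format). */
typedef struct ToyItem
{
	unsigned	downlink;	/* child block number; unused on leaf pages */
	const char *key;		/* NULL models a key that is not stored */
} ToyItem;

/* Toy stand-in for the new right page after a split; data items only. */
typedef struct ToyPage
{
	int			level;		/* 0 = leaf, > 0 = intermediate */
	int			nitems;
	ToyItem		items[4];
} ToyPage;

/* Right page on an intermediate level: the leftmost item has no key. */
static ToyPage
toy_internal_right_page(void)
{
	ToyPage		page = {1, 2, {{10, NULL}, {11, "m"}}};

	return page;
}

/* Right page on the leaf level: every item carries its key. */
static ToyPage
toy_leaf_right_page(void)
{
	ToyPage		page = {0, 2, {{0, "k"}, {0, "m"}}};

	return page;
}

/* Pre-patch idea: always take the left page's high key from the first
 * item on the right page. */
static const char *
toy_left_hikey_old(const ToyPage *rpage)
{
	return rpage->items[0].key;
}

/* Post-patch idea: above the leaf level, use the high key carried in the
 * WAL record; on the leaf level the first right-page key is still valid. */
static const char *
toy_left_hikey_fixed(const ToyPage *rpage, const char *logged_hikey)
{
	if (rpage->level > 0)
		return logged_hikey;
	return rpage->items[0].key;
}
```

In this model toy_left_hikey_old() hands back an empty key for the intermediate page, which matches the "only tuple header is left" symptom in the report, while toy_left_hikey_fixed() returns whatever high key was logged for that level.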