Re: B-tree crash recovery error in 8.3 beta 2 - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: B-tree crash recovery error in 8.3 beta 2
Date
Msg-id 473D7A41.5040507@enterprisedb.com
In response to B-tree crash recovery error in 8.3 beta 2  (Koichi Suzuki <suzuki.koichi@oss.ntt.co.jp>)
Responses Re: B-tree crash recovery error in 8.3 beta 2
List pgsql-bugs

Koichi Suzuki wrote:
> I've found that B-tree crash recovery in 8.3 beta2 could make some
> tuples invisible through B-tree.  They're still visible if we read
> using Seq-Scan.   This happens in 8.3 beta2, but not in 8.2.4.  Here's
> how it happens.
>
> 1. Create b-tree for a text type column.
> 2. Make B-tree three-story, that is, root-intermediate-leaf.  Insert
> tuples sufficient to construct such B-tree.
> 3. No checkpoint should occur during 2.
> 4. Kill postmaster.
> 5. Restart postmaster.   Crash recovery will be done.
> 6. Tuples with column values less than HIKEY become invisible through
> Idx-scan, but are still visible through Seq-scan.
>
> From the dump of B-tree, it seems that HIKEY value is cleared (only
> tuple header is left).  No problem was found in the case of integer or
> numeric type columns.
>
> Attached is the shell script, postgresql.conf (almost the default one)
> to reproduce the problem, and the log of the problem reproduction.

Thanks for the excellent reproducer script!

There seems to be a bug in the B-tree split WAL reduction patch from
February. On split, we copy the HIKEY of the left page from the
leftmost item on the right page, but that doesn't work because the
leftmost key is not stored on intermediate levels.

Patch attached that stores the high key explicitly in the WAL record on
intermediate levels.
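
To illustrate why the shortcut is only safe on the leaf level, here is a
minimal toy model in C. It is not PostgreSQL code: ToyPage,
toy_make_right_page and toy_hikey_from_right are hypothetical names, and an
empty string stands in for a tuple that has only its header left. It assumes
the usual nbtree convention that the first data item on a non-leaf page
carries a downlink but no key.

```c
#include <assert.h>
#include <string.h>

/* Toy model only; none of these names exist in PostgreSQL. */
#define TOY_MAX_ITEMS 8
#define TOY_KEY_LEN   16

typedef struct ToyPage
{
    int  level;                              /* 0 = leaf, > 0 = internal */
    int  nitems;
    char keys[TOY_MAX_ITEMS][TOY_KEY_LEN];   /* "" models a header-only tuple */
} ToyPage;

/* Fill a page the way the right half of a split would look. */
static void
toy_make_right_page(ToyPage *page, int level, const char **keys, int nkeys)
{
    int i;

    assert(nkeys > 0 && nkeys <= TOY_MAX_ITEMS);
    page->level = level;
    page->nitems = nkeys;

    for (i = 0; i < nkeys; i++)
    {
        if (level > 0 && i == 0)
        {
            /* Assumption: first item on an internal page stores no key. */
            page->keys[i][0] = '\0';
        }
        else
        {
            strncpy(page->keys[i], keys[i], TOY_KEY_LEN - 1);
            page->keys[i][TOY_KEY_LEN - 1] = '\0';
        }
    }
}

/* The shortcut: derive the left page's high key from the right page. */
static const char *
toy_hikey_from_right(const ToyPage *right)
{
    return right->keys[0];
}
```

With the same three keys, toy_hikey_from_right() returns the real first key
for a page built with level 0, but an empty key for a page built with
level 1. That empty key matches the reported symptom of a high key with only
the tuple header left.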

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.161
diff -c -r1.161 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    15 Nov 2007 21:14:32 -0000    1.161
--- src/backend/access/nbtree/nbtinsert.c    16 Nov 2007 11:03:46 -0000
***************
*** 1004,1010 ****
          xl_btree_split xlrec;
          uint8        xlinfo;
          XLogRecPtr    recptr;
!         XLogRecData rdata[6];
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
--- 1004,1010 ----
          xl_btree_split xlrec;
          uint8        xlinfo;
          XLogRecPtr    recptr;
!         XLogRecData rdata[7];
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
***************
*** 1020,1034 ****

          lastrdata = &rdata[0];

-         /* Log downlink on non-leaf pages. */
          if (ropaque->btpo.level > 0)
          {
              lastrdata->next = lastrdata + 1;
              lastrdata++;

              lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
              lastrdata->len = sizeof(BlockIdData);
              lastrdata->buffer = InvalidBuffer;
          }

          /*
--- 1020,1044 ----

          lastrdata = &rdata[0];

          if (ropaque->btpo.level > 0)
          {
+             /* Log downlink */
              lastrdata->next = lastrdata + 1;
              lastrdata++;

              lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
              lastrdata->len = sizeof(BlockIdData);
              lastrdata->buffer = InvalidBuffer;
+
+             /* Log high key of left page */
+             lastrdata->next = lastrdata + 1;
+             lastrdata++;
+
+             itemid = PageGetItemId(origpage, P_HIKEY);
+             item = (IndexTuple) PageGetItem(origpage, itemid);
+             lastrdata->data = (char *) item;
+             lastrdata->len = IndexTupleSize(item);
+             lastrdata->buffer = InvalidBuffer;
          }

          /*
Index: src/backend/access/nbtree/nbtxlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtxlog.c,v
retrieving revision 1.47
diff -c -r1.47 nbtxlog.c
*** src/backend/access/nbtree/nbtxlog.c    15 Nov 2007 21:14:32 -0000    1.47
--- src/backend/access/nbtree/nbtxlog.c    16 Nov 2007 10:59:59 -0000
***************
*** 273,278 ****
--- 273,280 ----
      OffsetNumber newitemoff = 0;
      Item        newitem = NULL;
      Size        newitemsz = 0;
+     Item        left_hikey = NULL; /* only 16-bit aligned */
+     Size        left_hikeysz = 0;

      reln = XLogOpenRelation(xlrec->node);

***************
*** 289,294 ****
--- 291,303 ----
          datalen -= sizeof(BlockIdData);

          forget_matching_split(xlrec->node, downlink, false);
+
+         /* Extract left hikey and its size (still assuming 16-bit alignment) */
+         left_hikey = (Item) datapos;
+         left_hikeysz = IndexTupleSize(left_hikey);
+
+         datapos += left_hikeysz;
+         datalen -= left_hikeysz;
      }

      /* Extract newitem and newitemoff, if present */
***************
*** 333,338 ****
--- 342,357 ----

      _bt_restore_page(rpage, datapos, datalen);

+     /* On leaf level, the high key of the left page is equal to the
+      * first key on the right page.
+      */
+     if (xlrec->level == 0)
+     {
+         ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
+         left_hikey = PageGetItem(rpage, hiItemId);
+         left_hikeysz = ItemIdGetLength(hiItemId);
+     }
+
      PageSetLSN(rpage, lsn);
      PageSetTLI(rpage, ThisTimeLineID);
      MarkBufferDirty(rbuf);
***************
*** 360,367 ****
                  OffsetNumber maxoff = PageGetMaxOffsetNumber(lpage);
                  OffsetNumber deletable[MaxOffsetNumber];
                  int            ndeletable = 0;
-                 ItemId        hiItemId;
-                 Item        hiItem;

                  /*
                   * Remove the items from the left page that were copied to the
--- 379,384 ----
***************
*** 394,404 ****
                          elog(PANIC, "failed to add new item to left page after split");
                  }

!                 /* Set high key equal to the first key on the right page */
!                 hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
!                 hiItem = PageGetItem(rpage, hiItemId);
!
!                 if (PageAddItem(lpage, hiItem, ItemIdGetLength(hiItemId),
                                  P_HIKEY, false, false) == InvalidOffsetNumber)
                      elog(PANIC, "failed to add high key to left page after split");

--- 411,418 ----
                          elog(PANIC, "failed to add new item to left page after split");
                  }

!                 /* Set high key */
!                 if (PageAddItem(lpage, left_hikey, left_hikeysz,
                                  P_HIKEY, false, false) == InvalidOffsetNumber)
                      elog(PANIC, "failed to add high key to left page after split");

Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.114
diff -c -r1.114 nbtree.h
*** src/include/access/nbtree.h    15 Nov 2007 21:14:42 -0000    1.114
--- src/include/access/nbtree.h    16 Nov 2007 10:59:59 -0000
***************
*** 289,294 ****
--- 289,298 ----
       * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
       * aligned.)
       *
+      * If level > 0, IndexTuple representing the HIKEY of the left page
+      * follows. We don't need it on leaf pages, because it's the same
+      * as the leftmost key on the new right page.
+      *
       * In the _L variants, next are OffsetNumber newitemoff and the new item.
       * (In the _R variants, the new item is one of the right page's tuples.)
       *
