Re: [GENERAL] 3rd time is a charm.....right sibling is not next child crash. - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [GENERAL] 3rd time is a charm.....right sibling is not next child crash.
Date
Msg-id 1975.1283039034@sss.pgh.pa.us
Whole thread Raw
In response to Re: [GENERAL] 3rd time is a charm.....right sibling is not next child crash.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> it did occur to me that there is a simple way to ameliorate this
> problem: we could rearrange the code in _bt_pagedel() so it checks for
> this case before entering its critical section.  Then, corruption of
> this kind is at least only an ERROR not a PANIC.

In particular, I propose the attached patch, which gets rid of
unnecessary PANIC cases in _bt_split as well as _bt_pagedel;
all of these get reported from the field every now and then.
In passing this also makes the error messages out of _bt_split
a bit more detailed.

FWIW, I think the original rationale for PANIC here was so we could
capture a core dump for study; but since no one has ever yet cooperated
by providing such a dump, it seems like not panicking is a better plan.

Barring objections, I plan to back-patch this as far as it will
conveniently go (looks like 8.2 or 8.3 or so).

            regards, tom lane

Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.178
diff -c -r1.178 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    28 Mar 2010 09:27:01 -0000    1.178
--- src/backend/access/nbtree/nbtinsert.c    28 Aug 2010 23:31:44 -0000
***************
*** 74,82 ****
  static void _bt_checksplitloc(FindSplitData *state,
                    OffsetNumber firstoldonright, bool newitemonleft,
                    int dataitemstoleft, Size firstoldonrightsz);
! static void _bt_pgaddtup(Relation rel, Page page,
!              Size itemsize, IndexTuple itup,
!              OffsetNumber itup_off, const char *where);
  static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
              int keysz, ScanKey scankey);
  static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
--- 74,81 ----
  static void _bt_checksplitloc(FindSplitData *state,
                    OffsetNumber firstoldonright, bool newitemonleft,
                    int dataitemstoleft, Size firstoldonrightsz);
! static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
!              OffsetNumber itup_off);
  static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
              int keysz, ScanKey scankey);
  static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
***************
*** 753,759 ****
          /* Do the update.  No ereport(ERROR) until changes are logged */
          START_CRIT_SECTION();

!         _bt_pgaddtup(rel, page, itemsz, itup, newitemoff, "page");

          MarkBufferDirty(buf);

--- 752,760 ----
          /* Do the update.  No ereport(ERROR) until changes are logged */
          START_CRIT_SECTION();

!         if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
!             elog(PANIC, "failed to add new item to block %u in index \"%s\"",
!                  itup_blkno, RelationGetRelationName(rel));

          MarkBufferDirty(buf);

***************
*** 879,884 ****
--- 880,887 ----
      Page        origpage;
      Page        leftpage,
                  rightpage;
+     BlockNumber origpagenumber,
+                 rightpagenumber;
      BTPageOpaque ropaque,
                  lopaque,
                  oopaque;
***************
*** 894,904 ****
--- 897,923 ----
      OffsetNumber i;
      bool        isroot;

+     /* Acquire a new page to split into */
      rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
+
+     /*
+      * origpage is the original page to be split.  leftpage is a temporary
+      * buffer that receives the left-sibling data, which will be copied back
+      * into origpage on success.  rightpage is the new page that receives
+      * the right-sibling data.  If we fail before reaching the critical
+      * section, origpage hasn't been modified and leftpage is only workspace.
+      * In principle we shouldn't need to worry about rightpage either,
+      * because it hasn't been linked into the btree page structure; but to
+      * avoid leaving possibly-confusing junk behind, we are careful to reinit
+      * rightpage to empty before throwing any error.
+      */
      origpage = BufferGetPage(buf);
      leftpage = PageGetTempPage(origpage);
      rightpage = BufferGetPage(rbuf);

+     origpagenumber = BufferGetBlockNumber(buf);
+     rightpagenumber = BufferGetBlockNumber(rbuf);
+
      _bt_pageinit(leftpage, BufferGetPageSize(buf));
      /* rightpage was already initialized by _bt_getbuf */

***************
*** 923,930 ****
      lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
      ropaque->btpo_flags = lopaque->btpo_flags;
      lopaque->btpo_prev = oopaque->btpo_prev;
!     lopaque->btpo_next = BufferGetBlockNumber(rbuf);
!     ropaque->btpo_prev = BufferGetBlockNumber(buf);
      ropaque->btpo_next = oopaque->btpo_next;
      lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
      /* Since we already have write-lock on both pages, ok to read cycleid */
--- 942,949 ----
      lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
      ropaque->btpo_flags = lopaque->btpo_flags;
      lopaque->btpo_prev = oopaque->btpo_prev;
!     lopaque->btpo_next = rightpagenumber;
!     ropaque->btpo_prev = origpagenumber;
      ropaque->btpo_next = oopaque->btpo_next;
      lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
      /* Since we already have write-lock on both pages, ok to read cycleid */
***************
*** 947,955 ****
          item = (IndexTuple) PageGetItem(origpage, itemid);
          if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
                          false, false) == InvalidOffsetNumber)
!             elog(PANIC, "failed to add hikey to the right sibling"
                   " while splitting block %u of index \"%s\"",
!                  BufferGetBlockNumber(buf), RelationGetRelationName(rel));
          rightoff = OffsetNumberNext(rightoff);
      }

--- 966,977 ----
          item = (IndexTuple) PageGetItem(origpage, itemid);
          if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
                          false, false) == InvalidOffsetNumber)
!         {
!             _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!             elog(ERROR, "failed to add hikey to the right sibling"
                   " while splitting block %u of index \"%s\"",
!                  origpagenumber, RelationGetRelationName(rel));
!         }
          rightoff = OffsetNumberNext(rightoff);
      }

***************
*** 974,982 ****
      }
      if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
                      false, false) == InvalidOffsetNumber)
!         elog(PANIC, "failed to add hikey to the left sibling"
               " while splitting block %u of index \"%s\"",
!              BufferGetBlockNumber(buf), RelationGetRelationName(rel));
      leftoff = OffsetNumberNext(leftoff);

      /*
--- 996,1007 ----
      }
      if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
                      false, false) == InvalidOffsetNumber)
!     {
!         _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!         elog(ERROR, "failed to add hikey to the left sibling"
               " while splitting block %u of index \"%s\"",
!              origpagenumber, RelationGetRelationName(rel));
!     }
      leftoff = OffsetNumberNext(leftoff);

      /*
***************
*** 998,1011 ****
          {
              if (newitemonleft)
              {
!                 _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
!                              "left sibling");
                  leftoff = OffsetNumberNext(leftoff);
              }
              else
              {
!                 _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
!                              "right sibling");
                  rightoff = OffsetNumberNext(rightoff);
              }
          }
--- 1023,1046 ----
          {
              if (newitemonleft)
              {
!                 if (!_bt_pgaddtup(leftpage, newitemsz, newitem, leftoff))
!                 {
!                     _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!                     elog(ERROR, "failed to add new item to the left sibling"
!                          " while splitting block %u of index \"%s\"",
!                          origpagenumber, RelationGetRelationName(rel));
!                 }
                  leftoff = OffsetNumberNext(leftoff);
              }
              else
              {
!                 if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
!                 {
!                     _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!                     elog(ERROR, "failed to add new item to the right sibling"
!                          " while splitting block %u of index \"%s\"",
!                          origpagenumber, RelationGetRelationName(rel));
!                 }
                  rightoff = OffsetNumberNext(rightoff);
              }
          }
***************
*** 1013,1026 ****
          /* decide which page to put it on */
          if (i < firstright)
          {
!             _bt_pgaddtup(rel, leftpage, itemsz, item, leftoff,
!                          "left sibling");
              leftoff = OffsetNumberNext(leftoff);
          }
          else
          {
!             _bt_pgaddtup(rel, rightpage, itemsz, item, rightoff,
!                          "right sibling");
              rightoff = OffsetNumberNext(rightoff);
          }
      }
--- 1048,1071 ----
          /* decide which page to put it on */
          if (i < firstright)
          {
!             if (!_bt_pgaddtup(leftpage, itemsz, item, leftoff))
!             {
!                 _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!                 elog(ERROR, "failed to add old item to the left sibling"
!                      " while splitting block %u of index \"%s\"",
!                      origpagenumber, RelationGetRelationName(rel));
!             }
              leftoff = OffsetNumberNext(leftoff);
          }
          else
          {
!             if (!_bt_pgaddtup(rightpage, itemsz, item, rightoff))
!             {
!                 _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!                 elog(ERROR, "failed to add old item to the right sibling"
!                      " while splitting block %u of index \"%s\"",
!                      origpagenumber, RelationGetRelationName(rel));
!             }
              rightoff = OffsetNumberNext(rightoff);
          }
      }
***************
*** 1034,1041 ****
           * not be splitting the page).
           */
          Assert(!newitemonleft);
!         _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
!                      "right sibling");
          rightoff = OffsetNumberNext(rightoff);
      }

--- 1079,1091 ----
           * not be splitting the page).
           */
          Assert(!newitemonleft);
!         if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
!         {
!             _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!             elog(ERROR, "failed to add new item to the right sibling"
!                  " while splitting block %u of index \"%s\"",
!                  origpagenumber, RelationGetRelationName(rel));
!         }
          rightoff = OffsetNumberNext(rightoff);
      }

***************
*** 1047,1062 ****
       * neighbors.
       */

!     if (!P_RIGHTMOST(ropaque))
      {
!         sbuf = _bt_getbuf(rel, ropaque->btpo_next, BT_WRITE);
          spage = BufferGetPage(sbuf);
          sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
!         if (sopaque->btpo_prev != ropaque->btpo_prev)
!             elog(PANIC, "right sibling's left-link doesn't match: "
!                "block %u links to %u instead of expected %u in index \"%s\"",
!                  ropaque->btpo_next, sopaque->btpo_prev, ropaque->btpo_prev,
                   RelationGetRelationName(rel));

          /*
           * Check to see if we can set the SPLIT_END flag in the right-hand
--- 1097,1115 ----
       * neighbors.
       */

!     if (!P_RIGHTMOST(oopaque))
      {
!         sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
          spage = BufferGetPage(sbuf);
          sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
!         if (sopaque->btpo_prev != origpagenumber)
!         {
!             _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
!             elog(ERROR, "right sibling's left-link doesn't match: "
!                  "block %u links to %u instead of expected %u in index \"%s\"",
!                  oopaque->btpo_next, sopaque->btpo_prev, origpagenumber,
                   RelationGetRelationName(rel));
+         }

          /*
           * Check to see if we can set the SPLIT_END flag in the right-hand
***************
*** 1081,1088 ****
       *
       * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
       * not starting the critical section till here because we haven't been
!      * scribbling on the original page yet, and we don't care about the new
!      * sibling until it's linked into the btree.
       */
      START_CRIT_SECTION();

--- 1134,1140 ----
       *
       * NO EREPORT(ERROR) till right sibling is updated.  We can get away with
       * not starting the critical section till here because we haven't been
!      * scribbling on the original page yet; see comments above.
       */
      START_CRIT_SECTION();

***************
*** 1094,1112 ****
       * (in the page management code) that the center of a page always be
       * clean, and the most efficient way to guarantee this is just to compact
       * the data by reinserting it into a new left page.  (XXX the latter
!      * comment is probably obsolete.)
       *
       * We need to do this before writing the WAL record, so that XLogInsert
       * can WAL log an image of the page if necessary.
       */
      PageRestoreTempPage(leftpage, origpage);

      MarkBufferDirty(buf);
      MarkBufferDirty(rbuf);

      if (!P_RIGHTMOST(ropaque))
      {
!         sopaque->btpo_prev = BufferGetBlockNumber(rbuf);
          MarkBufferDirty(sbuf);
      }

--- 1146,1166 ----
       * (in the page management code) that the center of a page always be
       * clean, and the most efficient way to guarantee this is just to compact
       * the data by reinserting it into a new left page.  (XXX the latter
!      * comment is probably obsolete; but in any case it's good to not scribble
!      * on the original page until we enter the critical section.)
       *
       * We need to do this before writing the WAL record, so that XLogInsert
       * can WAL log an image of the page if necessary.
       */
      PageRestoreTempPage(leftpage, origpage);
+     /* leftpage, lopaque must not be used below here */

      MarkBufferDirty(buf);
      MarkBufferDirty(rbuf);

      if (!P_RIGHTMOST(ropaque))
      {
!         sopaque->btpo_prev = rightpagenumber;
          MarkBufferDirty(sbuf);
      }

***************
*** 1120,1127 ****
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
!         xlrec.leftsib = BufferGetBlockNumber(buf);
!         xlrec.rightsib = BufferGetBlockNumber(rbuf);
          xlrec.rnext = ropaque->btpo_next;
          xlrec.level = ropaque->btpo.level;
          xlrec.firstright = firstright;
--- 1174,1181 ----
          XLogRecData *lastrdata;

          xlrec.node = rel->rd_node;
!         xlrec.leftsib = origpagenumber;
!         xlrec.rightsib = rightpagenumber;
          xlrec.rnext = ropaque->btpo_next;
          xlrec.level = ropaque->btpo.level;
          xlrec.firstright = firstright;
***************
*** 1920,1932 ****
   *        we insert the tuples in order, so that the given itup_off does
   *        represent the final position of the tuple!
   */
! static void
! _bt_pgaddtup(Relation rel,
!              Page page,
               Size itemsize,
               IndexTuple itup,
!              OffsetNumber itup_off,
!              const char *where)
  {
      BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
      IndexTupleData trunctuple;
--- 1974,1984 ----
   *        we insert the tuples in order, so that the given itup_off does
   *        represent the final position of the tuple!
   */
! static bool
! _bt_pgaddtup(Page page,
               Size itemsize,
               IndexTuple itup,
!              OffsetNumber itup_off)
  {
      BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
      IndexTupleData trunctuple;
***************
*** 1941,1948 ****

      if (PageAddItem(page, (Item) itup, itemsize, itup_off,
                      false, false) == InvalidOffsetNumber)
!         elog(PANIC, "failed to add item to the %s in index \"%s\"",
!              where, RelationGetRelationName(rel));
  }

  /*
--- 1993,2001 ----

      if (PageAddItem(page, (Item) itup, itemsize, itup_off,
                      false, false) == InvalidOffsetNumber)
!         return false;
!
!     return true;
  }

  /*
Index: src/backend/access/nbtree/nbtpage.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v
retrieving revision 1.123
diff -c -r1.123 nbtpage.c
*** src/backend/access/nbtree/nbtpage.c    6 Jul 2010 19:18:55 -0000    1.123
--- src/backend/access/nbtree/nbtpage.c    28 Aug 2010 23:31:44 -0000
***************
*** 1253,1258 ****
--- 1253,1290 ----
      }

      /*
+      * Check to make sure the index items we're about to delete/overwrite
+      * contain what we expect.  This can fail if the index has become
+      * corrupt for some reason.  We want to throw any error before entering
+      * the critical section --- otherwise it'd be a PANIC.
+      *
+      * The test on the target item is just an Assert because _bt_getstackbuf
+      * should have guaranteed it has the expected contents.  The test on the
+      * next-child downlink is known to fail in the field, though.
+      */
+     page = BufferGetPage(pbuf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ #ifdef USE_ASSERT_CHECKING
+     itemid = PageGetItemId(page, poffset);
+     itup = (IndexTuple) PageGetItem(page, itemid);
+     Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
+ #endif
+
+     if (!parent_half_dead)
+     {
+         OffsetNumber nextoffset;
+
+         nextoffset = OffsetNumberNext(poffset);
+         itemid = PageGetItemId(page, nextoffset);
+         itup = (IndexTuple) PageGetItem(page, itemid);
+         if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
+             elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
+                  rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
+                  parent, RelationGetRelationName(rel));
+     }
+
+     /*
       * Here we begin doing the deletion.
       */

***************
*** 1265,1272 ****
       * to copy the right sibling's downlink over the target downlink, and then
       * delete the following item.
       */
-     page = BufferGetPage(pbuf);
-     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
      if (parent_half_dead)
      {
          PageIndexTupleDelete(page, poffset);
--- 1297,1302 ----
***************
*** 1278,1294 ****

          itemid = PageGetItemId(page, poffset);
          itup = (IndexTuple) PageGetItem(page, itemid);
-         Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
          ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);

          nextoffset = OffsetNumberNext(poffset);
-         /* This part is just for double-checking */
-         itemid = PageGetItemId(page, nextoffset);
-         itup = (IndexTuple) PageGetItem(page, itemid);
-         if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
-             elog(PANIC, "right sibling %u of block %u is not next child of %u in index \"%s\"",
-                  rightsib, target, BufferGetBlockNumber(pbuf),
-                  RelationGetRelationName(rel));
          PageIndexTupleDelete(page, nextoffset);
      }

--- 1308,1316 ----

pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
Next
From: Pavel Stehule
Date:
Subject: performance issue: logical operators are slow inside SQL function: missing optimalization?