Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC] - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
Date
Msg-id 552.1473445163@sss.pgh.pa.us
Whole thread Raw
In response to Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Here's a draft patch that replaces BRIN's use of PageIndexDeleteNoCompact
and an immediately following PageAddItemExtended with
PageIndexTupleOverwrite.  It turns out that there are two use-cases in
BRIN for PageIndexDeleteNoCompact, and the other one can't be replaced
because while there is a PageAddItem call beside it, that one is for a
different page.  But this is still enough to let us get rid of
PAI_ALLOW_FAR_OFFSET, which seems like a good thing.

Actually, with this patch, PageAddItemExtended isn't directly referenced
anywhere, so we could revert that API and fold it back into PageAddItem,
saving a nanosecond or two per call.  I'm inclined to leave that alone,
though, since I agree with the underlying thought that any future
extensions to PageAddItem's functionality would be better handled by more
bits in a flags bitmask.  Maybe a better idea is to get rid of the call
overhead by turning PageAddItem into a macro.  In any case, doing
something about that could be a separate patch.

It's also kind of tempting to rewrite PageIndexDeleteNoCompact into a
function that only deletes one tuple, and presumably is simpler and faster
than what's there.  But maybe there's something coming down the pike that
needs the extra generality?  That should also be a separate patch, anyway.

This passes make check-world but I'm not sure how heavily that stresses
BRIN ... are there any other tests you'd want to see run before
committing?

            regards, tom lane

diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 6ebfedd..24ae38b 100644
*** a/src/backend/access/brin/brin_pageops.c
--- b/src/backend/access/brin/brin_pageops.c
*************** brin_doupdate(Relation idxrel, BlockNumb
*** 178,187 ****
          }

          START_CRIT_SECTION();
!         PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
!         if (PageAddItemExtended(oldpage, (Item) newtup, newsz, oldoff,
!                 PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET) == InvalidOffsetNumber)
!             elog(ERROR, "failed to add BRIN tuple");
          MarkBufferDirty(oldbuf);

          /* XLOG stuff */
--- 178,185 ----
          }

          START_CRIT_SECTION();
!         if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) newtup, newsz))
!             elog(ERROR, "failed to replace BRIN tuple");
          MarkBufferDirty(oldbuf);

          /* XLOG stuff */
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..ed14729 100644
*** a/src/backend/access/brin/brin_xlog.c
--- b/src/backend/access/brin/brin_xlog.c
*************** brin_xlog_samepage_update(XLogReaderStat
*** 189,202 ****
          page = (Page) BufferGetPage(buffer);

          offnum = xlrec->offnum;
-         if (PageGetMaxOffsetNumber(page) + 1 < offnum)
-             elog(PANIC, "brin_xlog_samepage_update: invalid max offset number");

!         PageIndexDeleteNoCompact(page, &offnum, 1);
!         offnum = PageAddItemExtended(page, (Item) brintuple, tuplen, offnum,
!                                      PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET);
!         if (offnum == InvalidOffsetNumber)
!             elog(PANIC, "brin_xlog_samepage_update: failed to add tuple");

          PageSetLSN(page, lsn);
          MarkBufferDirty(buffer);
--- 189,197 ----
          page = (Page) BufferGetPage(buffer);

          offnum = xlrec->offnum;

!         if (!PageIndexTupleOverwrite(page, offnum, (Item) brintuple, tuplen))
!             elog(PANIC, "brin_xlog_samepage_update: failed to replace tuple");

          PageSetLSN(page, lsn);
          MarkBufferDirty(buffer);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index bce0d53..08e222e 100644
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
*************** PageIsVerified(Page page, BlockNumber bl
*** 166,186 ****
   *    inserted, or InvalidOffsetNumber if the item is not inserted for any
   *    reason.  A WARNING is issued indicating the reason for the refusal.
   *
!  *    If flag PAI_OVERWRITE is set, we just store the item at the specified
!  *    offsetNumber (which must be either a currently-unused item pointer,
!  *    or one past the last existing item).  Otherwise,
!  *    if offsetNumber is valid and <= current max offset in the page,
!  *    insert item into the array at that position by shuffling ItemId's
!  *    down to make room.
!  *    If offsetNumber is not valid, then assign one by finding the first
   *    one that is both unused and deallocated.
   *
   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
   *    MaxHeapTuplesPerPage line pointers on the page.
   *
-  *    If flag PAI_ALLOW_FAR_OFFSET is not set, we disallow placing items
-  *    beyond one past the last existing item.
-  *
   *    !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
   */
  OffsetNumber
--- 166,189 ----
   *    inserted, or InvalidOffsetNumber if the item is not inserted for any
   *    reason.  A WARNING is issued indicating the reason for the refusal.
   *
!  *    offsetNumber must be either InvalidOffsetNumber to specify finding a
!  *    free item pointer, or a value between FirstOffsetNumber and one past
!  *    the last existing item, to specify using that particular item pointer.
!  *
!  *    If offsetNumber is valid and flag PAI_OVERWRITE is set, we just store
!  *    the item at the specified offsetNumber, which must be either a
!  *    currently-unused item pointer, or one past the last existing item.
!  *
!  *    If offsetNumber is valid and flag PAI_OVERWRITE is not set, insert
!  *    the item at the specified offsetNumber, moving existing items later
!  *    in the array to make room.
!  *
!  *    If offsetNumber is not valid, then assign a slot by finding the first
   *    one that is both unused and deallocated.
   *
   *    If flag PAI_IS_HEAP is set, we enforce that there can't be more than
   *    MaxHeapTuplesPerPage line pointers on the page.
   *
   *    !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
   */
  OffsetNumber
*************** PageAddItemExtended(Page page,
*** 267,277 ****
          }
      }

!     /*
!      * Reject placing items beyond the first unused line pointer, unless
!      * caller asked for that behavior specifically.
!      */
!     if ((flags & PAI_ALLOW_FAR_OFFSET) == 0 && offsetNumber > limit)
      {
          elog(WARNING, "specified item offset is too large");
          return InvalidOffsetNumber;
--- 270,277 ----
          }
      }

!     /* Reject placing items beyond the first unused line pointer */
!     if (offsetNumber > limit)
      {
          elog(WARNING, "specified item offset is too large");
          return InvalidOffsetNumber;
*************** PageAddItemExtended(Page page,
*** 290,299 ****
       * Note: do arithmetic as signed ints, to avoid mistakes if, say,
       * alignedSize > pd_upper.
       */
!     if ((flags & PAI_ALLOW_FAR_OFFSET) != 0)
!         lower = Max(phdr->pd_lower,
!                     SizeOfPageHeaderData + sizeof(ItemIdData) * offsetNumber);
!     else if (offsetNumber == limit || needshuffle)
          lower = phdr->pd_lower + sizeof(ItemIdData);
      else
          lower = phdr->pd_lower;
--- 290,296 ----
       * Note: do arithmetic as signed ints, to avoid mistakes if, say,
       * alignedSize > pd_upper.
       */
!     if (offsetNumber == limit || needshuffle)
          lower = phdr->pd_lower + sizeof(ItemIdData);
      else
          lower = phdr->pd_lower;
*************** PageIndexDeleteNoCompact(Page page, Offs
*** 1093,1098 ****
--- 1090,1202 ----
      }
  }

+
+ /*
+  * PageIndexTupleOverwrite
+  *
+  * Replace a specified tuple on an index page.
+  *
+  * The new tuple is placed exactly where the old one had been, shifting
+  * other tuples' data up or down as needed to keep the page compacted.
+  * This is better than deleting and reinserting the tuple, because it
+  * avoids any data shifting when the tuple size doesn't change; and
+  * even when it does, we avoid moving the item pointers around.
+  * Conceivably this could also be of use to an index AM that cares about
+  * the physical order of tuples as well as their ItemId order.
+  *
+  * If there's insufficient space for the new tuple, return false.  Other
+  * errors represent data-corruption problems, so we just elog.
+  */
+ bool
+ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+                         Item newtup, Size newsize)
+ {
+     PageHeader    phdr = (PageHeader) page;
+     ItemId        tupid;
+     int            oldsize;
+     unsigned    offset;
+     Size        alignednewsize;
+     int            size_diff;
+     int            itemcount;
+
+     /*
+      * As with PageRepairFragmentation, paranoia seems justified.
+      */
+     if (phdr->pd_lower < SizeOfPageHeaderData ||
+         phdr->pd_lower > phdr->pd_upper ||
+         phdr->pd_upper > phdr->pd_special ||
+         phdr->pd_special > BLCKSZ ||
+         phdr->pd_special != MAXALIGN(phdr->pd_special))
+         ereport(ERROR,
+                 (errcode(ERRCODE_DATA_CORRUPTED),
+                  errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+                         phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+     itemcount = PageGetMaxOffsetNumber(page);
+     if ((int) offnum <= 0 || (int) offnum > itemcount)
+         elog(ERROR, "invalid index offnum: %u", offnum);
+
+     tupid = PageGetItemId(page, offnum);
+     Assert(ItemIdHasStorage(tupid));
+     oldsize = ItemIdGetLength(tupid);
+     offset = ItemIdGetOffset(tupid);
+
+     if (offset < phdr->pd_upper || (offset + oldsize) > phdr->pd_special ||
+         offset != MAXALIGN(offset))
+         ereport(ERROR,
+                 (errcode(ERRCODE_DATA_CORRUPTED),
+                  errmsg("corrupted item pointer: offset = %u, size = %u",
+                         offset, (unsigned int) oldsize)));
+
+     /*
+      * Determine actual change in space requirement, check for page overflow.
+      */
+     oldsize = MAXALIGN(oldsize);
+     alignednewsize = MAXALIGN(newsize);
+     if (alignednewsize > oldsize + (phdr->pd_upper - phdr->pd_lower))
+         return false;
+
+     /*
+      * Relocate existing data and update line pointers, unless the new tuple
+      * is the same size as the old (after alignment), in which case there's
+      * nothing to do.  Notice that what we have to relocate is data before the
+      * target tuple, not data after, so it's convenient to express size_diff
+      * as the amount by which the tuple's size is decreasing, making it the
+      * delta to add to pd_upper and affected line pointers.
+      */
+     size_diff = oldsize - (int) alignednewsize;
+     if (size_diff != 0)
+     {
+         char       *addr = (char *) page + phdr->pd_upper;
+         int            i;
+
+         /* relocate all tuple data before the target tuple */
+         memmove(addr + size_diff, addr, offset - phdr->pd_upper);
+
+         /* adjust free space boundary pointer */
+         phdr->pd_upper += size_diff;
+
+         /* adjust affected line pointers too */
+         for (i = FirstOffsetNumber; i <= itemcount; i++)
+         {
+             ItemId        ii = PageGetItemId(phdr, i);
+
+             /* Allow items without storage; currently only BRIN needs that */
+             if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
+                 ii->lp_off += size_diff;
+         }
+     }
+
+     /* Update the item's tuple length (other fields shouldn't change) */
+     ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+     /* Copy new tuple data onto page */
+     memcpy(PageGetItem(page, tupid), newtup, newsize);
+
+     return true;
+ }
+
+
  /*
   * Set checksum for a page in shared buffers.
   *
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..0ea47f5 100644
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
*************** do { \
*** 409,415 ****
   */
  #define PAI_OVERWRITE            (1 << 0)
  #define PAI_IS_HEAP                (1 << 1)
- #define PAI_ALLOW_FAR_OFFSET    (1 << 2)

  extern void PageInit(Page page, Size pageSize, Size specialSize);
  extern bool PageIsVerified(Page page, BlockNumber blkno);
--- 409,414 ----
*************** extern void PageIndexTupleDelete(Page pa
*** 429,434 ****
--- 428,435 ----
  extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
  extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
                           int nitems);
+ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+                         Item newtup, Size newsize);
  extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
  extern void PageSetChecksumInplace(Page page, BlockNumber blkno);


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Use LEFT JOINs in some system views in case referenced row doesn
Next
From: Greg Stark
Date:
Subject: Re: GiST penalty functions [PoC]