Thread: BTree vacuum before page splitting

From: Junji TERAMOTO
Hi all,

This patch adds a function that removes unnecessary items from a BTree
page before the page is split.

When a new item is to be put on a page, we look for items marked
LP_DELETE and remove them. No heap access is required, and the penalty
is small because the work takes the place of a page split.

Items are removed only when the super-exclusive lock can be obtained,
because deleting items might otherwise disturb concurrently running
index scans. It is an optimistic control: if the lock is not held, we
simply fall back to the usual behavior.

Note that even when this check succeeds, our own scan may be paused on
the page (a backend's own pins are counted only once in the shared
refcount), so our own scan position can still be lost; we therefore
also modify _bt_restscan().

With this patch, the growth of frequently updated indexes can be
suppressed.

Your comments are welcome. Thanks.

----
For example (pgbench):
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 50
number of transactions per client: 5000
number of transactions actually processed: 250000/250000

[before]
Relation Name         File Name    Tuples  Pages  File Size (KB)
public.accounts_pkey      16568  10001200  21899          175192
public.branches_pkey      16564       100    410            3280
public.tellers_pkey       16566      1000    374            2992

[after]
Relation Name         File Name    Tuples  Pages  File Size (KB)
public.accounts_pkey      16688  10004100  21899          175192
public.branches_pkey      16684       100     44             352  <==
public.tellers_pkey       16686      1000     26             208  <==
----

We have also tested this patch on DBT-2.

--
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp
NTT Cyber Space Lab.
diff -cpr pgsql-orig/src/backend/access/nbtree/nbtinsert.c pgsql/src/backend/access/nbtree/nbtinsert.c
*** pgsql-orig/src/backend/access/nbtree/nbtinsert.c    2006-01-24 21:39:25.000000000 +0900
--- pgsql/src/backend/access/nbtree/nbtinsert.c    2006-01-25 05:07:56.000000000 +0900
*************** static void _bt_pgaddtup(Relation rel, P
*** 62,67 ****
--- 62,68 ----
               OffsetNumber itup_off, const char *where);
  static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
              int keysz, ScanKey scankey);
+ static int    _bt_vacuum_page(Relation rel, Buffer buf);


  /*
*************** _bt_check_unique(Relation rel, BTItem bt
*** 260,265 ****
--- 261,267 ----
                                                   hbuffer) == HEAPTUPLE_DEAD)
                      {
                          curitemid->lp_flags |= LP_DELETE;
+                         opaque->btpo_flags |= BTP_HAS_DEAD;
                          if (nbuf != InvalidBuffer)
                              SetBufferCommitInfoNeedsSave(nbuf);
                          else
*************** _bt_insertonpg(Relation rel,
*** 424,433 ****
           */
          bool        movedright = false;

!         while (PageGetFreeSpace(page) < itemsz &&
!                !P_RIGHTMOST(lpageop) &&
!                _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0 &&
!                random() > (MAX_RANDOM_VALUE / 100))
          {
              /*
               * step right to next non-dead page
--- 426,432 ----
           */
          bool        movedright = false;

!         while (PageGetFreeSpace(page) < itemsz)
          {
              /*
               * step right to next non-dead page
*************** _bt_insertonpg(Relation rel,
*** 440,445 ****
--- 439,457 ----
               */
              Buffer        rbuf = InvalidBuffer;

+             if (P_ISLEAF(lpageop) && P_HAS_DEAD(lpageop) &&
+                 !rel->rd_istemp && BufferSuperLockHeldByMe(buf))
+             {
+                 _bt_vacuum_page(rel, buf);
+                 if (PageGetFreeSpace(page) >= itemsz)
+                     break;
+             }
+
+             if (P_RIGHTMOST(lpageop) ||
+                 _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
+                 random() <= (MAX_RANDOM_VALUE / 100))
+                 break;
+
              for (;;)
              {
                  BlockNumber rblkno = lpageop->btpo_next;
*************** _bt_isequal(TupleDesc itupdesc, Page pag
*** 1590,1592 ****
--- 1602,1651 ----
      /* if we get here, the keys are equal */
      return true;
  }
+
+ /*
+  * _bt_vacuum_page
+  *
+  * @param buf    Caller must hold a super-exclusive lock on buf.
+  */
+ static int
+ _bt_vacuum_page(Relation rel, Buffer buf)
+ {
+     OffsetNumber    deletable[MaxOffsetNumber];
+     BTPageOpaque    opaque;
+     OffsetNumber    offnum, minoff, maxoff;
+     int                ndeletable;
+     Page            page = BufferGetPage(buf);
+
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+     if (!P_ISLEAF(opaque) || !P_HAS_DEAD(opaque))
+         return -1;
+
+     minoff = P_FIRSTDATAKEY(opaque);
+     maxoff = PageGetMaxOffsetNumber(page);
+     if (minoff > maxoff)
+         return -1;
+
+     ndeletable = 0;
+
+     for (offnum = minoff;
+          offnum <= maxoff;
+          offnum = OffsetNumberNext(offnum))
+     {
+         ItemId itemId = PageGetItemId(page, offnum);
+         if (!ItemIdIsUsed(itemId) || ItemIdDeleted(itemId))
+         {
+             deletable[ndeletable++] = offnum;
+         }
+     }
+
+     if (ndeletable > 0)
+     {
+         _bt_delitems(rel, buf, deletable, ndeletable);
+     }
+
+     /* clear the dead tuple hint bit */
+     opaque->btpo_flags &= ~BTP_HAS_DEAD;
+
+     return ndeletable;
+ }
diff -cpr pgsql-orig/src/backend/access/nbtree/nbtree.c pgsql/src/backend/access/nbtree/nbtree.c
*** pgsql-orig/src/backend/access/nbtree/nbtree.c    2006-01-24 21:39:25.000000000 +0900
--- pgsql/src/backend/access/nbtree/nbtree.c    2006-01-25 05:30:06.000000000 +0900
*************** btgettuple(PG_FUNCTION_ARGS)
*** 271,276 ****
--- 271,277 ----
              offnum = ItemPointerGetOffsetNumber(&(scan->currentItemData));
              page = BufferGetPage(so->btso_curbuf);
              PageGetItemId(page, offnum)->lp_flags |= LP_DELETE;
+             ((BTPageOpaque) PageGetSpecialPointer(page))->btpo_flags |= BTP_HAS_DEAD;

              /*
               * Since this can be redone later if needed, it's treated the same
*************** btbulkdelete(PG_FUNCTION_ARGS)
*** 632,637 ****
--- 633,641 ----
                  }
              }

+             /* clear the dead tuple hint bit */
+             opaque->btpo_flags &= ~BTP_HAS_DEAD;
+
              /*
               * If we need to delete anything, do it and write the buffer; else
               * just release the buffer.
*************** _bt_restscan(IndexScanDesc scan)
*** 898,903 ****
--- 902,929 ----
          return;
      }

+     /* Check for item on this page */
+     if (offnum <= maxoff)
+     {
+         for (;
+              offnum <= maxoff;
+              offnum = OffsetNumberNext(offnum))
+         {
+             item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
+             if (BTTidSame(item->bti_itup.t_tid, *target))
+             {
+                 /* Found it */
+                 current->ip_posid = offnum;
+                 return;
+             }
+         }
+
+         /* Not found here: items may move left when LP_DELETE tuples are removed, so search again from the head of the page */
+         maxoff = ItemPointerGetOffsetNumber(current) - 1;
+     }
+
+     offnum = P_FIRSTDATAKEY(opaque);
+
      /*
       * The item we were on may have moved right due to insertions. Find it
       * again.  We use the heap TID to identify the item uniquely.
diff -cpr pgsql-orig/src/backend/storage/buffer/bufmgr.c pgsql/src/backend/storage/buffer/bufmgr.c
*** pgsql-orig/src/backend/storage/buffer/bufmgr.c    2006-01-24 21:39:25.000000000 +0900
--- pgsql/src/backend/storage/buffer/bufmgr.c    2006-01-25 05:19:50.000000000 +0900
*************** LockBufferForCleanup(Buffer buffer)
*** 1880,1885 ****
--- 1880,1904 ----
  }

  /*
+  * BufferSuperLockHeldByMe
+  *
+  * @param buffer    Caller must already hold BUFFER_LOCK_EXCLUSIVE.
+  */
+ bool
+ BufferSuperLockHeldByMe(Buffer buffer)
+ {
+     volatile BufferDesc *bufHdr;
+     unsigned    refcount;
+
+     bufHdr = &BufferDescriptors[buffer - 1];
+     LockBufHdr(bufHdr);
+     refcount = bufHdr->refcount;
+     UnlockBufHdr(bufHdr);
+
+     return refcount == 1;
+ }
+
+ /*
   *    Functions for buffer I/O handling
   *
   *    Note: We assume that nested buffer I/O never occurs.
diff -cpr pgsql-orig/src/include/access/nbtree.h pgsql/src/include/access/nbtree.h
*** pgsql-orig/src/include/access/nbtree.h    2006-01-24 21:39:25.000000000 +0900
--- pgsql/src/include/access/nbtree.h    2006-01-25 04:58:05.000000000 +0900
*************** typedef BTPageOpaqueData *BTPageOpaque;
*** 55,60 ****
--- 55,61 ----
  #define BTP_DELETED        (1 << 2)    /* page has been deleted from tree */
  #define BTP_META        (1 << 3)    /* meta-page */
  #define BTP_HALF_DEAD    (1 << 4)    /* empty, but still in tree */
+ #define BTP_HAS_DEAD    (1 << 5)    /* page has dead tuples */


  /*
*************** typedef BTItemData *BTItem;
*** 155,160 ****
--- 156,162 ----
  #define P_ISROOT(opaque)        ((opaque)->btpo_flags & BTP_ROOT)
  #define P_ISDELETED(opaque)        ((opaque)->btpo_flags & BTP_DELETED)
  #define P_IGNORE(opaque)        ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
+ #define P_HAS_DEAD(opaque)        ((opaque)->btpo_flags & BTP_HAS_DEAD)

  /*
   *    Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
diff -cpr pgsql-orig/src/include/storage/bufmgr.h pgsql/src/include/storage/bufmgr.h
*** pgsql-orig/src/include/storage/bufmgr.h    2006-01-24 21:39:25.000000000 +0900
--- pgsql/src/include/storage/bufmgr.h    2006-01-25 05:06:04.000000000 +0900
*************** extern void UnlockBuffers(void);
*** 149,154 ****
--- 149,155 ----
  extern void LockBuffer(Buffer buffer, int mode);
  extern bool ConditionalLockBuffer(Buffer buffer);
  extern void LockBufferForCleanup(Buffer buffer);
+ extern bool BufferSuperLockHeldByMe(Buffer buffer);

  extern void AbortBufferIO(void);


Re: BTree vacuum before page splitting

From: Tom Lane
Junji TERAMOTO <teramoto.junji@lab.ntt.co.jp> writes:
> This patch adds a function to remove unnecessary items before split
> page of BTree.

> When a new item is put in the page, it looks for the "LP_DELETE" item,
> and removes that item.

I think this is quite likely to break things :-(.  What sort of
conditions have you tested it under?  (If this were safe, we'd
not have invented the LP_DELETE flag to begin with, but just have
deleted known-dead items immediately.)

BTW, a patch that makes major changes in the behavior of a function
and doesn't update the code comments is unacceptable.  The comments
are as important as the code, if not more so, because other people
are going to have to read this after you.

            regards, tom lane

Re: BTree vacuum before page splitting

From: Junji TERAMOTO
Tom Lane wrote:
> I think this is quite likely to break things :-(.  What sort of
> conditions have you tested it under?  (If this were safe, we'd
> not have invented the LP_DELETE flag to begin with, but just have
> deleted known-dead items immediately.)

I apologize for my insufficient description (and bad English).
Let me explain the operation of this patch in more detail.

In _bt_insertonpg(), we insert the tuple as follows.

 (1) Find the right place (page) to insert the tuple.
 (2) If free space is insufficient, we operate as follows.
+ (2.1) Confirm in BufferSuperLockHeldByMe() [new] whether the lock on
        the found page can be considered a super-exclusive lock. We
        check bufHdr->refcount (the pin count); if the pin count is 1,
        we know our lock is effectively super-exclusive.
        If our lock is not super-exclusive, go to (2.4).
+ (2.2) If our lock is super-exclusive and the found page is a leaf
        page, we look for tuples marked LP_DELETE on that page and
        remove them in _bt_vacuum_page() [new].
+ (2.3) If free space is sufficient after the removal, go to (4).
  (2.4) Step right to the next non-dead page.
  (2.5) Repeat (2) until enough free space is found or the right edge
        is reached.
 (3) If processing (2) finds no page with enough free space, split the
     target page (making sure that the split is equitable as far as
     post-insert free space goes).
 (4) Insert the tuple.

Steps (2.1) through (2.3) are new; a condensed sketch of the modified
loop follows.
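
For reference, here is the same logic as a condensed, non-diff sketch.
It only restates the _bt_insertonpg() hunk from the patch above, using
the same names; it is a summary, not standalone compilable code:

----
    while (PageGetFreeSpace(page) < itemsz)
    {
        /* (2.1)-(2.3): on a leaf page with dead items, and with our
         * pin the only one, remove LP_DELETE items instead of
         * splitting */
        if (P_ISLEAF(lpageop) && P_HAS_DEAD(lpageop) &&
            !rel->rd_istemp && BufferSuperLockHeldByMe(buf))
        {
            _bt_vacuum_page(rel, buf);
            if (PageGetFreeSpace(page) >= itemsz)
                break;          /* enough space now; insert here (4) */
        }

        /* stop and split (3) unless we are allowed to move right */
        if (P_RIGHTMOST(lpageop) ||
            _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
            random() <= (MAX_RANDOM_VALUE / 100))
            break;

        /* (2.4)-(2.5): step right to the next non-dead page
         * (unchanged from the original code) */
        ...
    }
----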

The shifting of line pointers caused by removing LP_DELETE tuples, as
described above, becomes a problem when we resume a stopped index scan.
Therefore we confirm the state of the lock in (2.1) and delete only
under a super-exclusive lock, the same as btbulkdelete().

However, because our own index scan might still lose its position due
to this removal, we also modify _bt_restscan() as follows.

 (1) Check for the index tuple we stopped on.
 (2) If the index tuple has moved, first search rightward within this
     page.
+(3) If it is not found, search again from the head of this page,
     because the index tuple may have moved left due to the removal.
 (4) If it is still not found, look for the index tuple on the pages
     to the right.

Step (3) is new; a condensed sketch follows this list.
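
Again condensed from the _bt_restscan() hunk in the patch above (same
names; a summary, not standalone code):

----
    /* (2): items used to be able to move only right, so scan right
     * within the page for our remembered heap TID */
    for (; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
    {
        item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
        if (BTTidSame(item->bti_itup.t_tid, *target))
        {
            current->ip_posid = offnum;     /* found it */
            return;
        }
    }

    /* (3) [new]: removing LP_DELETE tuples can move items *left*, so
     * search once more from the first data key up to just before our
     * old position, then fall through to the existing move-right
     * logic (4) if the tuple is still not found */
    maxoff = ItemPointerGetOffsetNumber(current) - 1;
    offnum = P_FIRSTDATAKEY(opaque);
----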

We do not delete the page even when it becomes empty, so a restarted
scan never needs to trace to the left page.

I think that no problem occurs with this logic. Please point it out
if I have missed something. Thanks.

(Thanks to step (3) in _bt_restscan(), we might not even need to
obtain the super-exclusive lock, but doing that would need care.)


> BTW, a patch that makes major changes in the behavior of a function
> and doesn't update the code comments is unacceptable.  The comments
> are as important as the code, if not more so, because other people
> are going to have to read this after you.

I'm sorry. I will expand the comments and submit the patch again.

--
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp


Re: BTree vacuum before page splitting

From: Simon Riggs
On Tue, 2006-01-31 at 15:18 +0900, Junji TERAMOTO wrote:
> [...]
> I think that no problem occurs with this logic. Please point it out
> if I have missed something. Thanks.

Please read the comments in nbtree.c, lines 557-566, which describe
the required locking protocol for btree pages.

Just because you hold the lock does *not* mean you are allowed to
delete tuples marked as deleted. This patch doesn't work in all cases,
and working in all cases is what is required; the failure cases don't
raise an ERROR - they just silently miss data - which is unacceptable.
So your patch works 99.9% of the time, but sometimes gives wrong
answers to queries without you knowing. So: patch rejected, but good
thinking all the same.

You will need to argue for a change in the locking protocol before it
could be accepted. That would speed up VACUUM also, so if it were
possible it would be gratefully accepted.

The only help I can give is this: why does an index scan ever want to
stop on a deleted tuple? Currently, the scan only remembers one tuple
it has seen. If the tuple was dead but wasn't yet marked as dead, the
index scan will have read it and stopped there. Some while later, the
index scan will return and begin scanning right for that tuple. If you
delete it, the index scan *may* not be able to work out where to
restart -- ERROR. So, if you can work out a way of not restarting an
index scan on a deleted tuple (and yet still marking tuples as deleted
when it sees them) then you may be able to solve the problem.
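
To restate the hazard concretely (an illustrative timeline only, not
taken from the patch or from this thread's text):

----
    /* scan:   reads the tuple at (leaf page P, offset N), pauses there
     * scan:   later marks that item LP_DELETE (known dead)
     * other:  needs space on P, removes LP_DELETE items; offsets shift
     * scan:   resumes and searches from offset N for its remembered
     *         heap TID, which may no longer be findable from there
     *
     * If a scan never pauses on an item it has already marked
     * LP_DELETE, the removed items can never include its restart
     * anchor. */
----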

I'm looking at the same problem domain also (table bloat), but focusing
on a different approach that avoids this issue.

Best Regards, Simon Riggs


Re: BTree vacuum before page splitting

From: Junji TERAMOTO
Thanks!

I had misunderstood the content of those comments. I think I
understand the problem now.

I may be able to find a solution for restarting an index scan on a
deleted tuple. I will try to revise the patch.

Simon Riggs wrote:
> Please read the comments in nbtree.c, lines 557-566, which describe
> the required locking protocol for btree pages.
> [...]
> So, if you can work out a way of not restarting an index scan on a
> deleted tuple (and yet still marking tuples as deleted when it sees
> them) then you may be able to solve the problem.


--
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp