Home > mailing lists

Re: Brain dump: btree collapsing - Mailing list pgsql-hackers

From	Curtis Faith
Subject	Re: Brain dump: btree collapsing
Date	February 14, 2003 13:00:48
Msg-id	000a01c2d452$fd3e4ac0$a200a8c0@curtislaptop Whole thread Raw
In response to	Re: Brain dump: btree collapsing (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Brain dump: btree collapsing Re: Brain dump: btree collapsing
List	pgsql-hackers

Tree view

tom lane wrote:
> How does that help?  The left-moving indexscan still has no 
> way to recover.  It can't go back to the page it was on 
> before and try to determine which entries you added there, 
> because it has no reliable reference point to do so.  The 
> entry it previously returned might not be there anymore, and 
> in a non-unique index key comparisons won't help. And even 
> more to the point, how would it know you've changed the left 
> page?  It has no idea what the previous "page version" on the 
> left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research,
I believe it is definitely possible to implement a mutex based scheme
that will prevent scans from being polluted by merges of index pages and
that does not result in the mutex being held for any significant
duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{... definitions go here...
if (ScanDirectionIsForward(dir)){    if (!PageIsEmpty(page) && offnum < maxoff)        offnum =
OffsetNumberNext(offnum);   else    {        /* walk right to the next page with data */        for (;;)        {
    /* if we're at end of scan, release the

buffer and return */            ... skip code here...
            /* step right one page */            blkno = opaque->btpo_next;            _bt_relbuf(rel, *bufP);
 *bufP = _bt_getbuf(rel, blkno, BT_READ);

    ... skip rest of code...

3) Note that the calls
   _bt_relbuf(rel, *bufP);   *bufP = _bt_getbuf(rel, blkno, BT_READ);
  a) appear adjacent to each other  b) relbuf calls:
     LockBuffer(buf, BUFFER_LOCK_UNLOCK);     ReleaseBuffer(buf);
  c) getbuf only calls:
     buf = ReadBuffer(rel, blkno);LockBuffer(buf, access);      in the case of an existing buffer, rather than a new
one.

4) This could easily be reordered into:
     buf = ReadBuffer(rel, blkno);            /* pin next page
*/     LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release lock on
current page */LockBuffer(buf, BT_READ);                 /* lock next page */     ReleaseBuffer(buf);
   /* now release pin on

previously current page */
  without affecting the logic of the code or causing any deadlock
problems since the release still occurs before the lock of the next
page.

5) A mutex/spinlock that was stored in the index could be acquired by
the scan code like this:
     buf = ReadBuffer(rel, blkno);            /* pin next page
*/
SpinLockAcquire( indexSpecificMutex );    /* lock the index
reorg mutex */
     LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release lock on
current page */LockBuffer(buf, BT_READ);                 /* lock next page */
SpinLockRelease( indexSpecificMutex );    /* unlock the index
reorg mutex */     ReleaseBuffer(buf);                       /* now release pin on
previously current page */

6) The same index specific mutex/spinlock could be used for the merge
code surrounding only the acquisition of the four page locks. This would
obviate any problems with scans and page merges, since the lock
acquisition for the merge could never occur while a scan was between
pages.

Further, with the reordering, the spinlock for the scan code doesn't
seem particularly onerous since it would be surrounding only two LWLock
calls.  To reduce the overhead to an absolute minimum for the scan case
these could be pushed down into a new IW call (probably necessary since
the LockBuffer, ReleaseBuffer code does some error checking and such
that one wouldn't want in code guarded by a mutex.

- Curtis

pgsql-hackers by date:

From: Anastassia Ailamaki
Date: 14 February 2003, 12:42:44
Subject: Re: Berkeley and CMU classes adopt/extend PostgreSQL

From: Christopher Browne
Date: 14 February 2003, 13:09:14
Subject: Re: Incremental backup

Re: Brain dump: btree collapsing - Mailing list pgsql-hackers

Previous

Next