Re: Brain dump: btree collapsing - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Brain dump: btree collapsing
Date
Msg-id 000a01c2d452$fd3e4ac0$a200a8c0@curtislaptop
Whole thread Raw
In response to Re: Brain dump: btree collapsing  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Brain dump: btree collapsing
Re: Brain dump: btree collapsing
List pgsql-hackers
tom lane wrote:
> How does that help?  The left-moving indexscan still has no 
> way to recover.  It can't go back to the page it was on 
> before and try to determine which entries you added there, 
> because it has no reliable reference point to do so.  The 
> entry it previously returned might not be there anymore, and 
> in a non-unique index key comparisons won't help. And even 
> more to the point, how would it know you've changed the left 
> page?  It has no idea what the previous "page version" on the 
> left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research,
I believe it is definitely possible to implement a mutex based scheme
that will prevent scans from being polluted by merges of index pages and
that does not result in the mutex being held for any significant
duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{... definitions go here...
if (ScanDirectionIsForward(dir)){    if (!PageIsEmpty(page) && offnum < maxoff)        offnum =
OffsetNumberNext(offnum);   else    {        /* walk right to the next page with data */        for (;;)        {
    /* if we're at end of scan, release the
 
buffer and return */            ... skip code here...
            /* step right one page */            blkno = opaque->btpo_next;            _bt_relbuf(rel, *bufP);
 *bufP = _bt_getbuf(rel, blkno, BT_READ);
 
    ... skip rest of code...

3) Note that the calls
   _bt_relbuf(rel, *bufP);   *bufP = _bt_getbuf(rel, blkno, BT_READ);
  a) appear adjacent to each other  b) relbuf calls:
     LockBuffer(buf, BUFFER_LOCK_UNLOCK);     ReleaseBuffer(buf);
  c) getbuf only calls:
     buf = ReadBuffer(rel, blkno);LockBuffer(buf, access);      in the case of an existing buffer, rather than a new
one.

4) This could easily be reordered into:
     buf = ReadBuffer(rel, blkno);            /* pin next page
*/     LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release lock on
current page */LockBuffer(buf, BT_READ);                 /* lock next page */     ReleaseBuffer(buf);
   /* now release pin on
 
previously current page */
  without affecting the logic of the code or causing any deadlock
problems since the release still occurs before the lock of the next
page.

5) A mutex/spinlock that was stored in the index could be acquired by
the scan code like this:
     buf = ReadBuffer(rel, blkno);            /* pin next page
*/
SpinLockAcquire( indexSpecificMutex );    /* lock the index
reorg mutex */
     LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release lock on
current page */LockBuffer(buf, BT_READ);                 /* lock next page */
SpinLockRelease( indexSpecificMutex );    /* unlock the index
reorg mutex */     ReleaseBuffer(buf);                       /* now release pin on
previously current page */

6) The same index specific mutex/spinlock could be used for the merge
code surrounding only the acquisition of the four page locks. This would
obviate any problems with scans and page merges, since the lock
acquisition for the merge could never occur while a scan was between
pages.

Further, with the reordering, the spinlock for the scan code doesn't
seem particularly onerous since it would be surrounding only two LWLock
calls.  To reduce the overhead to an absolute minimum for the scan case
these could be pushed down into a new IW call (probably necessary since
the LockBuffer, ReleaseBuffer code does some error checking and such
that one wouldn't want in code guarded by a mutex.

- Curtis




pgsql-hackers by date:

Previous
From: Anastassia Ailamaki
Date:
Subject: Re: Berkeley and CMU classes adopt/extend PostgreSQL
Next
From: Christopher Browne
Date:
Subject: Re: Incremental backup