Re: Brain dump: btree collapsing - Mailing list pgsql-hackers
From | Curtis Faith |
---|---|
Subject | Re: Brain dump: btree collapsing |
Date | |
Msg-id | 000a01c2d452$fd3e4ac0$a200a8c0@curtislaptop Whole thread Raw |
In response to | Re: Brain dump: btree collapsing (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Brain dump: btree collapsing
Re: Brain dump: btree collapsing |
List | pgsql-hackers |
tom lane wrote: > How does that help? The left-moving indexscan still has no > way to recover. It can't go back to the page it was on > before and try to determine which entries you added there, > because it has no reliable reference point to do so. The > entry it previously returned might not be there anymore, and > in a non-unique index key comparisons won't help. And even > more to the point, how would it know you've changed the left > page? It has no idea what the previous "page version" on the > left page was, because it was never there before. Revisiting the idea I proposed in a previous email after more research, I believe it is definitely possible to implement a mutex based scheme that will prevent scans from being polluted by merges of index pages and that does not result in the mutex being held for any significant duration. 1) Current scan code does this: bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) {... definitions go here... if (ScanDirectionIsForward(dir)){ if (!PageIsEmpty(page) && offnum < maxoff) offnum = OffsetNumberNext(offnum); else { /* walk right to the next page with data */ for (;;) { /* if we're at end of scan, release the buffer and return */ ... skip code here... /* step right one page */ blkno = opaque->btpo_next; _bt_relbuf(rel, *bufP); *bufP = _bt_getbuf(rel, blkno, BT_READ); ... skip rest of code... 3) Note that the calls _bt_relbuf(rel, *bufP); *bufP = _bt_getbuf(rel, blkno, BT_READ); a) appear adjacent to each other b) relbuf calls: LockBuffer(buf, BUFFER_LOCK_UNLOCK); ReleaseBuffer(buf); c) getbuf only calls: buf = ReadBuffer(rel, blkno);LockBuffer(buf, access); in the case of an existing buffer, rather than a new one. 4) This could easily be reordered into: buf = ReadBuffer(rel, blkno); /* pin next page */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* release lock on current page */LockBuffer(buf, BT_READ); /* lock next page */ ReleaseBuffer(buf); /* now release pin on previously current page */ without affecting the logic of the code or causing any deadlock problems since the release still occurs before the lock of the next page. 5) A mutex/spinlock that was stored in the index could be acquired by the scan code like this: buf = ReadBuffer(rel, blkno); /* pin next page */ SpinLockAcquire( indexSpecificMutex ); /* lock the index reorg mutex */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* release lock on current page */LockBuffer(buf, BT_READ); /* lock next page */ SpinLockRelease( indexSpecificMutex ); /* unlock the index reorg mutex */ ReleaseBuffer(buf); /* now release pin on previously current page */ 6) The same index specific mutex/spinlock could be used for the merge code surrounding only the acquisition of the four page locks. This would obviate any problems with scans and page merges, since the lock acquisition for the merge could never occur while a scan was between pages. Further, with the reordering, the spinlock for the scan code doesn't seem particularly onerous since it would be surrounding only two LWLock calls. To reduce the overhead to an absolute minimum for the scan case these could be pushed down into a new IW call (probably necessary since the LockBuffer, ReleaseBuffer code does some error checking and such that one wouldn't want in code guarded by a mutex. - Curtis
pgsql-hackers by date: