On Mon, 2008-10-20 at 16:11 +0100, Simon Riggs wrote:
> On Mon, 2008-10-20 at 18:24 +0400, Teodor Sigaev wrote:
> > > 3. Implement an extra indexAM API call that allows indexAM to decide
> > > when/if index is valid during recovery. This would also cover the second
> > > concern neatly in a single API call.
> > >
> > > wait until after deadline to implement (2) or (3), in case somebody
> > > fixes this up in the next few weeks.
> > >
> >
> > IMHO, Without locking of pages in recovery mode Btree and GIN are not usable
> > while incomplete split exists - there is a nonconnected branch in tree.
>
> Now
> you may be right, that we do not have the correct locking for
> concurrency in the splitting algorithms. But I haven't seen any issue
> yet myself. But I will look again.
OK, I think I've found a problem.
In _bt_insertonpg(), if we split we do _bt_split() then do
_bt_insert_parent(), which then does _bt_insertonpg() recursively.
_bt_split() writes a WAL record but continues holding write locks.
btree_xlog_split() reads WAL record and does *not* continue to hold
write locks. So recovery locking differs from Lehman & Yao requirements
at that point.
So to make this concurrent safe, we must keep holding the locks in
btree_xlog_split() and record details of that as part of the the
log_incomplete_split process. That means skipping UnlockReleaseBuffer()
at the end of btree_xlog_split(). Those buffers would remain pinned and
locked.
If btree_xlog_insert() is called on a non-leaf node we will then get the
details of blocks still holding write locks from the incomplete split
data. Those locks could be held awhile if the incomplete split process
is screwed, but it would make this concurrent safe.
I guess we'd use the same technique for GIN. ginInsertValue() ??
Hmm, you release the lock at line 412, ginbtree.c before you get the
parent lock at line 428. That seems different to the L&Y interactions.
Am I looking in the wrong place?
-- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support