Re: B-tree parent pointer and checkpoints - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: B-tree parent pointer and checkpoints
Date
Msg-id 4CDD9397.8040600@enterprisedb.com
Whole thread Raw
In response to Re: B-tree parent pointer and checkpoints  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: B-tree parent pointer and checkpoints
List pgsql-hackers
On 11.11.2010 20:34, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Hmm, we don't currently keep track of that when we descend the tree to
>> choose the target page, but perhaps an extra Consistent call to check
>> that would be acceptable. We already call Penalty for every tuple on
>> each internal node on the way, so compared to that one more call should
>> not be too expensive. If we do that, I think it would simplify the
>> algorithm quite a bit to just update all the parents on the way down,
>> instead of traversing up from the bottom after inserting the tuple to
>> the leaf.
>
> Oh, that's a really good idea, I think.  But what about page splits?
> I guess in the case of a split, you'd be replacing the parent entry
> anyway, so having previously updated it to something larger doesn't
> really cause a problem other than wasting a few cycles --- which are
> probably still less than you save by not having to traverse back up.

I started looking at this, and run into a problem with page splits. The 
right-links in GiST work differently from b-tree, a right-link is only 
followed if we detect a concurrent page split. A concurrent split is 
detected by comparing the "NSN" field on the child page against the LSN 
that we saw on the parent when walking down. It means that if you just 
leave the incompletely split page in the tree, where one half is missing 
the parent pointer, scans will not find any tuples on that page. They 
would at first, but as soon as the the parent page is updated due to 
some unrelated insertion, the LSN of the parent is bumped above the NSN 
stored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent page 
locked while the child is split, until the downlink is inserted into the 
parent. That blocks any other modifications to the parent page that 
would bump the LSN, until our downlink has been inserted. That doesn't 
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page 
split algorithm. In a nutshell, when the child page is split, put a flag 
on the left half indicating that the rightlink must always be followed, 
regardless of the NSN. When the downlink is inserted to the parent, 
clear the flag. Setting and clearing of these flags need to be performed 
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag 
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the 
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so 
scans will know to always follow it. If we crash after step 3, recovery 
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent 
inserts, by performing steps 2-4.


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: WIP: extensible enums
Next
From: Dimitri Fontaine
Date:
Subject: Re: wCTE behaviour