Home > mailing lists

Re: B-tree parent pointer and checkpoints - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: B-tree parent pointer and checkpoints
Date	November 12, 2010 15:21:09
Msg-id	4CDD9397.8040600@enterprisedb.com Whole thread Raw
In response to	Re: B-tree parent pointer and checkpoints (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: B-tree parent pointer and checkpoints
List	pgsql-hackers

Tree view

On 11.11.2010 20:34, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Hmm, we don't currently keep track of that when we descend the tree to
>> choose the target page, but perhaps an extra Consistent call to check
>> that would be acceptable. We already call Penalty for every tuple on
>> each internal node on the way, so compared to that one more call should
>> not be too expensive. If we do that, I think it would simplify the
>> algorithm quite a bit to just update all the parents on the way down,
>> instead of traversing up from the bottom after inserting the tuple to
>> the leaf.
>
> Oh, that's a really good idea, I think.  But what about page splits?
> I guess in the case of a split, you'd be replacing the parent entry
> anyway, so having previously updated it to something larger doesn't
> really cause a problem other than wasting a few cycles --- which are
> probably still less than you save by not having to traverse back up.

I started looking at this, and run into a problem with page splits. The 
right-links in GiST work differently from b-tree, a right-link is only 
followed if we detect a concurrent page split. A concurrent split is 
detected by comparing the "NSN" field on the child page against the LSN 
that we saw on the parent when walking down. It means that if you just 
leave the incompletely split page in the tree, where one half is missing 
the parent pointer, scans will not find any tuples on that page. They 
would at first, but as soon as the the parent page is updated due to 
some unrelated insertion, the LSN of the parent is bumped above the NSN 
stored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent page 
locked while the child is split, until the downlink is inserted into the 
parent. That blocks any other modifications to the parent page that 
would bump the LSN, until our downlink has been inserted. That doesn't 
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page 
split algorithm. In a nutshell, when the child page is split, put a flag 
on the left half indicating that the rightlink must always be followed, 
regardless of the NSN. When the downlink is inserted to the parent, 
clear the flag. Setting and clearing of these flags need to be performed 
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag 
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the 
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so 
scans will know to always follow it. If we crash after step 3, recovery 
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent 
inserts, by performing steps 2-4.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

pgsql-hackers by date:

From: Tom Lane
Date: 12 November 2010, 15:20:32
Subject: Re: WIP: extensible enums

From: Dimitri Fontaine
Date: 12 November 2010, 16:14:26
Subject: Re: wCTE behaviour

Re: B-tree parent pointer and checkpoints - Mailing list pgsql-hackers

Previous

Next