Re: B-tree parent pointer and checkpoints - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: B-tree parent pointer and checkpoints |
Date | |
Msg-id | 4CDD9397.8040600@enterprisedb.com Whole thread Raw |
In response to | Re: B-tree parent pointer and checkpoints (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: B-tree parent pointer and checkpoints
|
List | pgsql-hackers |
On 11.11.2010 20:34, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> Hmm, we don't currently keep track of that when we descend the tree to >> choose the target page, but perhaps an extra Consistent call to check >> that would be acceptable. We already call Penalty for every tuple on >> each internal node on the way, so compared to that one more call should >> not be too expensive. If we do that, I think it would simplify the >> algorithm quite a bit to just update all the parents on the way down, >> instead of traversing up from the bottom after inserting the tuple to >> the leaf. > > Oh, that's a really good idea, I think. But what about page splits? > I guess in the case of a split, you'd be replacing the parent entry > anyway, so having previously updated it to something larger doesn't > really cause a problem other than wasting a few cycles --- which are > probably still less than you save by not having to traverse back up. I started looking at this, and run into a problem with page splits. The right-links in GiST work differently from b-tree, a right-link is only followed if we detect a concurrent page split. A concurrent split is detected by comparing the "NSN" field on the child page against the LSN that we saw on the parent when walking down. It means that if you just leave the incompletely split page in the tree, where one half is missing the parent pointer, scans will not find any tuples on that page. They would at first, but as soon as the the parent page is updated due to some unrelated insertion, the LSN of the parent is bumped above the NSN stored on the child, and the page becomes invisible to scanners. We avoid that problem during normal operation by keeping the parent page locked while the child is split, until the downlink is inserted into the parent. That blocks any other modifications to the parent page that would bump the LSN, until our downlink has been inserted. That doesn't work after crash recovery, as all the locks are released. I think we can work around that with a small modification to the page split algorithm. In a nutshell, when the child page is split, put a flag on the left half indicating that the rightlink must always be followed, regardless of the NSN. When the downlink is inserted to the parent, clear the flag. Setting and clearing of these flags need to be performed during WAL replay as well. So to split a page: (0. Lock the page to be split) 1. Split the page. Mark the rightlink in the left half with a flag indicating that it always needs to be followed. 2. Lock the parent. 3. Insert downlink. (The parent may need to be split too) 4. Clear the flag in the child, and update NSN to the LSN of the downlink insertion record. 5. Release child. 6. If the parent was split in step 3, goto 2. If we crash between steps 1 and 3, the rightlink will have the flag, so scans will know to always follow it. If we crash after step 3, recovery will replay steps 3 and 4, so scans will see the downlinks as usual. After a crash, the tree can be fixed up later by vacuum or subsequent inserts, by performing steps 2-4. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: