Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Race condition in b-tree page deletion
Date:
Msg-id: 5283A2FC.6040606@vmware.com
In response to: Re: Race condition in b-tree page deletion (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 12.11.2013 17:13, Robert Haas wrote:
> On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> All that is left is to unlink all the lingering pages in the branch
>> we're deleting from their left and right siblings. This can be done at
>> any later time, and if we error out or crash for some reason, the next
>> vacuum that comes along can finish the job. This is done one level at a
>> time. Lock the leaf page, and the internal page the leaf page points to,
>> and the internal page's left and right siblings (in the right order, not
>> this order). Change the left and right siblings' right- and left-links,
>> mark the internal page as deleted, and update the pointer in the leaf
>> page to point to the child of the deleted internal page. Then recurse to
>> the child, until we reach the leaf level.
>>
>> This has the nice extra property that we don't need the
>> incomplete-action tracking in WAL recovery. I'd like to get rid of that
>> anyway.
>
> I am not very knowledgeable of this code, but at least with my limited
> knowledge I can't spot any problems with that approach. The only thing
> that seems a little unfortunate is that the next vacuum may need to
> finish cleaning things up; I thought at one point you were trying to get
> rid of the amvacuumcleanup stuff. But maybe I'm misremembering, and
> anyway it still seems like a step forward.

Ah no, it only requires the next vacuum to clean up if the first vacuum is interrupted for some reason. Normally, all of the steps are performed one after another in the same vacuum process, and half-dead pages will only be seen by backends that look at the affected pages at the same instant. But in case of an error or crash, the next vacuum will pick up where the previous one left off.

Here's a patch implementing this scheme. One noteworthy detail if you're familiar with the old scheme: in this patch, only leaf pages are marked as half-dead, never internal pages.
This is the opposite of the current arrangement, where only internal pages can be marked half-dead, never leaf pages. It also gives us a way to tell apart half-dead pages lingering after a pg_upgrade from an older version, and pages marked as half-dead by the new code. The incomplete-action mechanism should have ensured that the former don't exist, but in case something went wrong some time before the upgrade and there is such a page in the tree after all, the patch emits a LOG message about it, advising to reindex. Searches will work fine in any case, but insertions to unfortunate pages might cause page splits and out-of-order keys in the upper levels, if the pre-upgrade half-dead pages are not cleaned up by reindexing.

- Heikki
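The level-by-level unlink step described above can be sketched as a toy model in C. This is purely illustrative: the struct and function names are made up and don't match PostgreSQL's actual nbtree code, and the real implementation takes the page locks in the order described and WAL-logs each unlink as one atomic action.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a branch being deleted: each page has sibling links at its
 * own level, and the leaf keeps a pointer to the topmost not-yet-unlinked
 * internal page of the branch (NULL once the branch is fully unlinked). */
typedef struct Page
{
    bool         deleted;
    struct Page *left;      /* left sibling at this page's level */
    struct Page *right;     /* right sibling at this page's level */
    struct Page *child;     /* next page down in the branch */
} Page;

/* Unlink one internal page of the half-dead branch: splice it out of its
 * level's sibling chain, mark it deleted, and advance the leaf's pointer
 * one level down, so the caller can recurse toward the leaf level. */
void
unlink_one_level(Page *leaf)
{
    Page *target = leaf->child;

    if (target->left != NULL)
        target->left->right = target->right;
    if (target->right != NULL)
        target->right->left = target->left;

    target->deleted = true;
    leaf->child = target->child;
}
```

A caller would invoke `unlink_one_level` repeatedly until `leaf->child` is NULL; because each call is self-contained, an interrupted vacuum leaves a state the next vacuum can simply resume from, which is what makes the incomplete-action tracking unnecessary.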