Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Race condition in b-tree page deletion
Date
Msg-id 527E677C.8040404@vmware.com
In response to Re: Race condition in b-tree page deletion  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race condition in b-tree page deletion  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 09.11.2013 18:24, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> 2. The second-simplest solution I see is to keep locked the whole chain
>> of pages that will be deleted, and delete all of them as one atomic
>> WAL-logged operation. Ie. the leaf page, and all the parent pages above
>> it that will become half-dead, and the parent of the last half-dead page.
>
> This would be more tenable if we could put a known limit on the number of
> pages to be changed at once.  I'm not too awake at the moment, so maybe
> this is not possible, but could we simply decide in advance that we won't
> propagate page deletion up more than a-small-number of levels?  It'd
> require allowing a page to remain half-dead until some other vacuum comes
> along and fixes it, but that seems OK.

I don't feel comfortable leaving pages in half-dead state. Looking back 
at the archives, your original design ten years ago did exactly that 
(http://www.postgresql.org/message-id/8281.1045089764@sss.pgh.pa.us), 
but that turned out to be a bad idea 
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=70ce5c908202ada7616f7afded8a91bbf2742471). 
Mind you, even if we check that the half-dead page at some upper level 
is eventually deletable because it's not the rightmost child, it might 
become the rightmost child once we no longer hold the lock, in which 
case the next vacuum cannot delete it, and we're back to square one.

We could just punt if more than X pages would need to be changed. That 
would mean that we never delete pages at the top (h - X) levels of the 
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once 
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge. 
As another data point, in the worst case the keys are so wide that only 
2 keys fit on each B-tree page. With that assumption, the tree can be at 
most 32 levels deep if you just insert into it with no deletions, since 
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not 
sure). Deletions make it more complicated, but I would be pretty 
surprised if you can construct a B-tree taller than, say, 40 levels.
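
The depth estimate above can be checked with a short calculation. This
is only a rough sketch: it assumes a tree in which every internal page
has exactly two downlinks, approximates the block limit as 2^32, and
ignores the metapage and any effect of deletions. The function name
max_levels is made up for illustration and is not anything in the
PostgreSQL source.

```python
def max_levels(max_pages: int, fanout: int = 2) -> int:
    """Largest number of levels a full tree with the given fanout can
    have without using more than max_pages pages in total."""
    levels = 0
    pages_used = 0
    level_width = 1  # the root level holds a single page
    # Add one level at a time, from the root down, while it still fits.
    while pages_used + level_width <= max_pages:
        pages_used += level_width
        levels += 1
        level_width *= fanout
    return levels


# Assumption: roughly 2^32 addressable blocks, two keys per page.
print(max_levels(2**32, fanout=2))  # 32 under these assumptions
```

A wider fanout shrinks the bound quickly, which is why the two-keys-per-
page case is the worst one to consider.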

Things get tricky if shared_buffers is very small; with 
shared_buffers=16, you certainly can't hold more than 16 buffers pinned 
at once. (In fact, I think ginfast.c already has a problem with that...)
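
To make the "punt if more than X pages would need to be changed" idea
concrete, here is a toy model, not PostgreSQL code. The tree is
represented by two hypothetical dicts (parent and n_children), and the
names deletion_chain and can_delete are invented for this sketch. It
ignores the rightmost-child restriction and sibling links, and only
counts the leaf, the ancestors that would be emptied, and the parent
of the topmost emptied page. Capping the limit by the number of
buffers is my reading of the shared_buffers remark above.

```python
def deletion_chain(parent, n_children, leaf):
    """Pages that deleting `leaf` in one atomic step would have to lock."""
    chain = [leaf]
    page = leaf
    # Climb while removing `page` would leave its parent with no children.
    while parent.get(page) is not None and n_children[parent[page]] == 1:
        page = parent[page]
        chain.append(page)
    # The parent of the topmost emptied page loses a downlink, so it is
    # modified (and locked) as well.
    if parent.get(page) is not None:
        chain.append(parent[page])
    return chain


def can_delete(parent, n_children, leaf, max_pages, n_buffers):
    """Punt (return False) if the chain exceeds X or the buffer pool."""
    limit = min(max_pages, n_buffers)
    return len(deletion_chain(parent, n_children, leaf)) <= limit


# Toy tree: R -> {A, B}, A -> {A1}, A1 -> {L}
parent = {"L": "A1", "A1": "A", "A": "R", "B": "R", "R": None}
n_children = {"A1": 1, "A": 1, "R": 2}
print(deletion_chain(parent, n_children, "L"))
print(can_delete(parent, n_children, "L", max_pages=16, n_buffers=16))
print(can_delete(parent, n_children, "L", max_pages=16, n_buffers=3))
```

In this toy tree the chain for L has four pages, so the deletion goes
ahead with a limit of 16 but is skipped when only 3 buffers are
available.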

- Heikki


