Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Race condition in b-tree page deletion
Date
Msg-id CA+TgmoZQRJyhwMc_y-2rugbOL7LpTp=3+yW2KaXvTkOJ653C1w@mail.gmail.com
In response to Re: Race condition in b-tree page deletion  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Race condition in b-tree page deletion  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Sat, Nov 9, 2013 at 12:40 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 09.11.2013 19:18, Heikki Linnakangas wrote:
>> On 09.11.2013 18:49, Heikki Linnakangas wrote:
>>> We could just punt if more than X pages would need to be changed. That
>>> would mean that we never delete pages at the top (h - X) levels of the
>>> tree. In practice that should be fine if X is high enough.
>>> As a data point, GIN list page deletion holds 16 pages locked at once
>>> (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
>>> As another data point, in the worst case the keys are so wide that only
>>> 2 keys fit on each B-tree page. With that assumption, the tree can be at
>>> most 32 levels deep if you just insert into it with no deletions, since
>>> MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
>>> sure). Deletions make it more complicated, but I would be pretty
>>> surprised if you can construct a B-tree taller than, say, 40 levels.
>>
>> On further thought, it's worse than that. To delete a page, you need to
>> lock the left and right siblings, so you need 3 pages locked per each
>> level you delete...
>
> On further further thought, we don't need to unlink the pages immediately.
> It's enough to mark them as half-dead and remove the downlink to the upmost
> half-dead page. Marking a page as half-dead is as good as deleting it
> outright as far as searches and insertions are concerned. Actually unlinking
> the pages from the left and right siblings can be done at any later time,
> and doesn't need to be done in any particular order.
>
> So, my original musings about the number of concurrent locks needed still
> hold.
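
If I'm reading the proposal right, the ordering is roughly the
following (a sketch only -- the helper names are hypothetical, not
actual nbtree functions, and buffer locking and WAL logging are
omitted):

    /* Hypothetical helpers, named here only for illustration. */
    static BlockNumber mark_half_dead(Relation rel, BlockNumber blkno);
    static void remove_downlink(Relation rel, BlockNumber child_blkno);

    static void
    delete_branch_sketch(Relation rel, BlockNumber top_blkno)
    {
        BlockNumber blkno = top_blkno;

        /*
         * Phase 1: mark the topmost page of the doomed branch, and every
         * page below it, as half-dead.  Per the reasoning above, a
         * half-dead page is as good as deleted as far as searches and
         * insertions are concerned.
         */
        while (blkno != InvalidBlockNumber)
            blkno = mark_half_dead(rel, blkno); /* only child, or InvalidBlockNumber at the leaf */

        /*
         * Phase 2: remove the single downlink pointing to the topmost
         * half-dead page, making the whole branch unreachable from above.
         */
        remove_downlink(rel, top_blkno);

        /*
         * Phase 3 (deferred): unlink each half-dead page from its left
         * and right siblings, one page at a time, in any order, at any
         * later time.  Each step needs only a bounded number of pages
         * locked.
         */
    }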

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principle.  I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.
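
For concreteness, a back-of-the-envelope version of the numbers under
discussion, assuming the 2-keys-per-page worst case from upthread (the
constants here are mine, not anything the code actually does):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        double  max_pages = 4294967296.0;   /* MaxBlockNumber + 1 ~= 2^32 */
        int     max_depth = (int) ceil(log2(max_pages));

        printf("worst-case tree depth: %d levels\n", max_depth);  /* ~32 */
        printf("locks at 1 per level:  %d\n", max_depth);         /* N   */
        printf("locks at 3 per level:  %d\n", 3 * max_depth);     /* 3N  */
        return 0;
    }

Either way, that's nowhere near a small constant.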

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


