On 06.09.2011 16:40, Robert Haas wrote:
> On Tue, Sep 6, 2011 at 6:21 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> The way it would work is that on page split the right page is flagged with
>> MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the
>> flag is cleared in the same critical section as the WAL record for the
>> insertion of the parent is written. Normally, a backend would never see the
>> flag set, because the locks on the split pages are not released until the
>> parent record is written and the flag cleared again. But if inserting the
>> downlink fails for any reason, the next inserter or vacuum that steps on the
>> page can finish the split by inserting the downlink.
>>
>> Unfortunately that means holding the locks on the split pages longer than we
>> do at the moment. Currently they are released as soon as the parent page is
>> locked; with this change they would need to be held until the WAL record of
>> the downlink insertion is done. B-tree is so heavily used that I'm a bit
>> hesitant to sacrifice any concurrency there, but I don't think it would be
>> noticeable in practice.
>
> Do you really need to hold the page locks for all that time, or could
> you cheat? Like... release the locks on the split pages but then go
> back and reacquire them to clear the flag...

Hmm, there are two issues with that:

1. While you're not holding the locks on the child pages, someone can
step onto the page, see that the MISSING_DOWNLINK flag is set, and
try to finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone
can start and finish a checkpoint after you've inserted the downlink,
and before you've cleared the flag. You end up in a scenario where the
flag is set, but the page in fact *does* have a downlink in the parent.

So, nope, we can't cheat.
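
To make the two issues concrete, here is a toy model of the race. It is
only a sketch: SplitState, CheckpointPoint and all the function names are
made up for illustration and are not nbtree code, and a "checkpoint" is
modelled as simply copying the in-memory state to disk at a chosen moment.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not nbtree code. */
typedef struct
{
    bool missing_downlink;   /* flag on the right half of the split */
    bool downlink_in_parent; /* parent really points at the right half */
} SplitState;

typedef enum
{
    CKPT_BEFORE_INSERT,
    CKPT_BETWEEN_STEPS,
    CKPT_AFTER_CLEAR
} CheckpointPoint;

/* The flag is trustworthy only if it is set exactly when the downlink is missing. */
static bool
flag_is_accurate(const SplitState *s)
{
    return s->missing_downlink == !s->downlink_in_parent;
}

/* Issue 1: would the next inserter or vacuum try to finish the split? */
static bool
someone_else_would_finish(const SplitState *s)
{
    return s->missing_downlink;
}

/*
 * "Cheating" variant: the downlink insertion and the flag clearing are
 * separate steps, so a checkpoint can fall between them. Returns what
 * would be on disk if we crashed right after that checkpoint.
 */
static SplitState
cheat_finish_split(CheckpointPoint when)
{
    SplitState mem = {true, false};  /* state right after the split */
    SplitState disk = mem;           /* what CKPT_BEFORE_INSERT sees */

    mem.downlink_in_parent = true;   /* step 1: insert the downlink */
    if (when == CKPT_BETWEEN_STEPS)
        disk = mem;                  /* issue 2: flag set, downlink present */

    mem.missing_downlink = false;    /* step 2: clear the flag later */
    if (when == CKPT_AFTER_CLEAR)
        disk = mem;

    return disk;
}

/*
 * Locked variant: both changes happen in one step, so a checkpoint can
 * only observe the state before it or after it.
 */
static SplitState
locked_finish_split(bool checkpoint_after)
{
    SplitState mem = {true, false};
    SplitState disk = mem;

    mem.downlink_in_parent = true;
    mem.missing_downlink = false;
    assert(flag_is_accurate(&mem));

    if (checkpoint_after)
        disk = mem;

    return disk;
}
```

In this model, cheat_finish_split(CKPT_BETWEEN_STEPS) returns a state with
the flag set and the downlink present, which is exactly the inconsistent
case described above, and it is also the state another backend would see
in the unlocked window. Neither outcome of locked_finish_split() can
produce that combination.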

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com