Re: Nasty btree deletion bug - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Nasty btree deletion bug
Date
Msg-id 6392.1161821141@sss.pgh.pa.us
Whole thread Raw
In response to Nasty btree deletion bug  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Nasty btree deletion bug
List pgsql-hackers
I wrote:
> I've been analyzing Ed L's recent report of index corruption:
> http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php

After some thought, it still seems to me to be impractical to delete the
rightmost child of a btree page while other children remain.  Doing this
would require either moving the child's keyspace assignment left to its
left sibling, or moving that keyspace right to the first child of the
parent's right sibling, and either of these choices means having to
adjust page "high keys" in a way that's not certain to work.  (For
instance, moving the keyspace left means assigning the victim page's
high key to its left sibling, and there might not be room in the left
sibling page if the desired key is longer than what's there.  In the
other case the same problem of having to replace a key with a
potentially longer one exists, but it manifests at the grandparent
level.)

So I think the rule needs to be "don't delete the rightmost child unless
it's the only child, in which case you can delete the parent too --- but
the same restriction must be observed at the next level up".

I believe we can handle this by doing a precheck before starting a
delete of a level-zero page: scan up the stack to check that the
condition is satisfied for each level that needs to be deleted.
The tricky part is that we don't want to exclusive-lock all those
pages at once (we could do it without deadlock risk, but the concurrency
hit seems bad).  I think though that we can assume that if the condition
is checked it will remain true until we finish the delete:

1. A page that isn't rightmost child can't become so while we're not
looking, because that would require a delete to occur, and only VACUUM
does index page deletes, and we already disallow two concurrent VACUUMs
on the same index.

2. A page that is an only child could acquire a sibling only if it's
split, but that would imply an insert (in fact multiple inserts) into
the original level-zero page.  We'll recheck emptiness of the level-zero
page after we acquire write lock on it to begin the actual deletion,
at which point it's still safe to abandon the delete attempt.

The recent patch to allow ordinary non-vacuum processes to delete index
entries retail makes #2 a little trickier than meets the eye.  One could
imagine a scenario where between the times VACUUM leaves the level-zero
page and reacquires lock on it, enough entries were inserted to split
the page and then they all got deleted again by that patch.  However
that patch in its present form cannot leave a page in a completely
empty state, because it's only invoked as part of an insertion attempt.
(If it did manage to delete all the existing entries, then the same
process would insert a new entry onto the same page before unlocking
it.)  So I think it's OK, but we'll need to be wary about any proposals
to remove index entries in other contexts.

The concept of a half-dead page would remain, but it'd be a transient
state that would normally only persist for a moment between atomic
page-delete actions.  If we crash between two such actions, the
half-dead page would remain present, but would be found and cleaned up
by the next VACUUM.  In the meantime it wouldn't cause any problem
because the keyspace it gives up will belong to a sibling of the same
parent at whatever level the delete is ultimately supposed to stop at,
and so inserts and even splits in that keyspace won't create an
inconsistency.  Alternatively, we could have WAL crash recovery complete
the multilevel deletion using the same sort of remember-pending-actions
logic we use now to handle splits.

Comments?  Have I missed anything?
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Jim C. Nasby"
Date:
Subject: Re: [DOCS] Replication documentation addition
Next
From: Bruce Momjian
Date:
Subject: Re: [DOCS] Replication documentation addition