Re: Nasty btree deletion bug - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Nasty btree deletion bug |
Date | |
Msg-id | 6392.1161821141@sss.pgh.pa.us Whole thread Raw |
In response to | Nasty btree deletion bug (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Nasty btree deletion bug
|
List | pgsql-hackers |
I wrote: > I've been analyzing Ed L's recent report of index corruption: > http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php After some thought, it still seems to me to be impractical to delete the rightmost child of a btree page while other children remain. Doing this would require either moving the child's keyspace assignment left to its left sibling, or moving that keyspace right to the first child of the parent's right sibling, and either of these choices means having to adjust page "high keys" in a way that's not certain to work. (For instance, moving the keyspace left means assigning the victim page's high key to its left sibling, and there might not be room in the left sibling page if the desired key is longer than what's there. In the other case the same problem of having to replace a key with a potentially longer one exists, but it manifests at the grandparent level.) So I think the rule needs to be "don't delete the rightmost child unless it's the only child, in which case you can delete the parent too --- but the same restriction must be observed at the next level up". I believe we can handle this by doing a precheck before starting a delete of a level-zero page: scan up the stack to check that the condition is satisfied for each level that needs to be deleted. The tricky part is that we don't want to exclusive-lock all those pages at once (we could do it without deadlock risk, but the concurrency hit seems bad). I think though that we can assume that if the condition is checked it will remain true until we finish the delete: 1. A page that isn't rightmost child can't become so while we're not looking, because that would require a delete to occur, and only VACUUM does index page deletes, and we already disallow two concurrent VACUUMs on the same index. 2. A page that is an only child could acquire a sibling only if it's split, but that would imply an insert (in fact multiple inserts) into the original level-zero page. We'll recheck emptiness of the level-zero page after we acquire write lock on it to begin the actual deletion, at which point it's still safe to abandon the delete attempt. The recent patch to allow ordinary non-vacuum processes to delete index entries retail makes #2 a little trickier than meets the eye. One could imagine a scenario where between the times VACUUM leaves the level-zero page and reacquires lock on it, enough entries were inserted to split the page and then they all got deleted again by that patch. However that patch in its present form cannot leave a page in a completely empty state, because it's only invoked as part of an insertion attempt. (If it did manage to delete all the existing entries, then the same process would insert a new entry onto the same page before unlocking it.) So I think it's OK, but we'll need to be wary about any proposals to remove index entries in other contexts. The concept of a half-dead page would remain, but it'd be a transient state that would normally only persist for a moment between atomic page-delete actions. If we crash between two such actions, the half-dead page would remain present, but would be found and cleaned up by the next VACUUM. In the meantime it wouldn't cause any problem because the keyspace it gives up will belong to a sibling of the same parent at whatever level the delete is ultimately supposed to stop at, and so inserts and even splits in that keyspace won't create an inconsistency. Alternatively, we could have WAL crash recovery complete the multilevel deletion using the same sort of remember-pending-actions logic we use now to handle splits. Comments? Have I missed anything? regards, tom lane
pgsql-hackers by date: