Work from commit 5b861baa (later backpatched as commit 43e409ce)
taught nbtree to press on with vacuuming an index when page deletion
fails to "re-find" a downlink in the target page's parent (or in some
page to the right of the parent) due to index corruption.
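
For anyone who wants the general shape of that behavior, here is a minimal toy sketch in Python. It is not nbtree code: the names InternalPage, refind_downlink and vacuum_delete_pages are hypothetical, and the model assumes a parent level that is just a chain of pages holding child block numbers. It only illustrates "search the parent and the pages to its right, and skip the target page with a log message rather than raising when the downlink cannot be found".

```python
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

log = logging.getLogger("toy_vacuum")  # hypothetical logger name


@dataclass
class InternalPage:
    """Toy parent-level page: child block numbers plus a right-sibling link."""
    downlinks: List[int]
    right: Optional["InternalPage"] = None


def refind_downlink(parent: InternalPage, child: int) -> Optional[InternalPage]:
    """Look for a downlink to `child` in `parent`, then in pages to its right."""
    page = parent
    while page is not None:
        if child in page.downlinks:
            return page
        page = page.right
    return None  # not found: in this toy model, treated as index corruption


def vacuum_delete_pages(
    candidates: List[Tuple[int, InternalPage]],
) -> Tuple[List[int], List[int]]:
    """Try to delete each (child, parent) candidate; never raise on a miss."""
    deleted, skipped = [], []
    for child, parent in candidates:
        page = refind_downlink(parent, child)
        if page is None:
            # Report the problem and press on with the remaining pages.
            log.warning("failed to re-find downlink for block %d; skipping", child)
            skipped.append(child)
            continue
        page.downlinks.remove(child)
        deleted.append(child)
    return deleted, skipped
```

The design point is simply that a failed lookup ends work on one page, not on the whole VACUUM operation.
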
To recap, avoiding ERRORs during vacuuming (even those caused by index
corruption) is useful because there is no reason to expect the error
to go away on its own; we're relying on the DBA to notice the error
and REINDEX before wraparound/xidStopLimit kicks in. This is at least
the case on versions before 14; on 14 and later the failsafe can
eventually kick in and avoid catastrophe (though the failsafe can only
be expected to avoid the worst consequences).
It has come to my attention that there is a remaining issue of the
same general nature in nbtree VACUUM's page deletion code. Though this
remaining issue seems significantly less likely to come up in
practice, there is no reason to take any chances here. The attached
patch fixes it.
Also attached is a bugfix for a minor issue in amcheck's
bt_index_parent_check() function, which I noticed in passing, while I
tested the first patch. We assumed that we'd always land on the
leftmost page on each level first (the leftmost according to internal
pages one level up). That assumption is faulty because page deletion
of the leftmost page is quite possible. Page deletion can be
interrupted, leaving a half-dead leaf page (possibly the leftmost leaf
page) without any downlink one level up, while still leaving a left
sibling link on the leaf level (in the leaf page that is about to
become the leftmost, but won't until the interrupted page deletion can
be completed).
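
The scenario can be modeled with a small toy sketch in Python. This is not amcheck's code and not necessarily how the patch is written: LeafPage, strict_first_page_check and tolerant_first_page_check are hypothetical names, and the model assumes each leaf only records its left sibling's block number and a half-dead flag. It shows why a strict "first page reached has no left link" check gives a false positive after an interrupted deletion, and one possible way a check could tolerate that state.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LeafPage:
    """Toy leaf page: block number, left-sibling link, half-dead flag."""
    blkno: int
    left: Optional[int] = None  # None means "no left sibling"
    half_dead: bool = False


def strict_first_page_check(pages: Dict[int, LeafPage], landed_on: int) -> bool:
    """The faulty assumption: the first page reached has no left sibling."""
    return pages[landed_on].left is None


def tolerant_first_page_check(pages: Dict[int, LeafPage], landed_on: int) -> bool:
    """Accept the page if every page to its left is half-dead (toy rule)."""
    page = pages[landed_on]
    while page.left is not None:
        page = pages[page.left]
        if not page.half_dead:
            return False  # a live page to the left: genuinely inconsistent
    return True


# Block 1 is a half-dead leftmost leaf whose downlink is already gone, so a
# descent through the first downlink one level up lands on block 2.
pages = {
    1: LeafPage(1, left=None, half_dead=True),
    2: LeafPage(2, left=1),
    3: LeafPage(3, left=2),
}
```

In this model, landing on block 2 fails the strict check but passes the tolerant one, while landing on block 3 fails both.
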
IMV this should be backpatched all the way. The issue in question is
rather unlikely to come up. But the fix that I've come up with is very
well targeted. It seems just about impossible for it to affect any
user that didn't already have a serious problem (without the fix).
--
Peter Geoghegan