The return value of _bt_bottomupdel_pass() is advisory; it reports
"failure" for a deletion pass that was just suboptimal (it rarely
reports failure because it couldn't delete anything at all). Bottom-up
deletion preemptively falls back on a version-orientated deduplication
pass when it didn't quite delete as many items as it hoped to delete.
This policy avoids thrashing, particularly with low cardinality
indexes, where the number of distinct TIDs per tableam/heapam block
tends to drive when and how TIDs get deleted.
I have noticed an unintended and undesirable interaction between this
new behavior and an older deduplication behavior: it's possible for
_bt_bottomupdel_pass() to return false to trigger a preemptive
version-orientated deduplication pass that ends up using
deduplication's "single value" strategy. This is just contradictory on
its face: a version-orientated deduplication pass tries to prevent a
version-driven page split altogether, whereas a single value strategy
deduplication pass is specifically supposed to set things up for an
imminent page split (a split that uses nbtsplitloc.c's single value
strategy). Clearly we shouldn't prepare for a page split and try to
avoid a page split at the same time!
The practical consequence of this oversight is that leaf pages full of
duplicates (all duplicates of the same single value) are currently
much more likely to have a version-driven page split (from non-HOT
updates) than similar pages that have two or three distinct key
values. Another undesirable consequence is that we'll waste cycles in
affected cases; any future bottom-up index deletion passes will waste
time on the tuples that the intervening deduplication pass
deliberately declined to merge together (as any single value dedup
pass will). In other words, the heuristics described in comments above
_bt_bottomupdel_finish_pending() can become confused by this
misbehavior (after an initial round of deletion + deduplication, in a
later round of deletion). This interaction is clearly a bug. It's easy
to avoid.
Attached patch fixes the bug by slightly narrowing the conditions
under which we'll consider if we should apply deduplication's single
value strategy. We were already not even considering it with a unique
index, where it was always clear that this is only a
version-orientated deduplication pass. It seems natural to also check
whether or not we just had a "failed" call to _bt_bottomupdel_pass()
-- this is a logical extension of what we do already. Barring
objections, I will apply this patch (and backpatch to Postgres 14) in
a few days.
--
Peter Geoghegan