On Fri, Sep 22, 2023 at 8:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My suspicion is that bugfix commit 70bc5833 missed some subtlety
> around what we need to do to make sure that the array keys stay "in
> sync" with the scan. I'll have time to debug the problem some more
> tomorrow.
I've figured out what's going on here.
If I make my test case "group by" both of the indexed columns from the
composite index (either index/table will do, since it's an equijoin),
a more detailed picture emerges that hints at the underlying problem:
┌───────┬─────────┬─────────┐
│ count │ small_a │ small_b │
├───────┼─────────┼─────────┤
│ 8,192 │       1 │       2 │
│ 8,192 │       1 │       3 │
│ 8,192 │       1 │       5 │
│ 8,192 │       1 │      10 │
│ 8,192 │       1 │      12 │
│ 8,192 │       1 │      17 │
│ 2,872 │       1 │      19 │
└───────┴─────────┴─────────┘
(7 rows)
The count for the final row is wrong. It should be 8,192, just like
the earlier counts for lower (small_a, small_b) groups. Notably, the
issue is limited to the grouping that has the highest sort order. That
strongly hints that the problem has something to do with "array
wraparound".
The query qual contains "WHERE small_a IN (1, 3)", so we'll "wrap
around" from cur_elem index 1 (value 3) to cur_elem index 0 (value 1),
without encountering any rows where small_a is 3 (because there aren't
any in the index). That in itself isn't the problem. The problem is
that _bt_restore_array_keys() doesn't consider wraparound. It sees
that "cur_elem == mark_elem" for all array scan keys, and figures that
it doesn't need to call _bt_preprocess_keys(). This is incorrect,
since the current set of search-type scan keys (the set most recently
output, during the last _bt_preprocess_keys() call) still have the
value "3".
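
To make the sequence of events concrete, here is a toy model of the
failure. It is only a sketch: ToyArrayKey and the toy_* functions are
hypothetical stand-ins that I made up for illustration, not the real
nbtree structs or functions, and "search_value" stands in for the
search-type scan key output by the last _bt_preprocess_keys() call.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one array scan key (hypothetical, not BTArrayKeyInfo) */
typedef struct ToyArrayKey
{
	int		elems[2];		/* IN-list values, e.g. {1, 3} */
	int		num_elems;
	int		cur_elem;		/* current offset into elems[] */
	int		mark_elem;		/* offset saved when the position was marked */
	int		search_value;	/* value in the most recently output search key */
} ToyArrayKey;

/* Stand-in for _bt_preprocess_keys(): regenerate the search key */
static void
toy_preprocess(ToyArrayKey *k)
{
	k->search_value = k->elems[k->cur_elem];
}

/*
 * Advance to the next array element.  On wraparound, cur_elem goes back
 * to 0 and we return false, but the search key is NOT regenerated.
 */
static bool
toy_advance(ToyArrayKey *k)
{
	assert(k->num_elems > 0);
	if (++k->cur_elem >= k->num_elems)
	{
		k->cur_elem = 0;
		return false;
	}
	toy_preprocess(k);
	return true;
}

/* Restore that skips preprocessing whenever cur_elem == mark_elem */
static void
toy_restore_buggy(ToyArrayKey *k)
{
	bool	changed = (k->cur_elem != k->mark_elem);

	k->cur_elem = k->mark_elem;
	if (changed)
		toy_preprocess(k);
}

/* Replay: mark on value 1, advance to 3, wrap around, then restore */
static int
toy_stale_value_after_restore(void)
{
	ToyArrayKey k = {{1, 3}, 2, 0, 0, 0};

	toy_preprocess(&k);			/* search key holds 1 */
	k.mark_elem = k.cur_elem;	/* mark taken at offset 0 */
	(void) toy_advance(&k);		/* offset 1, search key holds 3 */
	(void) toy_advance(&k);		/* wraps to offset 0, key still holds 3 */
	toy_restore_buggy(&k);		/* sees 0 == 0, skips preprocessing */
	return k.search_value;
}
```

In this model toy_stale_value_after_restore() returns 3 rather than 1,
even though cur_elem and mark_elem both point at the element with
value 1 after the restore.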
The fix for this should be fairly straightforward. We must teach
_bt_restore_array_keys() to distinguish "past the end of the array"
from "after the start of the array", so that it doesn't spuriously skip a
required call to _bt_preprocess_keys(). I already see that the
problem goes away once _bt_restore_array_keys() is made to call
_bt_preprocess_keys() unconditionally, so I'm already fairly confident
that this will work.
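
One possible shape for that, again as a toy model rather than actual
nbtree code: remember that the array wrapped, and treat that as a
reason to regenerate the search keys on restore. SketchKey, the
sketch_* functions, and the "wrapped" flag are all hypothetical names
used only for illustration. I'm not claiming the real fix will look
like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one array scan key (hypothetical, not BTArrayKeyInfo) */
typedef struct SketchKey
{
	int		elems[2];		/* IN-list values, e.g. {1, 3} */
	int		num_elems;
	int		cur_elem;		/* current offset into elems[] */
	int		mark_elem;		/* offset saved when the position was marked */
	bool	wrapped;		/* hypothetical: set when we ran off the end */
	int		search_value;	/* value in the most recently output search key */
} SketchKey;

/* Stand-in for _bt_preprocess_keys(): regenerate the search key */
static void
sketch_preprocess(SketchKey *k)
{
	k->search_value = k->elems[k->cur_elem];
}

/* Advance to the next element, recording wraparound when it happens */
static bool
sketch_advance(SketchKey *k)
{
	assert(k->num_elems > 0);
	if (++k->cur_elem >= k->num_elems)
	{
		k->cur_elem = 0;
		k->wrapped = true;	/* "past the end", not merely "at offset 0" */
		return false;
	}
	sketch_preprocess(k);
	return true;
}

/* Restore that also regenerates the search key after a wraparound */
static void
sketch_restore(SketchKey *k)
{
	bool	changed = k->wrapped || (k->cur_elem != k->mark_elem);

	k->cur_elem = k->mark_elem;
	k->wrapped = false;
	if (changed)
		sketch_preprocess(k);
}

/* Same replay as the failing case: mark on 1, advance to 3, wrap, restore */
static int
sketch_value_after_restore(void)
{
	SketchKey k = {{1, 3}, 2, 0, 0, false, 0};

	sketch_preprocess(&k);		/* search key holds 1 */
	k.mark_elem = k.cur_elem;	/* mark taken at offset 0 */
	(void) sketch_advance(&k);	/* offset 1, search key holds 3 */
	(void) sketch_advance(&k);	/* wraps to offset 0, sets wrapped */
	sketch_restore(&k);			/* wrapped forces preprocessing */
	return k.search_value;
}
```

Here sketch_value_after_restore() returns 1, so the search key matches
the restored array position again. Calling the preprocessing step
unconditionally in sketch_restore() would give the same result in this
model, at the cost of some redundant work.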
--
Peter Geoghegan