Thread: nbtree's ScalarArrayOp array mark/restore code appears to be buggy
Attached test case demonstrates an issue with nbtree's mark/restore code. All supported versions are affected.

My suspicion is that bugfix commit 70bc5833 missed some subtlety around what we need to do to make sure that the array keys stay "in sync" with the scan. I'll have time to debug the problem some more tomorrow.

My ScalarArrayOp project [1] seems unaffected by the bug, so I don't expect it'll take long to get to the bottom of this. This is probably due to its general design, and not any specific detail. The patch makes the relationship between the current scan position and the current array keys a great deal looser.

[1] https://commitfest.postgresql.org/44/4455/

-- 
Peter Geoghegan
Attachment
On Fri, Sep 22, 2023 at 8:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My suspicion is that bugfix commit 70bc5833 missed some subtlety
> around what we need to do to make sure that the array keys stay "in
> sync" with the scan. I'll have time to debug the problem some more
> tomorrow.

I've figured out what's going on here. If I make my test case "group by" both of the indexed columns from the composite index (either index/table will do, since it's an equijoin), a more detailed picture emerges that hints at the underlying problem:

┌───────┬─────────┬─────────┐
│ count │ small_a │ small_b │
├───────┼─────────┼─────────┤
│ 8,192 │       1 │       2 │
│ 8,192 │       1 │       3 │
│ 8,192 │       1 │       5 │
│ 8,192 │       1 │      10 │
│ 8,192 │       1 │      12 │
│ 8,192 │       1 │      17 │
│ 2,872 │       1 │      19 │
└───────┴─────────┴─────────┘
(7 rows)

The count for the final row is wrong. It should be 8,192, just like the earlier counts for lower (small_a, small_b) groups. Notably, the issue is limited to the grouping that has the highest sort order. That strongly hints that the problem has something to do with "array wraparound". The query qual contains "WHERE small_a IN (1, 3)", so we'll "wrap around" from cur_elem index 1 (value 3) to cur_elem index 0 (value 1), without encountering any rows where small_a is 3 (because there aren't any in the index).

That in itself isn't the problem. The problem is that _bt_restore_array_keys() doesn't consider wraparound. It sees that "cur_elem == mark_elem" for all array scan keys, and figures that it doesn't need to call _bt_preprocess_keys(). This is incorrect, since the current set of search-type scan keys (the set most recently output, during the last _bt_preprocess_keys() call) still has the value "3".

The fix for this should be fairly straightforward. We must teach _bt_restore_array_keys() to distinguish "past the end of the array" from "after the start of the array", so that it doesn't spuriously skip a required call to _bt_preprocess_keys().
I already see that the problem goes away once _bt_restore_array_keys() is made to call _bt_preprocess_keys() unconditionally, so I'm fairly confident that this will work.

-- 
Peter Geoghegan
On Sat, Sep 23, 2023 at 11:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The fix for this should be fairly straightforward. We must teach
> _bt_restore_array_keys() to distinguish "past the end of the array"
> from "after the start of the array", so that it doesn't spuriously skip
> a required call to _bt_preprocess_keys(). I already see that the
> problem goes away once _bt_restore_array_keys() is made to call
> _bt_preprocess_keys() unconditionally, so I'm fairly confident
> that this will work.

Attached draft patch shows how this could work.

_bt_restore_array_keys() has comments that seem to suppose that calling _bt_preprocess_keys is fairly expensive, and something that's well worth avoiding. But... is it, really? I wonder if we'd be better off just biting the bullet and always calling _bt_preprocess_keys here. Could it really be such a big cost, compared to pinning and locking the page in the btrestpos() path?

The current draft SAOP patch calls _bt_preprocess_keys() with a buffer lock held. This is very likely unsafe, so I'll need to come up with a principled approach (e.g. a stripped-down, specialized version of _bt_preprocess_keys that's safe to call while holding a lock seems doable). I've been able to put that off for a few months now because it just doesn't impact my microbenchmarks to any notable degree (and not just in those cases where we can use the "numberOfKeys == 1" fast path at the start of _bt_preprocess_keys). This at least suggests that the cost of always calling _bt_preprocess_keys in _bt_restore_array_keys might be acceptable.

-- 
Peter Geoghegan
Attachment
On Sat, Sep 23, 2023 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached draft patch shows how this could work.
>
> _bt_restore_array_keys() has comments that seem to suppose that
> calling _bt_preprocess_keys is fairly expensive, and something that's
> well worth avoiding. But... is it, really? I wonder if we'd be better
> off just biting the bullet and always calling _bt_preprocess_keys
> here.

My current plan is to commit something close to my original patch on Wednesday or Thursday. My proposed fix is minimally invasive (it still allows _bt_restore_array_keys to avoid most calls to _bt_preprocess_keys), so I don't see any reason to delay acting here.

-- 
Peter Geoghegan