On Thu, Jul 31, 2025 at 5:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Jul 31, 2025 at 5:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Cool, will you do the legwork?
>
> I'll give it a go.
Update on my progress:

I find that the test query takes ~95ms on Postgres 16. We're getting
"Inner Unique: true" there, but if I replace zsf_pkey with a
non-unique index, the execution time goes up only slightly, to ~100ms.
Ideally, we'll be able to get Postgres 17+ to about parity with that.

I find that my WIP patch (which does the required sort in the planner
in the obvious way) brings the runtime down, from ~1500ms to ~165ms.
Obviously, this is a massive improvement -- but it's a bit
disappointing that we can't do better still.
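
To make "the obvious way" a bit more concrete, the plan-time sort
boils down to something like the sketch below. To be clear, this isn't
the patch itself: the helper names are mine, and it assumes that the
array's elements were already deconstructed into a Datum array and
that a SortSupport was set up from the opclass's ORDER proc.

#include "postgres.h"

#include "utils/sortsupport.h"

static int
saop_element_cmp(const void *a, const void *b, void *arg)
{
    SortSupport ssup = (SortSupport) arg;

    /* compare two array element Datums using the opclass ordering */
    return ApplySortComparator(*(const Datum *) a, false,
                               *(const Datum *) b, false, ssup);
}

static void
sort_saop_elements(Datum *elems, int nelems, SortSupport ssup)
{
    /* sort once, at plan time, so that btrescan doesn't have to */
    qsort_arg(elems, nelems, sizeof(Datum), saop_element_cmp, ssup);
}
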

Getting closer to ~100ms seems intuitively achievable to me, since the
index scan we're getting on 17 isn't supposed to be a whole lot
different from the one we get on 16 (even though we're using a
different index) -- all of the other executor nodes in the plan are
pretty much the same on each version. So why the remaining shortfall?

One problem remains here: we're still doing more work than one would
hope at the start of btrescan, during array preprocessing. We're able
to skip the sort, of course, but just building a simple Datum array
via a call to deconstruct_array() is enough of a remaining bottleneck
to matter. Ideally, we'd do *all* of the work once for each SAOP
array, in the planner (assuming a Const argument). In order to do
that, we'd have to make the planner/executor pass nbtree an array that
has the exact structure that it works with at runtime: a raw Datum
array.
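
For reference, the per-rescan flattening step that remains looks
roughly like this (a simplified sketch, not a quote of the real
_bt_preprocess_array_keys() code, and it glosses over NULL element
handling):

#include "postgres.h"

#include "utils/array.h"
#include "utils/lsyscache.h"

/*
 * Simplified sketch: flatten a SAOP's ArrayType argument into a raw
 * Datum array at the start of each (re)scan.  This is the allocation
 * and copying that would ideally happen just once, in the planner.
 */
static Datum *
flatten_saop_array(ArrayType *arrayval, int *num_elems)
{
    int16       elmlen;
    bool        elmbyval;
    char        elmalign;
    Datum      *elem_values;
    bool       *elem_nulls;

    get_typlenbyvalalign(ARR_ELEMTYPE(arrayval),
                         &elmlen, &elmbyval, &elmalign);
    deconstruct_array(arrayval, ARR_ELEMTYPE(arrayval),
                      elmlen, elmbyval, elmalign,
                      &elem_values, &elem_nulls, num_elems);

    /* the real code also has to deal with NULL array elements here */
    pfree(elem_nulls);

    return elem_values;
}
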
I find that once I make the planner and executor pass a raw Datum
array, we're much closer to my soft performance target: the query
runtime goes down to ~135ms. This isn't perfect, but it's much closer
to the theoretical ideal that I have in mind. We're still doing extra
work in the 17 index scan, compared to the one in 16, but I can't feel
too bad about that; looking up a separate ORDER proc for the
lower-order column isn't free, and being prepared to use a SAOP array
necessitates a little more memory allocation for preprocessing (we
make _bt_preprocess_keys use a partially-preprocessed copy of the
original input keys as its input keys when there's an SAOP array). So
~135ms is roughly in line with what I expect.
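
(For anyone following along: the "ORDER proc" lookup I'm referring to
is just the standard support function 1 lookup for the relevant index
column, roughly as below, leaving cross-type cases aside.)

#include "postgres.h"

#include "access/genam.h"
#include "access/nbtree.h"

/*
 * Rough illustration (not the patch): fetch the btree ORDER proc
 * (support function 1) for a given index column, so that keys on that
 * column can be compared against SAOP array elements during the scan.
 */
static FmgrInfo *
lookup_order_proc(Relation indexrel, AttrNumber indexcol)
{
    return index_getprocinfo(indexrel, indexcol, BTORDER_PROC);
}
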

The problem with this more ambitious approach is that it is also much
more invasive. It bleeds into things like the plan cache.
EXPLAIN/ruleutils.c would need its own built-in way to show a qual
that uses this alternative "raw datum array" representation, which I
haven't bothered to add. I doubt that that complexity will pay for
itself.

My inclination is to pursue the simpler approach, and just accept the
remaining performance shortfall. This is a rare enough case that I
think that'll be acceptable. But input on how to make this trade-off
would be helpful.

--
Peter Geoghegan