From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-WzkrKXZa4wHxJ4rrggWOn8g-Kdu75WGegRJEPxS68PK-hA@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers

On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Can you pull these planner changes into their own commit(s)?
> As mentioned upthread, it's a significant change in behavior that
> should have separate consideration and reference in the commit log. I
> really don't think it should be buried in the 5th paragraph of an
> "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
> changes of btree are arguably independent of the planner changes, as
> the btree changes improve performance even if we ignore that it
> implements strict result ordering.

I'm not going to break out the planner changes, because they're *not*
independent in any way. You could say the same thing about practically
any work that changes the planner. They're "buried" in the 5th
paragraph of the commit message. If an interested party can't even
read that far to gain some understanding of a legitimately complicated
piece of work such as this, I'm okay with that.

> An independent thought when reviewing the finaltup / HIKEY scan
> optimization parts of this patch:
> The 'use highkey to check for next page's data range' optimization is
> useful, but can currently only be used for scans that go to the right.
> Could we use a similar optimization in the inverse direction if we
> marked the right page of a split with "the left page has a HIKEY based
> on this page's (un)truncated leftmost value" or "left page's HIKEY was
> truncated to N key attributes"? It'd give one bit of information,
> specifically that it does (or does not) share some (or all) key
> attributes with this page's minimum value, which allows for earlier
> scan termination in boundary cases.

That would have to be maintained in all sorts of different places in
nbtree, and it could be broken at any time by somebody inserting a
non-pivot tuple before all existing ones on the leaf page. It doesn't
seem worth it to me.

If we were to do something like this then it would be discussed as
independent work. It's akin to adding a low key, which could be used
in several different places. As you say, it's totally independent.

> > +++ b/src/include/access/nbtree.h
> > +#define SK_BT_RDDNARRAY    0x00040000    /* redundant in array preprocessing */
>
> I think "made redundant" is more appropriate than just "redundant";
> the array key is not itself redundant in array preprocessing (indeed,
> we actually work hard on that key during preprocessing to allow us to
> mark it redundant).

Meh. I did it that way to fit under 78 chars while staying on the same
line. I don't think that it matters.

> I think this was mentioned before, but can't we have an argument or
> version of _bt_checkkeys that makes it not advance the array keys, so
> that we can keep this optimization? Or, isn't that now
> _bt_check_compare?

As I said to Heikki, thinking about this some more is on my todo list.
The way that this worked was substantially revised on HEAD right
before I posted v9; v9 itself only fixed the bit rot that those
revisions caused.

> For an orderlines table with 1000s of lines per order, a query like
> `order_id = ANY ('{1, 3, 5}')` would greatly benefit from handling
> scan keys more efficiently -- the optimization which is being
> disabled here.

The way that these optimizations might work with the mechanism from
the patch isn't some kind of natural extension of what's there
already. We'll need new heuristics to avoid wasting cycles. Applying
all of the optimizations together just isn't trivial, and it's not yet
clear what really makes sense; combining the two more or less adds
another dimension of complexity.

> > +++ b/src/backend/access/nbtree/nbtutils.c
> > _bt_preprocess_array_keys(IndexScanDesc scan)
> The _bt_preprocess_array_keys code is now broken due to type
> confusion, as it assumes there is only one array subtype being
> compared per attribute in so->orderProcs.

I've been aware of this for some time, but didn't think that it was
worth bringing up before I had a solution to present...

> I'm also concerned about the implications of this in
> _bt_binsrch_array_skey, as this too assumes the same compare operator
> is always used for all array operations on each attribute. We may
> need as many as one so->orderProcs entry for each array key, but at
> the very least one per sk_subtype of each array key.

...since right now it's convenient to make so->orderProcs have a 1:1
correspondence with index key columns....

> I also notice that the merging of values doesn't seem to be applied
> optimally with mixed typed array operations: num = int[] AND num =
> bigint[] AND num = int[] doesn't seem to merge the first and last
> array ops. I'm also concerned about being (un)able to merge

...which ought to work and be robust, once the cross-type support is
in place. That is, it should work once we really can assume that there
really must be exactly one so->orderProcs entry per equality-strategy
scan key in all cases -- including in highly obscure corner cases
involving a mix of cross-type comparisons that are also redundant.
(Though as I go into below, I can see at least 2 viable solutions to
the problem.)

> > +/*
> > + * _bt_merge_arrays() -- merge together duplicate array keys
> > + *
> > + * Both scan keys have array elements that have already been sorted and
> > + * deduplicated.
> > + */
>
> As I mentioned upthread, I find this function to be very wasteful, as
> it uses N binary searches to merge join two already sorted arrays,
> resulting in a O(n log(m)) complexity, whereas a normal merge join
> should be O(n + m) once the input datasets are sorted.

And as I mentioned upthread, I think that you're making a mountain out
of a molehill here. This is not a merge join. Even single-digit
thousands of array elements should be considered huge. Plus this code
path is only hit by poorly written queries.

> Please fix this, as it shows up in profiling of large array merges.
> Additionally, as it merges two arrays of unique items into one,
> storing only matching entries, I feel that it is quite wasteful to do
> this additional allocation here. Why not reuse the original allocation
> immediately?

Because that adds more complexity to a code path that will hardly ever
be used in practice, for something that happens only once, during a
preprocessing step.
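
For what it's worth, the O(n + m) merge is simple enough in the
abstract. Here's a minimal standalone sketch (plain C over int arrays,
ignoring the datum/comparator machinery that the real
_bt_merge_arrays() has to go through), which also shows the
reuse-the-original-allocation idea:

#include <stddef.h>
#include <stdio.h>

/*
 * Intersect two sorted, deduplicated arrays in O(n + m), writing the
 * matching elements back into arr1 so that no extra allocation is
 * needed.  Returns the number of matches.
 */
static size_t
merge_intersect(int *arr1, size_t n, const int *arr2, size_t m)
{
	size_t		i = 0,
				j = 0,
				nmatched = 0;

	while (i < n && j < m)
	{
		if (arr1[i] == arr2[j])
		{
			arr1[nmatched++] = arr1[i];
			i++;
			j++;
		}
		else if (arr1[i] < arr2[j])
			i++;
		else
			j++;
	}
	return nmatched;
}

int
main(void)
{
	int			a[] = {1, 3, 5, 7, 9};
	int			b[] = {3, 4, 5, 6, 7};
	size_t		nmatched = merge_intersect(a, 5, b, 5);

	for (size_t k = 0; k < nmatched; k++)
		printf("%d\n", a[k]);	/* prints 3, 5, 7 */
	return 0;
}

Since the intersection can never contain more elements than either
input, writing matches back into the first array is always safe.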

> > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> > +                             IndexTuple tuple, int sktrig, bool validtrig)
>
> I don't quite understand what the 'validtrig' argument is used for.
> There is an assertion that it is false under some conditions in this
> code, but it's not at all clear to me why that would have to be the
> case - it is called with `true` in one of the three call sites. Could
> the meaning of this be clarified?

Sure, I'll add a comment.

> I also feel that this patch includes several optimizations such as
> this sktrig argument which aren't easy to understand. Could you pull
> that into a separately reviewable patch?

It probably makes sense to break the extra preprocessing stuff out
into its own commit, since I tend to agree that it's an optimization
that can be treated as unrelated (and isn't essential to the main
thrust of the patch).

However, the sktrig thing isn't really like that. We need to do things
that way for required inequality scan keys, so it makes little sense
not to just do it for all required scan keys (both equality and
inequality strategy scan keys) right from the start.

> Additionally, could you try to create a single point of entry for the
> array key stuff that covers the new systems? I've been trying to wrap
> my head around this, and it's taking a lot of effort.

I don't understand what you mean here.

> > _bt_advance_array_keys
>
> Thinking about the implementation here:
> We require transitivity for btree opclasses, where A < B implies NOT A
> = B, etc. Does this also carry over into cross-type operators?

Yes, it carries over like that. The operators within a btree operator
family (including any cross-type operators) are required to sort
consistently with one another.

> Would it be valid to add support methods for truncatedint4 to an int4
> btree opclass, or is transitivity also required for all operations?
> i.e. all values that one operator class considers unique within an
> opfamily must be considered unique for all additional operators in the
> opfamily, or is that not required?
> If not, then that would pose problems for this patch, as the ordering
> of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY
> ('{0,65536}'::truncatedint4[]) could potentially skip results.

There are roughly two ways to deal with this sort of thing (which
sounds like a restatement of the issue shown by your test case?). They
are:

1. Add support for detecting redundant scan keys, even when cross-type
operators are involved. (Discussed earlier.)

2. Be prepared to have more than one scan key per index key column in
rare edge-cases. This means that I can no longer subscript
so->orderProcs using an attnum.

I intend to use whichever approach works out to be the simplest and
most maintainable. Note that there is usually nothing that stops me
from just keeping redundant scan keys when the redundancy can't be
eliminated via preprocessing -- just like today.

Actually... I might have to use a hybrid of 1 and 2, because we need
to be prepared for the possibility that it just isn't possible to
determine redundancy at all, due to a lack of suitable cross-type
operators. I don't think that it would be okay to just throw an error
there (that really would be a regression against Postgres 16). For
that reason I will most likely need to find a way for so->orderProcs
to be subscripted using something that maps to equality scan keys
(rather than having it map to a key column).
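
To make option 2 a bit more concrete, here's a toy standalone sketch
(the struct and its fields are invented for illustration; they're not
nbtree's real data structures). The idea is one ORDER proc per
equality-strategy scan key, subscripted by the scan key's own ordinal
position rather than by attnum, so that two arrays on the same column
with different element types each get a suitable comparator:

#include <stdio.h>

/*
 * Toy stand-in, for illustration only.  One ORDER proc per equality
 * scan key (not per index column) means that two array keys on the
 * same attribute can use different cross-type comparators.
 */
typedef struct ToyArrayKey
{
	int			sk_attno;	/* index key column */
	const char *sk_subtype;	/* array element type */
	const char *orderProc;	/* comparator suited to that subtype */
} ToyArrayKey;

int
main(void)
{
	/* e.g. num = ANY('{...}'::int4[]) AND num = ANY('{...}'::int8[]) */
	ToyArrayKey keys[] = {
		{1, "int4", "btint4cmp"},
		{1, "int8", "btint48cmp"},
	};

	for (int ikey = 0; ikey < 2; ikey++)
		printf("scan key %d: attno %d, subtype %s, ORDER proc %s\n",
			   ikey, keys[ikey].sk_attno, keys[ikey].sk_subtype,
			   keys[ikey].orderProc);
	return 0;
}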

> I'm also no fan of the (tail) recursion. I would agree that this is
> unlikely to consume a lot of stack, but it does consume stack space
> nonetheless, and I'd prefer it if it were not done this way.

I disagree. The amount of stack space it consumes in the worst case is fixed.

> I notice an assertion error here:
> > +            Assert(cur->sk_strategy != BTEqualStrategyNumber);
> > +            Assert(all_required_sk_satisfied);
> > +            Assert(!foundRequiredOppositeDirOnly);
> > +
> > +            foundRequiredOppositeDirOnly = true;
>
> This assertion can be hit with the following test case:
>
> CREATE TABLE test AS
> SELECT i a, i b, i c FROM generate_series(1, 1000) i;
> CREATE INDEX ON test(a, b, c); ANALYZE;
> SELECT count(*) FROM test
> WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1
> AND b = ANY ('{1,2,3}');

Will fix.

> > +_bt_update_keys_with_arraykeys(IndexScanDesc scan)
>
> I keep getting confused by the mixing of integer increments and
> pointer increments. Could you explain why in this code you chose to
> increment a pointer for "ScanKey cur", while using array indexing for
> other fields? It feels very arbitrary to me, and that makes the code
> difficult to follow.

Because in one case we follow the convention for scan keys, whereas
so->orderProcs is accessed via an attribute number subscript.
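
To illustrate the two conventions side by side, here's a toy
standalone example (the structs are simplified stand-ins, not the real
ScanKeyData/BTScanOpaqueData):

#include <stdio.h>

/* Toy stand-ins for the real nbtree structs, for illustration */
typedef struct ToyScanKey
{
	int			sk_attno;	/* 1-based index attribute number */
} ToyScanKey;

typedef struct ToyScanOpaque
{
	ToyScanKey	keyData[3];		/* one entry per scan key */
	int			numberOfKeys;
	const char *orderProcs[3];	/* one entry per index key column */
} ToyScanOpaque;

int
main(void)
{
	ToyScanOpaque so = {
		{{1}, {2}, {3}},
		3,
		{"btint4cmp", "btint4cmp", "btint4cmp"}
	};
	ToyScanKey *cur = so.keyData;

	/*
	 * Scan keys are walked with a pointer increment (the usual scan
	 * key convention), while orderProcs is subscripted by attnum - 1,
	 * because it has one entry per index key column, not per scan key.
	 */
	for (int ikey = 0; ikey < so.numberOfKeys; ikey++, cur++)
		printf("scan key %d: attno %d, ORDER proc %s\n",
			   ikey, cur->sk_attno, so.orderProcs[cur->sk_attno - 1]);
	return 0;
}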

> > +++ b/src/test/regress/sql/btree_index.sql
> > +-- Add tests to give coverage of various subtle issues.
> > +--
> > +-- XXX This may not be suitable for commit, due to taking up too many cycles.
> > +--
> > +-- Here we don't remember the scan's array keys before processing a page, only
> > +-- after processing a page (which is implicit, it's just the scan's current
> > +-- keys).  So when we move the scan backwards we think that the top-level scan
> > +-- should terminate, when in reality it should jump backwards to the leaf page
> > +-- that we last visited.
>
> I notice this adds a complex test case that outputs many rows. Can we
> do with fewer rows if we build the index after data insertion, and
> with a lower (non-default) fillfactor?

Probably not. It was actually very hard to come up with these test
cases, which tickle the implementation in just the right way to
demonstrate that the code in places like _bt_steppage() is actually
required. It took me a rather long time just to prove that much. I'm
not sure that we really need this, but I thought I'd include it for
the time being, just so that reviewers could understand those changes.

--
Peter Geoghegan


