Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id CAEze2WhVTv8KsNSmxcJxAV5pLsWsHhdafhor_viQh_O+wq3T2A@mail.gmail.com
In response to Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan  (Peter Geoghegan <pg@bowt.ie>)
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Can you pull these planner changes into their own commit(s)?
> > As mentioned upthread, it's a significant change in behavior that
> > should have separate consideration and reference in the commit log. I
> > really don't think it should be buried in the 5th paragraph of an
> > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
> > changes of btree are arguably independent of the planner changes, as
> > the btree changes improve performance even if we ignore that it
> > implements strict result ordering.
>
> I'm not going to break out the planner changes, because they're *not*
> independent in any way.

The planner changes depend on the btree changes, that I agree with.
However, I don't think that the btree changes depend on the planner
changes.

> You could say the same thing about practically
> any work that changes the planner. They're "buried" in the 5th
> paragraph of the commit message. If an interested party can't even
> read that far to gain some understanding of a legitimately complicated
> piece of work such as this, I'm okay with that.

I would agree with you if this was about new APIs and features, but
here existing APIs are being repurposed without changing them. A
maintainer of index AMs would not look at the commit title 'Enhance
nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
my existing amsearcharray+amcanorder handling actually supports
non-prefix array keys and returns data in index order".
There are also no changes in amapi.h that would signal any index AM
author that expectations have changed. I really don't think you can
just ignore all that, and I believe this to also be the basis of
Heikki's first comment.

> As I said to Heikki, thinking about this some more is on my todo list.
> I mean the way that this worked had substantial revisions on HEAD
> right before I posted v9. v9 was only to fix the bit rot that that
> caused.

Then I'll be waiting for that TODO item to be resolved.

> > > +++ b/src/backend/access/nbtree/nbtutils.c
> > > +/*
> > > + * _bt_merge_arrays() -- merge together duplicate array keys
> > > + *
> > > + * Both scan keys have array elements that have already been sorted and
> > > + * deduplicated.
> > > + */
> >
> > As I mentioned upthread, I find this function to be very wasteful, as
> > it uses N binary searches to merge join two already sorted arrays,
> > resulting in a O(n log(m)) complexity, whereas a normal merge join
> > should be O(n + m) once the input datasets are sorted.
>
> And as I mentioned upthread, I think that you're making a mountain out
> of a molehill here. This is not a merge join.

We're merging two sorted sets and want to retain only the matching
entries, while retaining the order of the data. AFAIK the best
algorithm available for this is a sort-merge join.
Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue.
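To illustrate the complexity argument: a linear-time intersection of two
sorted, deduplicated arrays needs only a single merge pass. This is a
sketch, not the patch's code; plain ints and C comparisons stand in for
Datums and opclass support function calls:

```c
#include <stddef.h>

/*
 * Intersect two sorted, deduplicated int arrays in O(n + m), writing
 * the matching elements (in order) into "out".  Returns the number of
 * matches.  Sketch only: real nbtree code must compare Datums via the
 * opfamily's support functions, not plain "<".
 */
static size_t
merge_intersect(const int *a, size_t n,
                const int *b, size_t m,
                int *out)
{
    size_t  i = 0,
            j = 0,
            k = 0;

    while (i < n && j < m)
    {
        if (a[i] < b[j])
            i++;                /* a's element can't match: skip it */
        else if (b[j] < a[i])
            j++;                /* b's element can't match: skip it */
        else
        {
            out[k++] = a[i];    /* match: keep it, advance both sides */
            i++;
            j++;
        }
    }
    return k;
}
```

Compare that with N binary searches, which costs O(n log m) and touches
the larger array's cache lines far less predictably.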

> Even single digit
> thousand of array elements should be considered huge. Plus this code
> path is only hit with poorly written queries.

Would you accept suggestions?

> > Please fix this, as it shows up in profiling of large array merges.
> > Additionally, as it merges two arrays of unique items into one,
> > storing only matching entries, I feel that it is quite wasteful to do
> > this additional allocation here. Why not reuse the original allocation
> > immediately?
>
> Because that's adding more complexity for a code path that will hardly
> ever be used in practice. For something that happens once, during a
> preprocessing step.

Or many times, when we're in a parameterized loop, as was also pointed
out by Heikki. While I do think it is rare, the existence of this path
that merges these arrays implies that such merges do happen, and thus
that they are worth doing efficiently.

> > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> > > +                             IndexTuple tuple, int sktrig, bool validtrig)
> >
> > I don't quite understand what the 'validtrig' argument is used for.
> > There is an assertion that it is false under some conditions in this
> > code, but it's not at all clear to me why that would have to be the
> > case - it is called with `true` in one of the three call sites. Could
> > the meaning of this be clarified?
>
> Sure, I'll add a comment.

Thanks!

> > I also feel that this patch includes several optimizations such as
> > this sktrig argument which aren't easy to understand. Could you pull
> > that into a separately reviewable patch?
>
> It probably makes sense to add the extra preprocessing stuff out into
> its own commit, since I tend to agree that that's an optimization that
> can be treated as unrelated (and isn't essential to the main thrust of
> the patch).
>
> However, the sktrig thing isn't really like that. We need to do things
> that way for required inequality scan keys. It doesn't make sense to
> not just do it for all required scan keys (both equality and
> inequality strategy scan keys) right from the start.

I understand that in some places the "triggered by scankey" concept is
required, but in other places the use of it is explicitly flagged as
an optimization (specifically, in _bt_advance_array_keys), which
confused me.

> > Additionally, could you try to create a single point of entry for the
> > array key stuff that covers the new systems? I've been trying to wrap
> > my head around this, and it's taking a lot of effort.
>
> I don't understand what you mean here.

The documentation (currently mostly code comments) is extensive, but
still spread around various inline comments and comments on functions;
with no place where a clear top-level design is detailed.
I'll agree that we don't have that for the general systems in
_bt_next() either, but especially with this single large patch it's
difficult to grasp the specific differences between the various
functions.

> > > _bt_advance_array_keys
> >
> > Thinking about the implementation here:
> > We require transitivity for btree opclasses, where A < B implies NOT A
> > = B, etc. Does this also carry over into cross-type operators?
>
> Yes, it carries like that.
>
> > Would it be valid to add support methods for truncatedint4 to an int4
> > btree opclass, or is transitivity also required for all operations?
> > i.e. all values that one operator class considers unique within an
> > opfamily must be considered unique for all additional operators in the
> > opfamily, or is that not required?
> > If not, then that would pose problems for this patch, as the ordering
> > of A = ANY ('{1, 2, 3}'::int4[]) AND A = ANY
> > ('{0,65536}'::truncatedint4[]) could potentially skip results.
>
> There are roughly two ways to deal with this sort of thing (which
> sounds like a restatement of the issue shown by your test case?).

It wasn't really meant as a restatement: It is always unsafe to ignore
upper bits, as the index isn't organized by that. However, it *could*
be safe to ignore the bits with lowest significance, as the index
would still be organized correctly even in that case, for int4 at
least. This is similar to how you can have required scankeys for the
prefix of an (int2, int2) index, but not for the suffix (as of right
now, at least).

The issue I was trying to show is that if you have a type that ignores
some part of the key for comparison like truncatedint4 (which
hypothetically would do ((a>>16) < (b>>16)) on int4 types), then this
might cause issues if that key was advanced before more precise
equality checks.

This won't ever be an issue as long as the requirement holds that if
a = b and b = c then a = c, for all triples of typed values and
operators in a btree opclass, as you mention above.


> They
> are:
>
> 1. Add support for detecting redundant scan keys, even when cross-type
> operators are involved. (Discussed earlier.)
>
> 2. Be prepared to have more than one scan key per index key column in
> rare edge-cases. This means that I can no longer subscript
> so->orderProcs using an attnum.

> Actually...I might have to use a hybrid of 1 and 2. Because we need to
> be prepared for the possibility that it just isn't possible to
> determine redundancy at all due to a lack of suitable cross-type
> operators -- I don't think that it would be okay to just throw an
> error there (that really would be a regression against Postgres 16).

Agreed.

> > I'm also no fan of the (tail) recursion. I would agree that this is
> > unlikely to consume a lot of stack, but it does consume stackspace
> > nonetheless, and I'd prefer if it was not done this way.
>
> I disagree. The amount of stack space it consumes in the worst case is fixed.

Is it fixed? Without digging very deep into it, it looks like it can
recurse on the order of n_scankeys deep. The comment above does hint
at 2 iterations, but it is not very clear about the conditions, or why.

> > I notice an assertion error here:
> > > +            Assert(cur->sk_strategy != BTEqualStrategyNumber);
> > > +            Assert(all_required_sk_satisfied);
> > > +            Assert(!foundRequiredOppositeDirOnly);
> > > +
> > > +            foundRequiredOppositeDirOnly = true;
> >
> > This assertion can be hit with the following test case:
> >
> > CREATE TABLE test AS
> > SELECT i a, i b, i c FROM generate_series(1, 1000) i;
> > CREATE INDEX ON test(a, b, c); ANALYZE;
> > SELECT count(*) FROM test
> > WHERE a = ANY ('{1,2,3}') AND b > 1 AND c > 1
> > AND b = ANY ('{1,2,3}');
>
> Will fix.
>
> > > +_bt_update_keys_with_arraykeys(IndexScanDesc scan)
> >
> > I keep getting confused by the mixing of integer increments and
> > pointer increments. Could you explain why in this code you chose to
> > increment a pointer for "ScanKey cur", while using array indexing for
> > other fields? It feels very arbitrary to me, and that makes the code
> > difficult to follow.
>
> Because in one case we follow the convention for scan keys, whereas
> so->orderProcs is accessed via an attribute number subscript.

Okay, but how about this in _bt_merge_arrays?

+        Datum       *elem = elems_orig + i;

I'm not familiar with the scan key convention, as most other places
use reference+subscripting.

> > > +++ b/src/test/regress/sql/btree_index.sql
> > > +-- Add tests to give coverage of various subtle issues.
> > > +--
> > > +-- XXX This may not be suitable for commit, due to taking up too many cycles.
> > > +--
> > > +-- Here we don't remember the scan's array keys before processing a page, only
> > > +-- after processing a page (which is implicit, it's just the scan's current
> > > +-- keys).  So when we move the scan backwards we think that the top-level scan
> > > +-- should terminate, when in reality it should jump backwards to the leaf page
> > > +-- that we last visited.
> >
> > I notice this adds a complex test case that outputs many rows. Can we
> > do with less rows if we build the index after data insertion, and with
> > a lower (non-default) fillfactor?
>
> Probably not. It was actually very hard to come up with these test
> cases, which tickle the implementation in just the right way to
> demonstrate that the code in places like _bt_steppage() is actually
> required. It took me a rather long time to just prove that much. Not
> sure that we really need this. But thought I'd include it for the time
> being, just so that reviewers could understand those changes.

Makes sense, thanks for the explanation.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
