Thread: Inefficient nbtree behavior with row-comparison quals

Inefficient nbtree behavior with row-comparison quals

From

Tom Lane

Date:

11 May 2024, 19:19:18

I spent some time looking into the performance complaint at [1],
which for the sake of self-containedness is

CREATE TABLE t(a int, b int);

INSERT INTO t(a, b)
SELECT
    (random() * 123456)::int AS a,
    (random() * 123456)::int AS b
FROM
    generate_series(1, 12345678);

CREATE INDEX my_idx ON t USING BTREE (a, b);

VACUUM ANALYZE t;

explain (analyze, buffers) select * from t
where row(a, b) > row(123450, 123444) and a = 0
order by a, b;

This produces something like

 Index Only Scan using my_idx on t  (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0))
   Heap Fetches: 0
   Buffers: shared hit=1 read=33731
 Planning:
   Buffers: shared read=4
 Planning Time: 0.247 ms
 Execution Time: 475.744 ms

showing that we are reading practically the whole index, which
is pretty sad considering the index conditions are visibly
mutually contradictory.  What's going on?  I find that:

1. _bt_preprocess_keys, which is responsible for detecting
mutually-contradictory index quals, fails to do so because it
really just punts on row-comparison quals: it shoves them
directly from input to output scankey array without any
comparisons to other keys.  (In particular, that causes the
row-comparison qual to appear before the a = 0 one in the
output scankey array.)

2. The initial-positioning logic in _bt_first chooses "a = 0"
as determining where to start the scan, because it always
prefers equality over inequality keys.  (This seems reasonable.)

3. We really should stop the scan once we're past the last a = 0
index entry, which'd at least limit the damage.  However, at both
a = 0 and later entries, the row-comparison qual fails causing
_bt_check_compare to return immediately, without examining the
a = 0 key which is marked as SK_BT_REQFWD, and thus it does not
clear "continuescan".  Only when we finally reach an index entry
for which the row-comparison qual succeeds do we notice that
a = 0 is failing so we could stop the scan.

So this seems pretty horrid.  It would be nice if _bt_preprocess_keys
were smart enough to notice the contradictory nature of these quals,
but I grant that (a) that routine is dauntingly complex already and
(b) this doesn't seem like a common enough kind of query to be worth
moving heaven and earth to optimize.

However, I do think we should do something about the unstated
assumption that _bt_preprocess_keys can emit the quals (for a given
column) in any random order it feels like.  This is evidently not so,
and it's probably capable of pessimizing other examples besides this
one.  Unless we want to slow down _bt_check_compare by making it
continue to examine quals after the first failure, we need to insist
that required quals appear before non-required quals, thus ensuring
that a comparison failure will clear continuescan if possible.

Even that looks a little nontrivial, because it seems like nbtree
may be making some assumptions about the order in which array keys
appear.  I see the bit about

 * ... Some reordering of the keys
 * within each attribute may be done as a byproduct of the processing here.
 * That process must leave array scan keys (within an attribute) in the same
 * order as corresponding entries from the scan's BTArrayKeyInfo array info.

which I could cope with, but then there's this down around line 2967:

                 * Note: We do things this way around so that our arrays are
                 * always in the same order as their corresponding scan keys,
                 * even with incomplete opfamilies.  _bt_advance_array_keys
                 * depends on this.

However, despite the rather over-the-top verbosity of commenting in
_bt_advance_array_keys, it's far from clear why or how it depends on
that.  So I feel a little stuck about what needs to be done here.

            regards, tom lane

[1] https://www.postgresql.org/message-id/CAAdwFAxBjyrYUkH7u%2BEceTaztd1QxBtBY1Teux8K%3DvcGKe%3D%3D-A%40mail.gmail.com

Re: Inefficient nbtree behavior with row-comparison quals

From

Peter Geoghegan

Date:

11 May 2024, 20:12:18

On Sat, May 11, 2024 at 3:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This produces something like
>
>  Index Only Scan using my_idx on t  (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1)
>    Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0))
>    Heap Fetches: 0
>    Buffers: shared hit=1 read=33731
>  Planning:
>    Buffers: shared read=4
>  Planning Time: 0.247 ms
>  Execution Time: 475.744 ms
>
> showing that we are reading practically the whole index, which
> is pretty sad considering the index conditions are visibly
> mutually contradictory.  What's going on?

There's another problem along these lines, that seems at least as bad:
queries involving contradictory >= and <= quals aren't recognized as
contradictory during preprocessing. There's no reason why
_bt_preprocessing_keys couldn't detect that case; it just doesn't
right now.

> So this seems pretty horrid.  It would be nice if _bt_preprocess_keys
> were smart enough to notice the contradictory nature of these quals,
> but I grant that (a) that routine is dauntingly complex already and
> (b) this doesn't seem like a common enough kind of query to be worth
> moving heaven and earth to optimize.

I don't think that it would be all that hard.

> However, I do think we should do something about the unstated
> assumption that _bt_preprocess_keys can emit the quals (for a given
> column) in any random order it feels like.  This is evidently not so,
> and it's probably capable of pessimizing other examples besides this
> one.  Unless we want to slow down _bt_check_compare by making it
> continue to examine quals after the first failure, we need to insist
> that required quals appear before non-required quals, thus ensuring
> that a comparison failure will clear continuescan if possible.

Obviously that general principle is important, but I don't think that
we fail to do the right thing anywhere else -- this seems likely to be
the only one.

Row comparisons are kind of a special case, both during preprocessing
and during the scan itself. I find it natural to blame this problem on
the fact that preprocessing makes exactly zero effort to detect
contradictory conditions that happen to involve a RowCompare. Making
non-zero effort in that direction would already be a big improvement.

> Even that looks a little nontrivial, because it seems like nbtree
> may be making some assumptions about the order in which array keys
> appear.  I see the bit about

> However, despite the rather over-the-top verbosity of commenting in
> _bt_advance_array_keys, it's far from clear why or how it depends on
> that.  So I feel a little stuck about what needs to be done here.

The dependency is fairly simple. In the presence of multiple arrays on
the same column, which must be contradictory/redundant, but cannot be
simplified solely due to lack of suitable cross-type support, we have
multiple arrays on the same index column. _bt_advance_array_keys wants
to deal with this by assuming that the scan key order matches the
array key order. After all, there is no indirection to disambiguate
which array belongs to which scan key. We make sure that
_bt_advance_array_keys expectations are never violated by having
preprocessing make sure that the arrays match input scan key order.
Preprocessing must also make sure that the output scan keys are in the
same order as the input scan keys.

I doubt that this detail makes the task of improving row compare
preprocessing any harder. It only comes up in scenarios involving
incomplete opfamilies, which is quite niche (obviously it's not a
factor in your test case, for example). But even if you assume that
incomplete opfamilies are common, it still doesn't seem like this
detail matters.

--
Peter Geoghegan

Re: Inefficient nbtree behavior with row-comparison quals

From

Tom Lane

Date:

11 May 2024, 20:21:36

Peter Geoghegan <pg@bowt.ie> writes:
> On Sat, May 11, 2024 at 3:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, despite the rather over-the-top verbosity of commenting in
>> _bt_advance_array_keys, it's far from clear why or how it depends on
>> that.  So I feel a little stuck about what needs to be done here.

> The dependency is fairly simple. In the presence of multiple arrays on
> the same column, which must be contradictory/redundant, but cannot be
> simplified solely due to lack of suitable cross-type support, we have
> multiple arrays on the same index column. _bt_advance_array_keys wants
> to deal with this by assuming that the scan key order matches the
> array key order.

I guess what is not clear to me is what you mean by "array key order".
Is that simply the order of entries in BTArrayKeyInfo[], or are there
additional assumptions/restrictions?

> There's another problem along these lines, that seems at least as bad:
> queries involving contradictory >= and <= quals aren't recognized as
> contradictory during preprocessing. There's no reason why
> _bt_preprocessing_keys couldn't detect that case; it just doesn't
> right now.

Ugh, how'd we miss that?  I can take a look at this, unless you're
on it already.

            regards, tom lane

Re: Inefficient nbtree behavior with row-comparison quals

From

Peter Geoghegan

Date:

11 May 2024, 20:30:48

On Sat, May 11, 2024 at 4:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The dependency is fairly simple. In the presence of multiple arrays on
> > the same column, which must be contradictory/redundant, but cannot be
> > simplified solely due to lack of suitable cross-type support, we have
> > multiple arrays on the same index column. _bt_advance_array_keys wants
> > to deal with this by assuming that the scan key order matches the
> > array key order.
>
> I guess what is not clear to me is what you mean by "array key order".
> Is that simply the order of entries in BTArrayKeyInfo[], or are there
> additional assumptions/restrictions?

I simply mean the order of the entries in BTArrayKeyInfo[].

> > There's another problem along these lines, that seems at least as bad:
> > queries involving contradictory >= and <= quals aren't recognized as
> > contradictory during preprocessing. There's no reason why
> > _bt_preprocessing_keys couldn't detect that case; it just doesn't
> > right now.
>
> Ugh, how'd we miss that?  I can take a look at this, unless you're
> on it already.

My draft skip scan/MDAM patch already deals with this in passing. So
you could say that I was already working on this. But I'm not sure
that I would actually say so myself; what I'm doing is tied to far
more complicated work.

I haven't attempted to write the kind of targeted fix that you're
thinking of. It might still be worth writing such a fix now. I
certainly have no objections.

--
Peter Geoghegan

Re: Inefficient nbtree behavior with row-comparison quals

From

Peter Geoghegan

Date:

11 May 2024, 20:45:57

On Sat, May 11, 2024 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Row comparisons are kind of a special case, both during preprocessing
> and during the scan itself. I find it natural to blame this problem on
> the fact that preprocessing makes exactly zero effort to detect
> contradictory conditions that happen to involve a RowCompare. Making
> non-zero effort in that direction would already be a big improvement.

BTW, I'm playing with the idea of eliminating the special case logic
around row comparisons scan keys through smarter preprocessing, of the
kind that the MDAM paper contemplates for the SQL standard row
constructor syntax (under its "Multi-Valued Predicates" section). I'm
not sure if I'll get around to that anytime soon, but that sort of
approach seems to have a lot to recommend it. Maybe nbtree shouldn't
even have to think about row comparisons, except perhaps during
preprocessing. (Actually, nbtree already doesn't have to deal with
equality row comparisons -- this scheme would mean that it wouldn't
have to deal with row comparison inequalities.)

--
Peter Geoghegan

Re: Inefficient nbtree behavior with row-comparison quals

From

Tom Lane

Date:

11 May 2024, 21:05:18

Peter Geoghegan <pg@bowt.ie> writes:
> On Sat, May 11, 2024 at 4:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> There's another problem along these lines, that seems at least as bad:
>>> queries involving contradictory >= and <= quals aren't recognized as
>>> contradictory during preprocessing. There's no reason why
>>> _bt_preprocessing_keys couldn't detect that case; it just doesn't
>>> right now.

>> Ugh, how'd we miss that?  I can take a look at this, unless you're
>> on it already.

> My draft skip scan/MDAM patch already deals with this in passing. So
> you could say that I was already working on this. But I'm not sure
> that I would actually say so myself; what I'm doing is tied to far
> more complicated work.

Hmm, I'm generally in favor of a lot of small patches rather than one
enormously complex one.  Isn't this point something that could be
broken out?

            regards, tom lane

Re: Inefficient nbtree behavior with row-comparison quals

From

Peter Geoghegan

Date:

11 May 2024, 21:30:45

On Sat, May 11, 2024 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, I'm generally in favor of a lot of small patches rather than one
> enormously complex one.  Isn't this point something that could be
> broken out?

That's not really possible here.

Skip scan generally works by consing up a special "skip" array +
equality scan key for attributes that lack input scan keys that use
the equality strategy (up to and including the least significant input
scan key's index attribute). In the case of quals like "WHERE sdate
BETWEEN '2024-01-01' and '2024-01-31'" (assume that there is no index
column before "sdate" here), we generate a skip scan key + skip array
for "sdate" early during preprocessing. This "array" works in mostly
the same way as arrays work in Postgres 17; the big difference is that
it procedurally generates its values, on-demand. The values are
generated from within given range of values -- often every possible
value for the underlying type. Often, but not always -- there's also
range predicates to consider.

Later preprocessing inside _bt_compare_array_scankey_args() will limit
the range of values that our magical skip array generates, when we try
to compare it against inequalities. So for the BETWEEN example, both
the >= scan key and the <= scan key are "eliminated", though in a way
that leaves us with a skip array + scan key that generates the
required range of values. It's very easy to make
_bt_compare_array_scankey_args() detect the case where a skip array's
upper and lower bounds are contradictory, which is how this is
handled.

That said, there'll likely be cases where this kind of transformation
isn't possible. I hope to be able to always set the scan keys up this
way, even in cases where skipping isn't expected to be useful (that
should be a problem for the index scan to deal with at runtime). But I
think I'll probably end up falling short of that ideal in some way or other.
Maybe that creates a need to independently detect contradictory >= and
<= scan keys (keys that don't go through this skip array preprocessing
path).

Obviously this is rather up in the air right now. As I said, I think
that we could directly fix this case quite easily, if we had to. And
I'm sympathetic; this is pretty horrible if you happen to run into it.

--
Peter Geoghegan

Re: Inefficient nbtree behavior with row-comparison quals

From

Peter Geoghegan

Date:

12 May 2024, 00:08:03

On Sat, May 11, 2024 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The dependency is fairly simple. In the presence of multiple arrays on
> the same column, which must be contradictory/redundant, but cannot be
> simplified solely due to lack of suitable cross-type support, we have
> multiple arrays on the same index column. _bt_advance_array_keys wants
> to deal with this by assuming that the scan key order matches the
> array key order. After all, there is no indirection to disambiguate
> which array belongs to which scan key.

Minor correction: there is an indirection. We can get from any
BTArrayKeyInfo entry to its so->arrayData[] scan key using the
BTArrayKeyInfo.scan_key offset. It'd just be inconvenient to do it
that way around within _bt_advance_array_keys, since
_bt_advance_array_keys's loop iterates through so->arrayData[] in the
usual order (just like in _bt_check_compare).

There is an assertion within _bt_advance_array_keys (and a couple of
other similar assertions elsewhere) that verify that everybody got it
right, though. The "Assert(array->scan_key == ikey);" assertion. So if
_bt_preprocess_keys ever violated the expectations held by
_bt_advance_array_keys, the problem would probably be detected before
long.

--
Peter Geoghegan