Thread: Inefficient nbtree behavior with row-comparison quals
I spent some time looking into the performance complaint at [1], which for the sake of self-containedness is CREATE TABLE t(a int, b int); INSERT INTO t(a, b) SELECT (random() * 123456)::int AS a, (random() * 123456)::int AS b FROM generate_series(1, 12345678); CREATE INDEX my_idx ON t USING BTREE (a, b); VACUUM ANALYZE t; explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123444) and a = 0 order by a, b; This produces something like Index Only Scan using my_idx on t (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1) Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0)) Heap Fetches: 0 Buffers: shared hit=1 read=33731 Planning: Buffers: shared read=4 Planning Time: 0.247 ms Execution Time: 475.744 ms showing that we are reading practically the whole index, which is pretty sad considering the index conditions are visibly mutually contradictory. What's going on? I find that: 1. _bt_preprocess_keys, which is responsible for detecting mutually-contradictory index quals, fails to do so because it really just punts on row-comparison quals: it shoves them directly from input to output scankey array without any comparisons to other keys. (In particular, that causes the row-comparison qual to appear before the a = 0 one in the output scankey array.) 2. The initial-positioning logic in _bt_first chooses "a = 0" as determining where to start the scan, because it always prefers equality over inequality keys. (This seems reasonable.) 3. We really should stop the scan once we're past the last a = 0 index entry, which'd at least limit the damage. However, at both a = 0 and later entries, the row-comparison qual fails causing _bt_check_compare to return immediately, without examining the a = 0 key which is marked as SK_BT_REQFWD, and thus it does not clear "continuescan". Only when we finally reach an index entry for which the row-comparison qual succeeds do we notice that a = 0 is failing so we could stop the scan. So this seems pretty horrid. It would be nice if _bt_preprocess_keys were smart enough to notice the contradictory nature of these quals, but I grant that (a) that routine is dauntingly complex already and (b) this doesn't seem like a common enough kind of query to be worth moving heaven and earth to optimize. However, I do think we should do something about the unstated assumption that _bt_preprocess_keys can emit the quals (for a given column) in any random order it feels like. This is evidently not so, and it's probably capable of pessimizing other examples besides this one. Unless we want to slow down _bt_check_compare by making it continue to examine quals after the first failure, we need to insist that required quals appear before non-required quals, thus ensuring that a comparison failure will clear continuescan if possible. Even that looks a little nontrivial, because it seems like nbtree may be making some assumptions about the order in which array keys appear. I see the bit about * ... Some reordering of the keys * within each attribute may be done as a byproduct of the processing here. * That process must leave array scan keys (within an attribute) in the same * order as corresponding entries from the scan's BTArrayKeyInfo array info. which I could cope with, but then there's this down around line 2967: * Note: We do things this way around so that our arrays are * always in the same order as their corresponding scan keys, * even with incomplete opfamilies. _bt_advance_array_keys * depends on this. However, despite the rather over-the-top verbosity of commenting in _bt_advance_array_keys, it's far from clear why or how it depends on that. So I feel a little stuck about what needs to be done here. regards, tom lane [1] https://www.postgresql.org/message-id/CAAdwFAxBjyrYUkH7u%2BEceTaztd1QxBtBY1Teux8K%3DvcGKe%3D%3D-A%40mail.gmail.com
On Sat, May 11, 2024 at 3:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This produces something like > > Index Only Scan using my_idx on t (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1) > Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0)) > Heap Fetches: 0 > Buffers: shared hit=1 read=33731 > Planning: > Buffers: shared read=4 > Planning Time: 0.247 ms > Execution Time: 475.744 ms > > showing that we are reading practically the whole index, which > is pretty sad considering the index conditions are visibly > mutually contradictory. What's going on? There's another problem along these lines, that seems at least as bad: queries involving contradictory >= and <= quals aren't recognized as contradictory during preprocessing. There's no reason why _bt_preprocessing_keys couldn't detect that case; it just doesn't right now. > So this seems pretty horrid. It would be nice if _bt_preprocess_keys > were smart enough to notice the contradictory nature of these quals, > but I grant that (a) that routine is dauntingly complex already and > (b) this doesn't seem like a common enough kind of query to be worth > moving heaven and earth to optimize. I don't think that it would be all that hard. > However, I do think we should do something about the unstated > assumption that _bt_preprocess_keys can emit the quals (for a given > column) in any random order it feels like. This is evidently not so, > and it's probably capable of pessimizing other examples besides this > one. Unless we want to slow down _bt_check_compare by making it > continue to examine quals after the first failure, we need to insist > that required quals appear before non-required quals, thus ensuring > that a comparison failure will clear continuescan if possible. Obviously that general principle is important, but I don't think that we fail to do the right thing anywhere else -- this seems likely to be the only one. Row comparisons are kind of a special case, both during preprocessing and during the scan itself. I find it natural to blame this problem on the fact that preprocessing makes exactly zero effort to detect contradictory conditions that happen to involve a RowCompare. Making non-zero effort in that direction would already be a big improvement. > Even that looks a little nontrivial, because it seems like nbtree > may be making some assumptions about the order in which array keys > appear. I see the bit about > However, despite the rather over-the-top verbosity of commenting in > _bt_advance_array_keys, it's far from clear why or how it depends on > that. So I feel a little stuck about what needs to be done here. The dependency is fairly simple. In the presence of multiple arrays on the same column, which must be contradictory/redundant, but cannot be simplified solely due to lack of suitable cross-type support, we have multiple arrays on the same index column. _bt_advance_array_keys wants to deal with this by assuming that the scan key order matches the array key order. After all, there is no indirection to disambiguate which array belongs to which scan key. We make sure that _bt_advance_array_keys expectations are never violated by having preprocessing make sure that the arrays match input scan key order. Preprocessing must also make sure that the output scan keys are in the same order as the input scan keys. I doubt that this detail makes the task of improving row compare preprocessing any harder. It only comes up in scenarios involving incomplete opfamilies, which is quite niche (obviously it's not a factor in your test case, for example). But even if you assume that incomplete opfamilies are common, it still doesn't seem like this detail matters. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Sat, May 11, 2024 at 3:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> However, despite the rather over-the-top verbosity of commenting in >> _bt_advance_array_keys, it's far from clear why or how it depends on >> that. So I feel a little stuck about what needs to be done here. > The dependency is fairly simple. In the presence of multiple arrays on > the same column, which must be contradictory/redundant, but cannot be > simplified solely due to lack of suitable cross-type support, we have > multiple arrays on the same index column. _bt_advance_array_keys wants > to deal with this by assuming that the scan key order matches the > array key order. I guess what is not clear to me is what you mean by "array key order". Is that simply the order of entries in BTArrayKeyInfo[], or are there additional assumptions/restrictions? > There's another problem along these lines, that seems at least as bad: > queries involving contradictory >= and <= quals aren't recognized as > contradictory during preprocessing. There's no reason why > _bt_preprocessing_keys couldn't detect that case; it just doesn't > right now. Ugh, how'd we miss that? I can take a look at this, unless you're on it already. regards, tom lane
On Sat, May 11, 2024 at 4:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The dependency is fairly simple. In the presence of multiple arrays on > > the same column, which must be contradictory/redundant, but cannot be > > simplified solely due to lack of suitable cross-type support, we have > > multiple arrays on the same index column. _bt_advance_array_keys wants > > to deal with this by assuming that the scan key order matches the > > array key order. > > I guess what is not clear to me is what you mean by "array key order". > Is that simply the order of entries in BTArrayKeyInfo[], or are there > additional assumptions/restrictions? I simply mean the order of the entries in BTArrayKeyInfo[]. > > There's another problem along these lines, that seems at least as bad: > > queries involving contradictory >= and <= quals aren't recognized as > > contradictory during preprocessing. There's no reason why > > _bt_preprocessing_keys couldn't detect that case; it just doesn't > > right now. > > Ugh, how'd we miss that? I can take a look at this, unless you're > on it already. My draft skip scan/MDAM patch already deals with this in passing. So you could say that I was already working on this. But I'm not sure that I would actually say so myself; what I'm doing is tied to far more complicated work. I haven't attempted to write the kind of targeted fix that you're thinking of. It might still be worth writing such a fix now. I certainly have no objections. -- Peter Geoghegan
On Sat, May 11, 2024 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > Row comparisons are kind of a special case, both during preprocessing > and during the scan itself. I find it natural to blame this problem on > the fact that preprocessing makes exactly zero effort to detect > contradictory conditions that happen to involve a RowCompare. Making > non-zero effort in that direction would already be a big improvement. BTW, I'm playing with the idea of eliminating the special case logic around row comparisons scan keys through smarter preprocessing, of the kind that the MDAM paper contemplates for the SQL standard row constructor syntax (under its "Multi-Valued Predicates" section). I'm not sure if I'll get around to that anytime soon, but that sort of approach seems to have a lot to recommend it. Maybe nbtree shouldn't even have to think about row comparisons, except perhaps during preprocessing. (Actually, nbtree already doesn't have to deal with equality row comparisons -- this scheme would mean that it wouldn't have to deal with row comparison inequalities.) -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Sat, May 11, 2024 at 4:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> There's another problem along these lines, that seems at least as bad: >>> queries involving contradictory >= and <= quals aren't recognized as >>> contradictory during preprocessing. There's no reason why >>> _bt_preprocessing_keys couldn't detect that case; it just doesn't >>> right now. >> Ugh, how'd we miss that? I can take a look at this, unless you're >> on it already. > My draft skip scan/MDAM patch already deals with this in passing. So > you could say that I was already working on this. But I'm not sure > that I would actually say so myself; what I'm doing is tied to far > more complicated work. Hmm, I'm generally in favor of a lot of small patches rather than one enormously complex one. Isn't this point something that could be broken out? regards, tom lane
On Sat, May 11, 2024 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hmm, I'm generally in favor of a lot of small patches rather than one > enormously complex one. Isn't this point something that could be > broken out? That's not really possible here. Skip scan generally works by consing up a special "skip" array + equality scan key for attributes that lack input scan keys that use the equality strategy (up to and including the least significant input scan key's index attribute). In the case of quals like "WHERE sdate BETWEEN '2024-01-01' and '2024-01-31'" (assume that there is no index column before "sdate" here), we generate a skip scan key + skip array for "sdate" early during preprocessing. This "array" works in mostly the same way as arrays work in Postgres 17; the big difference is that it procedurally generates its values, on-demand. The values are generated from within given range of values -- often every possible value for the underlying type. Often, but not always -- there's also range predicates to consider. Later preprocessing inside _bt_compare_array_scankey_args() will limit the range of values that our magical skip array generates, when we try to compare it against inequalities. So for the BETWEEN example, both the >= scan key and the <= scan key are "eliminated", though in a way that leaves us with a skip array + scan key that generates the required range of values. It's very easy to make _bt_compare_array_scankey_args() detect the case where a skip array's upper and lower bounds are contradictory, which is how this is handled. That said, there'll likely be cases where this kind of transformation isn't possible. I hope to be able to always set the scan keys up this way, even in cases where skipping isn't expected to be useful (that should be a problem for the index scan to deal with at runtime). But I think I'll probably end up falling short of that ideal in some way or other. Maybe that creates a need to independently detect contradictory >= and <= scan keys (keys that don't go through this skip array preprocessing path). Obviously this is rather up in the air right now. As I said, I think that we could directly fix this case quite easily, if we had to. And I'm sympathetic; this is pretty horrible if you happen to run into it. -- Peter Geoghegan
On Sat, May 11, 2024 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > The dependency is fairly simple. In the presence of multiple arrays on > the same column, which must be contradictory/redundant, but cannot be > simplified solely due to lack of suitable cross-type support, we have > multiple arrays on the same index column. _bt_advance_array_keys wants > to deal with this by assuming that the scan key order matches the > array key order. After all, there is no indirection to disambiguate > which array belongs to which scan key. Minor correction: there is an indirection. We can get from any BTArrayKeyInfo entry to its so->arrayData[] scan key using the BTArrayKeyInfo.scan_key offset. It'd just be inconvenient to do it that way around within _bt_advance_array_keys, since _bt_advance_array_keys's loop iterates through so->arrayData[] in the usual order (just like in _bt_check_compare). There is an assertion within _bt_advance_array_keys (and a couple of other similar assertions elsewhere) that verify that everybody got it right, though. The "Assert(array->scan_key == ikey);" assertion. So if _bt_preprocess_keys ever violated the expectations held by _bt_advance_array_keys, the problem would probably be detected before long. -- Peter Geoghegan