Thread: Adding skip scan (including MDAM style range skip scan) to nbtree
Attached is a POC patch that adds skip scan to nbtree. The patch teaches nbtree index scans to efficiently use a composite index on '(a, b)' for queries with a predicate such as "WHERE b = 5". This is feasible in cases where the total number of distinct values in the column 'a' is reasonably small (think tens or hundreds, perhaps even thousands for very large composite indexes). In effect, a skip scan treats this composite index on '(a, b)' as if it was a series of subindexes -- one subindex per distinct value in 'a'. We can exhaustively "search every subindex" using an index qual that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND b = 5" would behave. This approach might be much less efficient than an index scan that can use an index on 'b' alone, but skip scanning can still be orders of magnitude faster than a sequential scan. The user may very well not have a dedicated index on 'b' alone, for whatever reason. Note that the patch doesn't just target these simpler "skip leading index column omitted from the predicate" cases. It's far more general than that -- skipping attributes (or what the patch refers to as skip arrays) can be freely mixed with SAOPs/conventional arrays, in any order you can think of. They can also be combined with inequalities to form range skip arrays. This patch is a direct follow-up to the Postgres 17 work that became commit 5bf748b8. Making everything work well together is an important design goal here. I'll talk more about that further down, and will show a benchmark example query that'll give a good general sense of the value of the patch with these more complicated cases. A note on terminology ===================== The terminology in this area has certain baggage. Many of us will recall the patch that implemented loose index scan. That patch also dubbed itself "skip scan", but that doesn't seem right to me (it's at odds with how other RDBMSs describe features in this area). I would like to address the issues with the terminology in this area now, to avoid creating further confusion. When I use the term "skip scan", I'm referring to a feature that's comparable to the skip scan features from Oracle and MySQL 8.0+. This *isn't* at all comparable to the feature that MySQL calls "loose index scan" -- don't confuse the two features. Loose index scan is a far more specialized technique than skip scan. It only applies within special scans that feed into a DISTINCT group aggregate. Whereas my skip scan patch isn't like that at all -- it's much more general. With my patch, nbtree has exactly the same contract with the executor/core code as before. There are no new index paths generated by the optimizer to make skip scan work, even. Skip scan isn't particularly aimed at improving group aggregates (though the benchmark I'll show happens to involve a group aggregate, simply because the technique works best with large and expensive index scans). My patch is an additive thing, that speeds up what we'd currently refer to as full index scans (as well as range index scans that currently do a "full scan" of a range/subset of an index). These index paths/index scans can no longer really be called "full index scans", of course, but they're still logically the same index paths as before. MDAM and skip scan ================== As I touched on already, the patch actually implements a couple of related optimizations. "Skip scan" might be considered one out of the several optimizations from the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] -- the paper describes skip scan under its "Missing Key Predicate" subsection. I collectively refer to the optimizations from the paper as the "MDAM techniques". Alternatively, you could define these MDAM techniques as each implementing some particular flavor of skip scan, since they all do rather similar things under the hood. In fact, that's how I've chosen to describe things in my patch: it talks about skip scan, and about range skip scan, which is considered a minor variant of skip scan. (Note that the term "skip scan" is never used in the MDAM paper.) MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system. These dimensions are represented by low cardinality columns, each of which appear in a large composite B-Tree index. The emphasis in the paper (and for my patch) is DSS and data warehousing; OLTP apps typically won't benefit as much. Note: Loose index scan *isn't* described by the paper at all. I also wouldn't classify loose index scan as one of the MDAM techniques. I think of it as being in a totally different category, due to the way that it applies semantic information. No MDAM technique will ever apply high-level semantic information about what is truly required by the plan tree, one level up. And so my patch simply teaches nbtree to find the most efficient way of navigating through an index, based solely on information that is readily available to the scan. The same principles apply to all of the other MDAM techniques; they're basically all just another flavor of skip scan (that do some kind of clever preprocessing/transformation that enables reducing the scan to a series of disjunctive accesses, and that could be implemented using the new abstraction I'm calling skip arrays). The paper more or less just applies one core idea, again and again. It's surprising how far that one idea can take you. But it is still just one core idea (don't overlook that point). Range skip scan --------------- To me, the most interesting MDAM technique is probably one that I refer to as "range skip scan" in the patch. This is the technique that the paper introduces first, in its "Intervening Range Predicates" subsection. The best way of explaining it is through an example (you could also just read the paper, which has an example of its own). Imagine a table with just one index: a composite index on "(pdate, customer_id)". Further suppose we have a query such as: SELECT * FROM payments WHERE pdate BETWEEN '2024-01-01' AND '2024-01-30' AND customer_id = 5; -- both index columns (pdate and customer_id) appear in predicate The patch effectively makes the nbtree code execute the index scan as if the query had been written like this instead: SELECT * FROM payments WHERE pdate = ANY ('2024-01-01', '2024-01-02', ..., '2024-01-30') AND customer_id = 5; The use of a "range skip array" within nbtree allows the scan to skip when that makes sense, locating the next date with customer_id = 5 each time (we might skip over many irrelevant leaf pages each time). The scan must also *avoid* skipping when it *doesn't* make sense. As always (since commit 5bf748b8 went in), whether and to what extent we skip using array keys depends in large part on the physical characteristics of the index at runtime. If the tuples that we need to return are all clustered closely together, across only a handful of leaf pages, then we shouldn't be skipping at all. When skipping makes sense, we should skip constantly. I'll discuss the trade-offs in this area a little more below, under "Design". Using multiple MDAM techniques within the same index scan (includes benchmark) ------------------------------------------------------------------------------ I recreated the data in the MDAM paper's "sales" table by making inferences from the paper. It's very roughly the same data set as the paper (close enough to get the general idea across). The table size is about 51GB, and the index is about 25GB (most of the attributes from the table are used as index columns). There is nothing special about this data set -- I just thought it would be cool to "recreate" the queries from the paper, as best I could. Thought that this approach might make my points about the design easier to follow. The index we'll be using for this can be created via: "create index mdam_idx on sales_mdam_paper(dept, sdate, item_class, store)". Again, this is per the paper. It's also the order that the columns appear in every WHERE clause in every query from the paper. (That said, the particular column order from the index definition mostly doesn't matter. Every index column is a low cardinality column, so unless the order used completely obviates the need to skip a column that would otherwise need to be skipped, such as "dept", the effect on query execution time from varying column order is in the noise. Obviously that's very much not how users are used to thinking about composite indexes.) The MDAM paper has numerous example queries, each of which builds on the last, adding one more complication each time -- each of which is addressed by another MDAM technique. The query I'll focus on here is an example query that's towards the end of the paper, and so combines multiple techniques together -- it's the query that appears in the "IN Lists" subsection: select dept, sdate, item_class, store, sum(total_sales) from sales_mdam_paper where -- omitted: leading "dept" column from composite index sdate between '1995-06-01' and '1995-06-30' and item_class in (20, 35, 50) and store in (200, 250) group by dept, sdate, item_class, store order by dept, sdate, item_class, store; On HEAD, when we run this query we either get a sequential scan (which is very slow) or a full index scan (which is almost as slow). Whereas with the patch, nbtree will execute the query as a succession of a few thousand very selective primitive index scans (which usually only scan one leaf page, though some may scan two neighboring leaf pages). Results: The full index scan on HEAD takes about 32 seconds. With the patch, the query takes just under 52ms to execute. That works out to be about 630x faster with the patch. See the attached SQL file for full details. It provides all you'll need to recreate this test result with the patch. Nobody would put up with such an inefficient full index scan in the first place, so the behavior on HEAD is not really a sensible baseline -- 630x isn't very meaningful. I could have come up with a case that showed an even larger improvement if I'd felt like it, but that wouldn't have proven anything. The important point is that the patch makes a huge composite index like the one I've built for this actually make sense, for the first time. So we're not so much making something faster as enabling a whole new approach to indexing -- particularly for data warehousing use cases. The way that Postgres DBAs choose which indexes they'll need to create is likely to be significantly changed by this optimization. I'll break down how this is possible. This query makes use of 3 separate MDAM techniques: 1. A "simple" skip scan (on "dept"). 2. A "range" skip scan (on "sdate"). 3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing new here, except that nbtree needs these regular SAOP arrays to roll over the higher-order skip arrays to trigger moving on to the next dept/date.) Internally, we're just doing a series of several thousand distinct non-overlapping accesses, in index key space order (so as to preserve the appearance of one continuous index scan). These accesses starts out like this: dept=INT_MIN, date='1995-06-01', item_class=20, store=200 (Here _bt_advance_array_keys discovers that the actual lowest dept is 1, not INT_MIN) dept=1, date='1995-06-01', item_class=20, store=200 dept=1, date='1995-06-01', item_class=20, store=250 dept=1, date='1995-06-01', item_class=35, store=200 dept=1, date='1995-06-01', item_class=35, store=250 ... (Side note: as I mentioned, each of the two "store" values usually appear together on the same leaf page in practice. Arguably I should have shown 2 lines/accesses here (for "dept=1"), rather than showing 4. The 4 "dept=1" lines shown required only 2 primitive index scans/index descents/leaf page reads. Disjunctive accesses don't necessarily map 1:1 with primitive/physical index scans.) About another ten thousand similar accesses occur (omitted for brevity). Execution of the scan within nbtree finally ends with these primitive index scans/accesses: ... dept=100, date='1995-06-30', item_class=50, store=250 dept=101, date='1995-06-01', item_class=20, store=200 STOP There is no "dept=101" entry in the index (the highest department in the index happens to be 100). The index scan therefore terminates at this point, having run out of leaf pages to scan (we've reached the rightmost point of the rightmost leaf page, as the scan attempts to locate non-existent dept=101 tuples). Design ====== Since index scans with skip arrays work just like index scans with regular arrays (as of Postgres 17), naturally, there are no special restrictions. Associated optimizer index paths have path keys, and so could (just for example) appear in a merge join, or feed into a group aggregate, while avoiding a sort node. Index scans that skip could also feed into a relocatable cursor. As I mentioned already, the patch adds a skipping mechanism that is purely an additive thing. I think that this will turn out to be an important enabler of using the optimizations, even when there's much uncertainty about how much they'll actually help at runtime. Optimizer --------- We make a broad assumption that skipping is always to our advantage during nbtree preprocessing -- preprocessing generates as many skip arrays as could possibly make sense based on static rules (rules that don't apply any kind of information about data distribution). Of course, skipping isn't necessarily the best approach in all cases, but that's okay. We only actually skip when physical index characteristics show that it makes sense. The real decisions about skipping are all made dynamically. That approach seems far more practicable than preempting the problem during planning or during nbtree preprocessing. It seems like it'd be very hard to model the costs statistically. We need revisions to btcostestimate, of course, but the less we can rely on btcostestimate the better. As I said, there are no new index paths generated by the optimizer for any of this. What do you call an index scan where 90% of all index tuples are 1 of only 3 distinct values, while the remaining 10% of index tuples are all perfectly unique in respect of a leading column? Clearly the best strategy when skipping using the leading column to "use skip scan for 90% of the index, and use a conventional range scan for the remaining 10%". Skipping generally makes sense, but we legitimately need to vary our strategy *during the same index scan*. It makes no sense to think of skip scan as a discrete sort of index scan. I have yet to prove that always having the option of skipping (even when it's very unlikely to help) really does "come for free" -- for now I'm just asserting that that's possible. I'll need proof. I expect to hear some principled skepticism on this point. It's probably not quite there in this v1 of the patch -- there'll be some regressions (I haven't looked very carefully just yet). However, we seem to already be quite close to avoiding regressions from excessive/useless skipping. Extensible infrastructure/support functions ------------------------------------------- Currently, the patch only supports skip scan for a subset of all opclasses -- those that have the required support function #6, or "skip support" function. This provides the opclass with (among other things) a way to increment the current skip array value (or to decrement it, in the case of backward scans). In practice we only have this for a handful of discrete integer (and integer-ish) types. Note that the patch currently cannot skip for an index column that happens to be text. Note that even this v1 supports skip scans that use unsupported types, provided that the input opclass of the specific columns we'll need to skip has support. The patch should be able to support every type/opclass as a target for skipping, regardless of whether an opclass support function happened to be available. That could work by teaching the nbtree code to have explicit probes for the next skip array value in the index, only then combining that new value with the qual from the input scan keys/query. I've put that off for now because it seems less important -- it doesn't really affect anything I've said about the core design, which is what I'm focussing on for now. It makes sense to use increment/decrement whenever feasible, even though it isn't strictly necessary (or won't be, once the patch has the required explicit probe support). The only reason to not apply increment/decrement opclass skip support (that I can see) is because it just isn't practical (this is generally the case for continuous types). While it's slightly onerous to have to invent all this new opclass infrastructure, it definitely makes sense. There is a performance advantage to having skip arrays that can increment through each distinct possible indexable value (this increment/decrement stuff comes from the MDAM paper). The MDAM techniques inherently work best when "skipping" columns of discrete types like integer and date, which is why the paper has examples that all look like that. If you look at my example query and its individual accesses, you'll realize why this is so. Thoughts? [1] https://vldb.org/conf/1995/P710.PDF -- Peter Geoghegan
Attachment
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
From
Aleksander Alekseev
Date:
Hi Peter, > Attached is a POC patch that adds skip scan to nbtree. The patch > teaches nbtree index scans to efficiently use a composite index on > '(a, b)' for queries with a predicate such as "WHERE b = 5". This is > feasible in cases where the total number of distinct values in the > column 'a' is reasonably small (think tens or hundreds, perhaps even > thousands for very large composite indexes). > > [...] > > Thoughts? Many thanks for working on this. I believe it is an important feature and it would be great to deliver it during the PG18 cycle. I experimented with the patch and here are the results I got so far. Firstly, it was compiled on Intel MacOS and ARM Linux. All the tests pass just fine. Secondly, I tested the patch manually using a release build on my Raspberry Pi 5 and the GUCs that can be seen in [1]. Test 1 - simple one. ``` CREATE TABLE test1(c char, n bigint); CREATE INDEX test1_idx ON test1 USING btree(c,n); INSERT INTO test1 SELECT chr(ascii('a') + random(0,2)) AS c, random(0, 1_000_000_000) AS n FROM generate_series(0, 1_000_000); EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000; ``` Test 2 - a more complicated one. ``` CREATE TABLE test2(c1 char, c2 char, n bigint); CREATE INDEX test2_idx ON test2 USING btree(c1,c2,n); INSERT INTO test2 SELECT chr(ascii('a') + random(0,2)) AS c1, chr(ascii('a') + random(0,2)) AS c2, random(0, 1_000_000_000) AS n FROM generate_series(0, 1_000_000); EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test2 WHERE n > 900_000_000; ``` Test 3 - to see how it works with covering indexes. ``` CREATE TABLE test3(c char, n bigint, s text DEFAULT 'text_value' || n); CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s); INSERT INTO test3 SELECT chr(ascii('a') + random(0,2)) AS c, random(0, 1_000_000_000) AS n, 'text_value_' || random(0, 1_000_000_000) AS s FROM generate_series(0, 1_000_000); EXPLAIN [ANALYZE] SELECT s FROM test3 WHERE n < 1000; ``` In all the cases the patch worked as expected. I noticed that with the patch we choose Index Only Scans for Test 1 and without the patch - Parallel Seq Scan. However the Parallel Seq Scan is 2.4 times faster. Before the patch the query takes 53 ms, after the patch - 127 ms. I realize this could be just something specific to my hardware and/or amount of data. Do you think this is something that was expected or something worth investigating further? I haven't looked at the code yet. [1]: https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh -- Best regards, Aleksander Alekseev
On Tue, Jul 2, 2024 at 8:53 AM Aleksander Alekseev <aleksander@timescale.com> wrote: > CREATE TABLE test1(c char, n bigint); > CREATE INDEX test1_idx ON test1 USING btree(c,n); The type "char" (note the quotes) is different from char(1). It just so happens that v1 has support for skipping attributes that use the default opclass for "char", without support for char(1). If you change your table definition to CREATE TABLE test1(c "char", n bigint), then your example queries can use the optimization. This makes a huge difference. > EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000; For example, this first test query goes from needing a full index scan that has 5056 buffer hits to a skip scan that requires only 12 buffer hits. > I noticed that with the patch we choose Index Only Scans for Test 1 > and without the patch - Parallel Seq Scan. However the Parallel Seq > Scan is 2.4 times faster. Before the patch the query takes 53 ms, > after the patch - 127 ms. I'm guessing that it's actually much faster once you change the leading column to the "char" type/default opclass. > I realize this could be just something > specific to my hardware and/or amount of data. The selfuncs.c costing current has a number of problems. One problem is that it doesn't know that some opclasses/types don't support skipping at all. That particular problem should be fixed on the nbtree side; nbtree should support skipping regardless of the opclass that the skipped attribute uses (while still retaining the new opclass support functions for a subset of types where we expect it to make skip scans somewhat faster). -- Peter Geoghegan
On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote: > > EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000; > > For example, this first test query goes from needing a full index scan > that has 5056 buffer hits to a skip scan that requires only 12 buffer > hits. Actually, looks like that's an invalid result. The "char" opclass support function appears to have bugs. My testing totally focussed on types like integer, date, and UUID. The "char" opclass was somewhat of an afterthought. Will fix "char" skip support for v2. -- Peter Geoghegan
On Tue, Jul 2, 2024 at 9:40 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000; > > > > For example, this first test query goes from needing a full index scan > > that has 5056 buffer hits to a skip scan that requires only 12 buffer > > hits. > > Actually, looks like that's an invalid result. The "char" opclass > support function appears to have bugs. Attached v2 fixes this bug. The problem was that the skip support function used by the "char" opclass assumed signed char comparisons, even though the authoritative B-Tree comparator (support function 1) uses signed comparisons (via uint8 casting). A simple oversight. Your test cases will work with this v2, provided you use "char" (instead of unadorned char) in the create table statements. Another small change in v2: I added a DEBUG2 message to nbtree preprocessing, indicating the number of attributes that we're going to skip. This provides an intuitive way to see whether the optimizations are being applied in the first place. That should help to avoid further confusion like this as the patch continues to evolve. Support for char(1) doesn't seem feasible within the confines of a skip support routine. Just like with text (which I touched on in the introductory email), this will require teaching nbtree to perform explicit next-key probes. An approach based on explicit probes is somewhat less efficient in some cases, but it should always work. It's impractical to write opclass support that (say) increments a char value 'a' to 'b'. Making that approach work would require extensive cooperation from the collation provider, and some knowledge of encoding, which just doesn't make sense (if it's possible at all). I don't have the problem with "char" because it isn't a collatable type (it is essentially the same thing as an uint8 integer type, except that it outputs printable ascii characters). FWIW, your test cases don't seem like particularly good showcases for the patch. The queries you came up with require a relatively large amount of random I/O when accessing the heap, which skip scan will never help with -- so skip scan is a small win (at least relative to an unoptimized full index scan). Obviously, no skip scan can ever avoid any required heap accesses compared to a naive full index scan (loose index scan *does* have that capability, which is possible only because it applies semantic information in a way that's very different). FWIW, a more sympathetic version of your test queries would have involved something like "WHERE n = 900_500_000". That would allow the implementation to perform a series of *selective* primitive index scans (one primitive index scan per "c" column/char grouping). That change has the effect of allowing the scan to skip over many irrelevant leaf pages, which is of course the whole point of skip scan. It also makes the scan will require far fewer heap accesses, so heap related costs no longer drown out the nbtree improvements. -- Peter Geoghegan
Attachment
On Tue, Jul 2, 2024 at 12:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached v2 fixes this bug. The problem was that the skip support > function used by the "char" opclass assumed signed char comparisons, > even though the authoritative B-Tree comparator (support function 1) > uses signed comparisons (via uint8 casting). A simple oversight. Although v2 gives correct answers to the queries, the scan itself performs an excessive amount of leaf page accesses. In short, it behaves just like a full index scan would, even though we should expect it to skip over significant runs of the index. So that's another bug. It looks like the queries you posted have a kind of adversarial quality to them, as if they were designed to confuse the implementation. Was it intentional? Did you take them from an existing test suite somewhere? The custom instrumentation I use to debug these issues shows: _bt_readpage: 🍀 1981 with 175 offsets/tuples (leftsib 4032, rightsib 3991) ➡️ _bt_readpage first: (c, n)=(b, 998982285), TID='(1236,173)', 0x7f1464fe9fc0, from non-pivot offnum 2 started page _bt_readpage final: , (nil), continuescan high key check did not set so->currPos.moreRight=false ➡️ 🟢 _bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173, nmatching: 174 ✅ _bt_readpage: 🍀 3991 with 175 offsets/tuples (leftsib 1981, rightsib 9) ➡️ _bt_readpage first: (c, n)=(b, 999474517), TID='(4210,9)', 0x7f1464febfc8, from non-pivot offnum 2 started page _bt_readpage final: , (nil), continuescan high key check did not set so->currPos.moreRight=false ➡️ 🟢 _bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173, nmatching: 174 ✅ _bt_readpage: 🍀 9 with 229 offsets/tuples (leftsib 3991, rightsib 3104) ➡️ _bt_readpage first: (c, n)=(c, 1606), TID='(882,68)', 0x7f1464fedfc0, from non-pivot offnum 2 started page _bt_readpage final: , (nil), continuescan high key check did not set so->currPos.moreRight=false ➡️ 🟢 _bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌ _bt_readpage: 🍀 3104 with 258 offsets/tuples (leftsib 9, rightsib 1685) ➡️ _bt_readpage first: (c, n)=(c, 706836), TID='(3213,4)', 0x7f1464feffc0, from non-pivot offnum 2 started page _bt_readpage final: , (nil), continuescan high key check did not set so->currPos.moreRight=false ➡️ 🟢 _bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌ *** SNIP, many more "nmatching: 0" pages appear after these two *** The final _bt_advance_array_keys call for leaf page 3991 should be scheduling a new primitive index scan (i.e. skipping), but that never happens. Not entirely sure why that is, but it probably has something to do with _bt_advance_array_keys failing to hit the "has_required_opposite_direction_only" path for determining if another primitive scan is required. You're using an inequality required in the opposite-to-scan-direction here, so that path is likely to be relevant. -- Peter Geoghegan
On Tue, Jul 2, 2024 at 12:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > Although v2 gives correct answers to the queries, the scan itself > performs an excessive amount of leaf page accesses. In short, it > behaves just like a full index scan would, even though we should > expect it to skip over significant runs of the index. So that's > another bug. Hit "send" too soon. I simply forgot to run "alter table test1 alter column c type "char";" before running the query. So, I was mistaken about there still being a bug in v2. The issue here is that we don't have support for the underlying type, char(1) -- nothing more. v2 of the patch with your query 1 (when changed to use the "char" type/opclass instead of the currently unsupported char(1) type/opclass) performs 395 index related buffer hits, and 5406 heap block accesses. Whereas it's 3833 index buffer hits with master (naturally, the same 5406 heap accesses are required with master). In short, this query isn't particularly sympathetic to the patch. Nor is it unsympathetic. -- Peter Geoghegan
Re: Adding skip scan (including MDAM style range skip scan) to nbtree
From
Aleksander Alekseev
Date:
Hi Peter, > It looks like the queries you posted have a kind of adversarial > quality to them, as if they were designed to confuse the > implementation. Was it intentional? To some extent. I merely wrote several queries that I would expect should benefit from skip scans. Since I didn't look at the queries you used there was a chance that I will hit something interesting. > Attached v2 fixes this bug. The problem was that the skip support > function used by the "char" opclass assumed signed char comparisons, > even though the authoritative B-Tree comparator (support function 1) > uses signed comparisons (via uint8 casting). A simple oversight. Your > test cases will work with this v2, provided you use "char" (instead of > unadorned char) in the create table statements. Thanks for v2. > If you change your table definition to CREATE TABLE test1(c "char", n > bigint), then your example queries can use the optimization. This > makes a huge difference. You are right, it does. Test1 takes 33.7 ms now (53 ms before the path, x1.57) Test3 I showed before contained an error in the table definition (Postgres can't do `n bigint, s text DEFAULT 'text_value' || n`). Here is the corrected test: ``` CREATE TABLE test3(c "char", n bigint, s text); CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s); INSERT INTO test3 SELECT chr(ascii('a') + random(0,2)) AS c, random(0, 1_000_000_000) AS n, 'text_value_' || random(0, 1_000_000_000) AS s FROM generate_series(0, 1_000_000); EXPLAIN ANALYZE SELECT s FROM test3 WHERE n < 10_000; ``` It runs fast (< 1 ms) and uses the index, as expected. Test2 with "char" doesn't seem to benefit from the patch anymore (pretty sure it did in v1). It always chooses Parallel Seq Scans even if I change the condition to `WHERE n > 999_995_000` or `WHERE n = 999_997_362`. Is it an expected behavior? I also tried Test4 and Test5. In Test4 I was curious if scip scans work properly with functional indexes: ``` CREATE TABLE test4(d date, n bigint); CREATE INDEX test4_idx ON test4 USING btree(extract(year from d),n); INSERT INTO test4 SELECT ('2024-' || random(1,12) || '-' || random(1,28)) :: date AS d, random(0, 1_000_000_000) AS n FROM generate_series(0, 1_000_000); EXPLAIN ANALYZE SELECT COUNT(*) FROM test4 WHERE n > 900_000_000; ``` The query uses Index Scan, however the performance is worse than with Seq Scan chosen before the patch. It doesn't matter if I choose '>' or '=' condition. Test5 checks how skip scans work with partial indexes: ``` CREATE TABLE test5(c "char", n bigint); CREATE INDEX test5_idx ON test5 USING btree(c, n) WHERE n > 900_000_000; INSERT INTO test5 SELECT chr(ascii('a') + random(0,2)) AS c, random(0, 1_000_000_000) AS n FROM generate_series(0, 1_000_000); EXPLAIN ANALYZE SELECT COUNT(*) FROM test5 WHERE n > 950_000_000; ``` It runs fast and choses Index Only Scan. But then I discovered that without the patch Postgres also uses Index Only Scan for this query. I didn't know it could do this - what is the name of this technique? The query takes 17.6 ms with the patch, 21 ms without the patch. Not a huge win but still. That's all I have for now. -- Best regards, Aleksander Alekseev
On Fri, Jul 5, 2024 at 7:04 AM Aleksander Alekseev <aleksander@timescale.com> wrote: > Test2 with "char" doesn't seem to benefit from the patch anymore > (pretty sure it did in v1). It always chooses Parallel Seq Scans even > if I change the condition to `WHERE n > 999_995_000` or `WHERE n = > 999_997_362`. Is it an expected behavior? The "char" opclass's skip support routine was totally broken in v1, so its performance isn't really relevant. In any case v2 didn't make any changes to the costing, so I'd expect it to use exactly the same query plan as v1. > The query uses Index Scan, however the performance is worse than with > Seq Scan chosen before the patch. It doesn't matter if I choose '>' or > '=' condition. That's because the index has a leading/skipped column of type "numeric", which isn't a supported type just yet (a supported B-Tree opclass, actually). The optimization is effective if you create the expression index with a cast to integer: CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n); This performs much better. Now I see "DEBUG: skipping 1 index attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the optimization has in fact been used as expected. There are far fewer buffers hit with this version of your test4, which also indicates that the optimization has been effective. Note that the original numeric expression index test4 showed "DEBUG: skipping 0 index attributes" when the test query ran, which indicated that the optimization couldn't be used. I suggest that you look out for that, by running "set client_min_messages to debug2;" from psql when testing the patch. > It runs fast and choses Index Only Scan. But then I discovered that > without the patch Postgres also uses Index Only Scan for this query. I > didn't know it could do this - what is the name of this technique? It is a full index scan. These have been possible for many years now (possibly well over 20 years). Arguably, the numeric case that didn't use the optimization (your test4) should have been costed as a full index scan, but it wasn't -- that's why you didn't get a faster sequential scan, which would have made a little bit more sense. In general, the costing changes in the patch are very rough. That said, this particular problem (the test4 numeric issue) should be fixed by inventing a way for nbtree to use skip scan with types that lack skip support. It's not primarily a problem with the costing. At least not in my mind. -- Peter Geoghegan
On Fri, Jul 5, 2024 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n); > > This performs much better. Now I see "DEBUG: skipping 1 index > attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT > COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the > optimization has in fact been used as expected. There are far fewer > buffers hit with this version of your test4, which also indicates that > the optimization has been effective. Actually, with an index-only scan it is 281 buffer hits (including some small number of VM buffer hits) with the patch, versus 2736 buffer hits on master. So a big change to the number of index page accesses only. If you use a plain index scan for this, then the cost of random heap accesses totally dominates, so skip scan cannot possibly give much benefit. Even a similar bitmap scan requires 4425 distinct heap page accesses, which is significantly more than the total number of index pages in the index. 4425 heap pages is almost the entire table; the table consists of 4480 mainfork blocks. This is a very nonselective query. It's not at all surprising that this query (and others like it) hardly benefit at all, except when we can use an index-only scan (so that the cost of heap accesses doesn't totally dominate). -- Peter Geoghegan
Hi, Since I'd like to understand the skip scan to improve the EXPLAIN output for multicolumn B-Tree Index[1], I began to try the skip scan with some queries and look into the source code. I have some feedback and comments. (1) At first, I was surprised to look at your benchmark result because the skip scan index can improve much performance. I agree that there are many users to be happy with the feature for especially OLAP use-case. I expected to use v18. (2) I found the cost is estimated to much higher if the number of skipped attributes is more than two. Is it expected behavior? # Test result. The attached file is the detail of tests. -- Index Scan -- The actual time is low since the skip scan works well -- But the cost is higher than one of seqscan EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991loops=1) Output: id1, id2, id3, value Index Cond: (test.id3 = 101) Buffers: shared hit=4402 Planning: Buffers: shared hit=7 Planning Time: 0.234 ms Execution Time: 15.711 ms (8 rows) -- Seq Scan -- actual time is high, but the cost is lower than one of the above Index Scan. EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 rows=991 loops=1) Output: id1, id2, id3, value Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=6370 -> Parallel Seq Scan on public.test (cost=0.00..11578.33 rows=410 width=20) (actual time=0.061..102.016 rows=330 loops=3) Output: id1, id2, id3, value Filter: (test.id3 = 101) Rows Removed by Filter: 333003 Buffers: shared hit=6370 Worker 0: actual time=0.099..98.014 rows=315 loops=1 Buffers: shared hit=2066 Worker 1: actual time=0.054..97.162 rows=299 loops=1 Buffers: shared hit=1858 Planning: Buffers: shared hit=19 Planning Time: 0.194 ms Execution Time: 114.129 ms (18 rows) I look at btcostestimate() to find the reason and found the bound quals and cost.num_sa_scans are different from my expectation. My assumption is * bound quals is id3=XXX (and id1 and id2 are skipped attributes) * cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans per skipped attribute) But it's wrong. The above index scan result is * bound quals is NULL * cost.num_sa_scans = 1 As I know you said the below, but I'd like to know the above is expected or not. > That approach seems far more practicable than preempting the problem > during planning or during nbtree preprocessing. It seems like it'd be > very hard to model the costs statistically. We need revisions to > btcostestimate, of course, but the less we can rely on btcostestimate > the better. As I said, there are no new index paths generated by the > optimizer for any of this. I couldn't understand why there is the below logic well. btcostestimate() (...omit...) if (indexcol != iclause->indexcol) { /* no quals at all for indexcol */ found_skip = true; if (index->pages < 100) break; num_sa_scans += 10 * (indexcol - iclause->indexcol); // why add minus value? continue; // why skip to add bound quals? } (3) Currently, there is an assumption that "there will be 10 primitive index scans per skipped attribute". Is any chance to use pg_stats.n_distinct? [1] Improve EXPLAIN output for multicolumn B-Tree Index https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote: > Since I'd like to understand the skip scan to improve the EXPLAIN output > for multicolumn B-Tree Index[1], I began to try the skip scan with some > queries and look into the source code. Thanks for the review! Attached is v3, which generalizes skip scan, allowing it to work with opclasses/types that lack a skip support routine. In other words, v3 makes skip scan work for all types, including continuous types, where it's impractical or infeasible to add skip support. So now important types like text and numeric also get the skip scan optimization (it's not just discrete types like integer and date, as in previous versions). I feel very strongly that everything should be implemented as part of the new skip array abstraction; the patch should only add the concept of skip arrays, which should work just like SAOP arrays. We should avoid introducing any special cases. In short, _bt_advance_array_keys should work in exactly the same way as it does as of Postgres 17 (except for a few representational differences for skip arrays). This seems essential because _bt_advance_array_keys inherently need to be able to trigger moving on to the next skip array value when it reaches the end of a SAOP array (and vice-versa). And so it just makes sense to abstract-away the differences, hiding the difference in lower level code. I have described the new _bt_first behavior that is now available in this new v3 of the patch as "adding explicit next key probes". While v3 does make new changes to _bt_first, it's not really a special kind of index probe. v3 invents new sentinel values instead. The use of sentinels avoids inventing true special cases: the values -inf, +inf, as well as variants of = that use a real datum value, but match on the next key in the index. These new = variants can be thought of as "+infinitesimal" values. So when _bt_advance_array_keys has to "increment" the numeric value 5.0, it sets the scan key to the value "5.0 +infinitesimal". There can never be any matching tuples in the index (just like with -inf sentinel values), but that doesn't matter. So the changes v3 makes to _bt_first doesn't change the basic conceptual model. The added complexity is kept to a manageable level, particularly within _bt_advance_array_keys, which is already very complicated. To help with testing and review, I've added another temporary testing GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to avoid using skip support, even where available. The GUC makes it easy to measure how skip support routines can help performance (with discrete types like integer and date). > I found the cost is estimated to much higher if the number of skipped attributes > is more than two. Is it expected behavior? Yes and no. Honestly, the current costing is just placeholder code. It is totally inadequate. I'm not surprised that you found problems with it. I just didn't put much work into it, because I didn't really know what to do. > # Test result. The attached file is the detail of tests. > > -- Index Scan > -- The actual time is low since the skip scan works well > -- But the cost is higher than one of seqscan > EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------- > Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991loops=1) > Output: id1, id2, id3, value > Index Cond: (test.id3 = 101) > Buffers: shared hit=4402 > Planning: > Buffers: shared hit=7 > Planning Time: 0.234 ms > Execution Time: 15.711 ms > (8 rows) This is a useful example, because it shows the difficulty with the costing. I ran this query using my own custom instrumentation of the scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf pages at a time, but we still come out ahead. As you pointed out, it's ~7.5x faster than the sequential scan, but not very different to the equivalent full index scan. At least not very different in terms of leaf page accesses. Why should we win by this much, for what seems like a marginal case for skip scan? Even cases where "skipping" doesn't manage to skip any leaf pages can still benefit from skipping *index tuples* -- there is more than one kind of skipping to consider. That is, the patch helps a lot with some (though not all) cases where I didn't really expect that to happen: the Postgres 17 SAOP tuple skipping code (the code in _bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps quite a bit in "marginal" skip scan cases, even though it wasn't really designed for that purpose (it was added to avoid regressions in SAOP array scans for the Postgres 17 work). I find that some queries using my original example test case are about twice as fast as an equivalent full index scan, even when only the fourth and final index column is used in the query predicate. The scan can't even skip a single leaf page at a time, and yet we still win by a nice amount. We win, though it is almost by mistake! This is mostly a good thing. Both for the obvious reason (fast is better than slow), and because it justifies being so aggressive in assuming that skip scan might work out during planning (being wrong without really losing is nice). But there is also a downside: it makes it even harder to model costs at runtime, from within the optimizer. If I measure the actual runtime costs other than runtime (e.g., buffers accesses), I'm not sure that the optimizer is wrong to think that the parallel sequential scan is faster. It looks approximately correct. It is only when we look at runtime that the optimizer's choice looks wrong. Which is...awkward. In general, I have very little idea about how to improve the costing within btcostestimate. I am hoping that somebody has better ideas about it. btcostestimate is definitely the area where the patch is weakest right now. > I look at btcostestimate() to find the reason and found the bound quals > and cost.num_sa_scans are different from my expectation. > > My assumption is > * bound quals is id3=XXX (and id1 and id2 are skipped attributes) > * cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans > per skipped attribute) > > But it's wrong. The above index scan result is > * bound quals is NULL > * cost.num_sa_scans = 1 The logic with cost.num_sa_scans was definitely not what I intended. That's fixed in v3, at least. But the code in btcostestimate is still essentially the same as in earlier versions -- it needs to be completely redesigned (or, uh, designed for the first time). > As I know you said the below, but I'd like to know the above is expected or not. > Currently, there is an assumption that "there will be 10 primitive index scans > per skipped attribute". Is any chance to use pg_stats.n_distinct? It probably makes sense to use pg_stats.n_distinct here. But how? If the problem is that we're too pessimistic, then I think that this will usually (though not always) make us more pessimistic. Isn't that the wrong direction to go in? (We're probably also too optimistic in some cases, but being too pessimistic is a bigger problem in practice.) For example, your test case involved 11 distinct values in each column. The current approach of hard-coding 10 (which is just a temporary hack) should actually make the scan look a bit cheaper than it would if we used the true ndistinct. Another underlying problem is that the existing SAOP costing really isn't very accurate, without skip scan -- that's a big source of the pessimism with arrays/skipping. Why should we be able to get the true number of primitive index scans just by multiplying together each omitted prefix column's ndistinct? That approach is good for getting the worst case, which is probably relevant -- but it's probably not a very good assumption for the average case. (Though at least we can cap the total number of primitive index scans to 1/3 of the total number of pages in the index in btcostestimate, since we have guarantees about the worst case as of Postgres 17.) -- Peter Geoghegan
Attachment
On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v3, which generalizes skip scan, allowing it to work with > opclasses/types that lack a skip support routine. In other words, v3 > makes skip scan work for all types, including continuous types, where > it's impractical or infeasible to add skip support. Attached is v4, which: * Fixes a previous FIXME item affecting range skip scans/skip arrays used in cross-type scenarios. * Refactors and simplifies the handling of range inequalities associated with skip arrays more generally. We now always use inequality scan keys during array advancement (and when descending the tree within _bt_first), rather than trying to use a datum taken from the range inequality as an array element directly. This gives us cleaner separation between scan keys/data types in cross-type scenarios: skip arrays will now only ever contain "elements" of opclass input type. Sentinel values such as -inf are expanded to represent "the lowest possible value that comes after the array's low_compare lower bound, if any". Opclasses that don't offer skip support took roughly this same approach within v3, but in v4 all opclasses do it the same way (so opclasses with skip support use the SK_BT_NEG_INF sentinel marking in their scan keys, though never the SK_BT_NEXTKEY sentinel marking). This is really just a refactoring revision. Nothing particularly exciting here compared to v3. -- Peter Geoghegan
Attachment
> On Wed, Jun 26, 2024 at 03:16:07PM GMT, Peter Geoghegan wrote: > > Loose index scan is a far more specialized technique than skip scan. > It only applies within special scans that feed into a DISTINCT group > aggregate. Whereas my skip scan patch isn't like that at all -- it's > much more general. With my patch, nbtree has exactly the same contract > with the executor/core code as before. There are no new index paths > generated by the optimizer to make skip scan work, even. Skip scan > isn't particularly aimed at improving group aggregates (though the > benchmark I'll show happens to involve a group aggregate, simply > because the technique works best with large and expensive index > scans). I see that the patch is not supposed to deal with aggregates in any special way. But from what I understand after a quick review, skip scan is not getting applied to them if there are no quals in the query (in that case _bt_preprocess_keys returns before calling _bt_preprocess_array_keys). Yet such queries could benefit from skipping, I assume they still could be handled by the machinery introduced in this patch? > > Currently, there is an assumption that "there will be 10 primitive index scans > > per skipped attribute". Is any chance to use pg_stats.n_distinct? > > It probably makes sense to use pg_stats.n_distinct here. But how? > > If the problem is that we're too pessimistic, then I think that this > will usually (though not always) make us more pessimistic. Isn't that > the wrong direction to go in? (We're probably also too optimistic in > some cases, but being too pessimistic is a bigger problem in > practice.) > > For example, your test case involved 11 distinct values in each > column. The current approach of hard-coding 10 (which is just a > temporary hack) should actually make the scan look a bit cheaper than > it would if we used the true ndistinct. > > Another underlying problem is that the existing SAOP costing really > isn't very accurate, without skip scan -- that's a big source of the > pessimism with arrays/skipping. Why should we be able to get the true > number of primitive index scans just by multiplying together each > omitted prefix column's ndistinct? That approach is good for getting > the worst case, which is probably relevant -- but it's probably not a > very good assumption for the average case. (Though at least we can cap > the total number of primitive index scans to 1/3 of the total number > of pages in the index in btcostestimate, since we have guarantees > about the worst case as of Postgres 17.) Do I understand correctly, that the only way how multiplying ndistincts could produce too pessimistic results is when there is a correlation between distinct values? Can one benefit from the extended statistics here? And while we're at it, I think it would be great if the implementation will allow some level of visibility about the skip scan. From what I see, currently it's by design impossible for users to tell whether something was skipped or not. But when it comes to planning and estimates, maybe it's not a bad idea to let explain analyze show something like "expected number of primitive scans / actual number of primitive scans".
On Sat, Aug 3, 2024 at 3:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > I see that the patch is not supposed to deal with aggregates in any special > way. Right. > But from what I understand after a quick review, skip scan is not getting > applied to them if there are no quals in the query (in that case > _bt_preprocess_keys returns before calling _bt_preprocess_array_keys). Right. > Yet such queries could benefit from skipping, I assume they still could be handled by > the machinery introduced in this patch? I'm not sure. There are no real changes required inside _bt_advance_array_keys with this patch -- skip arrays are dealt with in essentially the same way as conventional arrays (as of Postgres 17). I suspect that loose index scan would be best implemented using _bt_advance_array_keys. It could also "plug in" to the existing _bt_advance_array_keys design, I suppose. As I touched on already, your loose index scan patch applies high-level semantic information in a way that is very different to my skip scan patch. This means that it makes revisions to the index AM API (if memory serves it adds a callback called amskip to that API). It also means that loose index scan can actually avoid heap accesses; loose scans wholly avoid accessing logical rows (in both the index and the heap) by reasoning that it just isn't necessary to do so at all. Skipping happens in both data structures. Right? Obviously, my new skip scan patch cannot possibly reduce the number of heap page accesses required by a given index scan. Precisely the same logical rows must be accessed as before. There is no two-way conversation between the index AM and the table AM about which rows/row groupings have at least one visible tuple. We're just navigating through the index more efficiently, without changing any contract outside of nbtree itself. The "skip scan" name collision is regrettable. But the fact is that Oracle, MySQL, and now SQLite all call this feature skip scan. That feels like the right precedent to follow. > Do I understand correctly, that the only way how multiplying ndistincts could > produce too pessimistic results is when there is a correlation between distinct > values? Yes, that's one problem with the costing. Not the only one, though. The true number of primitive index scans depends on the cardinality of the data. For example, a skip scan might be the cheapest plan by far if (say) 90% of the index has the same leading column value and the remaining 10% has totally unique values. We'd still do a bad job of costing this query with an accurate ndistinct for the leading column. We really one need to do one or two primitive index scans for "the first 90% of the index", and one more primitive index scan for "the remaining 10% of the index". For a query such as this, we "require a full index scan for the remaining 10% of the index", which is suboptimal, but doesn't fundamentally change anything (I guess that a skip scan is always suboptimal, in the sense that you could always do better by having more indexes). > Can one benefit from the extended statistics here? I really don't know. Certainly seems possible in cases with more than one skipped leading column. The main problem with the costing right now is that it's just not very well thought through, in general. The performance at runtime depends on the layout of values in the index itself, so the underlying way that you'd model the costs doesn't have any great precedent in costsize.c. We do have some idea of the number of leaf pages we'll access in btcostestimate(), but that works in a way that isn't really fit for purpose. It kind of works with one primitive index scan, but works much less well with multiple primitive scans. > And while we're at it, I think it would be great if the implementation will > allow some level of visibility about the skip scan. From what I see, currently > it's by design impossible for users to tell whether something was skipped or > not. But when it comes to planning and estimates, maybe it's not a bad idea to > let explain analyze show something like "expected number of primitive scans / > actual number of primitive scans". I agree. I think that that's pretty much mandatory for this patch. At least the actual number of primitive scans should be exposed. Not quite as sure about showing the estimated number, since that might be embarrassingly wrong quite regularly, without it necessarily mattering that much (I'd worry that it'd be distracting). Displaying the number of primitive scans would already be useful for index scans with SAOPs, even without this patch. The same general concepts (estimated vs. actual primitive index scans) already exist, as of Postgres 17. That's really nothing new. -- Peter Geoghegan
On Sat, Aug 3, 2024 at 6:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > Displaying the number of primitive scans would already be useful for > index scans with SAOPs, even without this patch. The same general > concepts (estimated vs. actual primitive index scans) already exist, > as of Postgres 17. That's really nothing new. We actually expose this via instrumentation, in a certain sense. This is documented by a "Note": https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-INDEXES-VIEW That is, we already say "Each internal primitive index scan increments pg_stat_all_indexes.idx_scan, so it's possible for the count of index scans to significantly exceed the total number of index scan executor node executions". So, as I said in the last email, advertising the difference between # of primitive index scans and # of index scan executor node executions in EXPLAIN ANALYZE is already a good idea. -- Peter Geoghegan
On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v4 Attached is v5, which splits the code from v4 patch into 2 pieces -- it becomes 0002-* and 0003-*. Certain refactoring work now appears under its own separate patch/commit -- see 0002-* (nothing new here, except the commit message/patch structure). The patch that actually adds skip scan (0003-* in this new version) has been further polished, though not in a way that I think is interesting enough to go into here. The interesting and notable change for v5 is the addition of the code in 0001-*. The new 0001-* patch is concerned with certain aspects of how _bt_advance_array_keys decides whether to start another primitive index scan (or to stick with the ongoing one for one more leaf page instead). This is a behavioral change, albeit a subtle one. It's also kinda independent of skip scan (more on why that is at the end). It's easiest to explain why 0001-* matters by way of an example. My example will show significantly more internal/root page accesses than seen on master, though only when 0002-* and 0003-* are applied, and 0001-* is omitted. When all 3 v5 patches are applied together, the total number of index pages accessed by the test query will match the master branch. It's important that skip scan never loses by much to the master branch, of course. Even when the details of the index/scan are inconvenient to the implementation, in whatever way. Setup: create table demo (int4 a, numeric b); create index demo_idx on demo (a, b); insert into demo select a, random() from generate_series(1, 10000) a, generate_series(1,5) five_rows_per_a_val; vacuum demo; We now have a btree index "demo_idx", which has two levels (a root page plus a leaf level). The root page contains several hundred pivot tuples, all of which have their "b" value truncated away (or have the value -inf, if you prefer), with just one prefix "a" column left in place. Naturally, every leaf page has a high key with its own separator key that matches one particular tuple that appears in the root page (except for the rightmost leaf page). So our leaf level scan will see lots of truncated leaf page high keys (all matching a corresponding root page tuple). Test query: select a from demo where b > 0.99; This is a query that really shouldn't be doing any skipping at all. We nevertheless still see a huge amount of skipping with this query, ocne 0001-* is omitted. Prior to 0001-*, a new primitive index scan is started whenever the scan reaches a "boundary" between adjoining leaf pages. That is, whenever _bt_advance_array_keys stopped on a high key pstate.finaltup. So without the new 0001-* work, the number of page accesses almost doubles (because we access the root page once per leaf page accessed, instead of just accessing it once for the whole scan). What skip scan should have been doing all along (and will do now) is to step forward to the next right sibling leaf page whenever it reaches a boundary between leaf pages. This should happen again and again, without our ever choosing to start a new primitive index scan instead (it shouldn't happen even once with this query). In other words, we ought to behave just like a full index scan would behave with this query -- which is exactly what we get on master. The scan will still nominally "use skip scan" even with this fix in place, but in practice, for this particular query/index, the scan won't ever actually decide to skip. So it at least "looks like" an index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There is a separate question of how many CPU cycles we use to do all this, but for now my focus is on total pages accessed by the patch versus on master, especially for adversarial cases such as this. It should be noted that the skip scan patch never had any problems with this very similar query (same table as before): select a from demo where b < 0.01; The fact that we did the wrong thing for the first query, but the right thing for this second similar query, was solely due to certain accidental implementation details -- it had nothing to do with the fundamentals of the problem. You might even say that 0001-* makes the original "b > 0.99" case behave in the same manner as this similar "b < 0.01" case, which is justifiable on consistency grounds. Why wouldn't these two cases behave similarly? It's only logical. The underlying problem arguably has little to do with skip scan; whether we use a real SAOP array on "a" or a consed up skip array is incidental to the problem that my example highlights. As always, the underlying "array type" (skip vs SOAP) only matters to the lowest level code. And so technically, this is an existing issue on HEAD/master. You can see that for yourself by making the problematic query's qual "where a = any ('every possible a value') and b > 0.99" -- same problem on Postgres 17, without involving skip scan. To be sure, the underlying problem does become more practically relevant with the invention of skip arrays for skip scan, but 0001-* can still be treated as independent work. It can be committed well ahead of the other stuff IMV. The same is likely also true of the refactoring now done in 0002-* -- it does refactoring that makes sense, even without skip scan. And so I don't expect it to take all that long for it to be committable. -- Peter Geoghegan
Attachment
On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas@vondra.me> wrote: > I started looking at this patch today. Thanks for taking a look! > The first thing I usually do for > new patches is a stress test, so I did a simple script that generates > random table and runs a random query with IN() clause with various > configs (parallel query, index-only scans, ...). And it got stuck on a > parallel query pretty quick. I can reproduce this locally, without too much difficulty. Unfortunately, this is a bug on master/Postgres 17. Some kind of issue in my commit 5bf748b8. The timing of this is slightly unfortunate. There's only a few weeks until the release of 17, plus I have to travel for work over the next week. I won't be back until the 16th, and will have limited availability between then and now. I think that I'll have ample time to debug and fix the issue ahead of the release of 17, though. Looks like the problem is a parallel index scan with SAOP array keys can find itself in a state where every parallel worker waits for the leader to finish off a scheduled primitive index scan, while the leader itself waits for the scan's tuple queue to return more tuples. Obviously, the query will effectively go to sleep indefinitely when that happens (unless and until the DBA cancels the query). This is only possible with just the right/wrong combination of array keys and index cardinality. I cannot recreate the problem with parallel_leader_participation=off, which strongly suggests that leader participation is a factor. I'll find time to study this in detail as soon as I can. Further background: I was always aware of the leader's tendency to go away forever shortly after the scan begins. That was supposed to be safe, since we account for it by serializing the scan's current array keys in shared memory, at the point a primitive index scan is scheduled -- any backend should be able to pick up where any other backend left off, no matter how primitive scans are scheduled. That now doesn't seem to be completely robust, likely due to restrictions on when and how other backends can pick up the scheduled work from within _bt_first, at the point that it calls _bt_parallel_seize. In short, one or two details of how backends call _bt_parallel_seize to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought. -- Peter Geoghegan
On 9/12/24 16:49, Matthias van de Meent wrote: > On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <pg@bowt.ie> wrote: >> > ... > > The fix in 0001 is relatively simple: we stop backends from waiting > for a concurrent backend to resolve the NEED_PRIMSCAN condition, and > instead move our local state machine so that we'll hit _bt_first > ourselves, so that we may be able to start the next primitive scan. > Also attached is 0002, which adds tracking of responsible backends to > parallel btree scans, thus allowing us to assert we're never waiting > for our own process to move the state forward. I found this patch > helpful while working on solving this issue, even if it wouldn't have > found the bug as reported. > No opinion on the analysis / coding, but per my testing the fix indeed addresses the issue. The script reliably got stuck within a minute, now it's running for ~1h just fine. It also checks results and that seems fine too, so that seems fine too. regards -- Tomas Vondra
On Mon, Sep 16, 2024 at 3:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree with your approach, but I'm concerned about it causing > confusion inside _bt_parallel_done. And so I attach a v2 revision of > your bug fix. v2 adds a check that nails that down, too. Pushed this just now. Thanks -- Peter Geoghegan
On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas@vondra.me> wrote: > I've been looking at this patch over the couple last days, mostly doing > some stress testing / benchmarking (hence the earlier report) and basic > review. Thanks for taking a look! Very helpful. > I do have some initial review comments, and the testing produced > some interesting regressions (not sure if those are the cases where > skipscan can't really help, that Peter mentioned he needs to look into). The one type of query that's clearly regressed in a way that's just not acceptable are queries where we waste CPU cycles during scans where it's truly hopeless. For example, I see a big regression on one of the best cases for the Postgres 17 work, described here: https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement Notably, these cases access exactly the same buffers/pages as before, so this really isn't a matter of "doing too much skipping". The number of buffers hit exactly matches what you'll see on Postgres 17. It's just that we waste too many CPU cycles in code such as _bt_advance_array_keys, to uselessly maintain skip arrays. I'm not suggesting that there won't be any gray area with these regressions -- nothing like this will ever be that simple. But it seems to me like I should go fix these obviously-not-okay cases next, and then see where that leaves everything else, regressions-wise. That seems likely to be the most efficient way of dealing with the regressions. So I'll start there. That said, I *would* be surprised if you found a regression in any query that simply didn't receive any new scan key transformations in new preprocessing code in places like _bt_decide_skipatts and _bt_skip_preproc_shrink. I see that many of the queries that you're using for your stress-tests "aren't really testing skip scan", in this sense. But I'm hardly about to tell you that you shouldn't spend time on such queries -- that approach just discovered a bug affecting Postgres 17 (that was also surprising, but it still happened!). My point is that it's worth being aware of which test queries actually use skip arrays in the first place -- it might help you with your testing. There are essentially no changes to _bt_advance_array_keys that'll affect traditional SAOP arrays (with the sole exception of changes made by v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which affect every kind of array in the same way). > 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch > > - I find the places that increment "nsearches" a bit random. Each AM > does it in entirely different place (at least it seems like that to me). > Is there a way make this a bit more consistent? From a mechanical perspective there is nothing at all random about it: we do this at precisely the same point that we currently call pgstat_count_index_scan, which in each index AM maps to one descent of the index. It is at least consistent. Whenever a B-Tree index scan shows "Index Scans: N", you'll see precisely the same number by swapping it with an equivalent contrib/btree_gist-based GiST index and running the same query again (assuming that index tuples that match the array keys are spread apart in both the B-Tree and GiST indexes). (Though I see problems with the precise place that nbtree calls pgstat_count_index_scan right now, at least in certain edge-cases, which I discuss below in response to your questions about that.) > uint64 btps_nsearches; /* instrumentation */ > > Instrumentation what? What's the counter for? Will fix. In case you missed it, there is another thread + CF Entry dedicated to discussing this instrumentation patch: https://commitfest.postgresql.org/49/5183/ https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com > - I see _bt_first moved the pgstat_count_index_scan, but doesn't that > mean we skip it if the earlier code does "goto readcomplete"? Shouldn't > that still count as an index scan? In my opinion, no, it should not. We're counting the number of times we'll have descended the tree using _bt_search (or using _bt_endpoint, perhaps), which is a precisely defined physical cost. A little like counting the number of buffers accessed. I actually think that this aspect of how we call pgstat_count_index_scan is a bug that should be fixed, with the fix backpatched to Postgres 17. Right now, we see completely different counts for a parallel index scan, compared to an equivalent serial index scan -- differences that cannot be explained as minor differences caused by parallel scan implementation details. I think that it's just wrong right now, on master, since we're simply not counting the thing that we're supposed to be counting (not reliably, not if it's a parallel index scan). > - show_indexscan_nsearches does this: > > if (scanDesc && scanDesc->nsearches > 0) > ExplainPropertyUInteger("Index Searches", NULL, > scanDesc->nsearches, es); > > But shouldn't it divide the count by nloops, similar to (for example) > show_instrumentation_count? I can see arguments for and against doing it that way. It's ambiguous/subjective, but on balance I favor not dividing by nloops. You can make a similar argument for doing this with "Buffers: ", and yet we don't divide by nloops there, either. Honestly, I just want to find a way to do this that everybody can live with. Better documentation could help here. > 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch > > - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation > rather weird, I'd just call it "oppositeDirCheck". Will fix. > 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch > > - nothing Great. I think that I should be able to commit this one soon, since it's independently useful work. > 4) v6-0004-Add-skip-scan-to-nbtree.patch > > - indices.sgml seems to hahve typo "Intevening" -> "Intervening" > > - It doesn't seem like a good idea to remove the paragraph about > multicolumn indexes and replace it with just: > > Multicolumn indexes should be used judiciously. > > I mean, what does judiciously even mean? what should the user consider > to be judicious? Seems rather unclear to me. Admittedly, the old text > was not much helpful, but at least it gave some advice. Yeah, this definitely needs more work. > But maybe more importantly, doesn't skipscan apply only to a rather > limited subset of data types (that support increment/decrement)? Doesn't > the new wording mostly ignore that, implying skipscan applies to all > btree indexes? I don't think it mentions datatypes anywhere, but there > are many indexes on data types like text, UUID and so on. Actually, no, skip scan works in almost the same way with all data types. Earlier versions of the patch didn't support every data type (perhaps I should have waited for that before posting my v1), but the version of the patch you looked at has no restrictions on any data type. You must be thinking of whether or not an opclass has skip support. That's just an extra optimization, which can be used for a small handful of discrete data types such as integer and date (hard to imagine how skip support could ever be implemented for types like numeric and text). There is a temporary testing GUC that will allow you to get a sense of how much skip support can help: try "set skipscan_skipsupport_enabled=off" with (say) my original MDAM test query to get a sense of that. You'll see more buffer hits needed for "next key probes", though not dramatically more. It's worth having skip support (the idea comes from the MDAM paper), but it's not essential. Whether or not an opclass has skip support isn't accounted for by the cost model, but I doubt that it's worth addressing (the cost model is already pessimistic). > - Very subjective nitpicking, but I find it a bit strange when a comment > about a block is nested in the block, like in _bt_first() for the > array->null_elem check. Will fix. > - assignProcTypes() claims providing skipscan for cross-type scenarios > doesn't make sense. Why is that? I'm not saying the claim is wrong, but > it's not clear to me why would that be the case. It is just talking about the support function that skip scan can optionally use, where it makes sense (skip support functions). The relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is largely copied from the existing nearby "else if (member->number == BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In both stanzas we're talking about a capability that maps to a particular "input opclass", which means the opclass that maps to the datums that are stored on disk, in index tuples. There are no restrictions on the use of skip scan with queries that happen to involve the use of cross-type operators. It doesn't even matter if we happen to be using an incomplete opfamily, since range skip arrays never need to *directly* take the current array element from a lower/upper bound inequality scan key's argument. It all happens indirectly: code in places like _bt_first and _bt_checkkeys can use inequalities (which are stored in BTArrayKeyInfo.low_compare and BTArrayKeyInfo.high_compare) to locate the next matching on-disk index tuple that satisfies the inequality in question. Obviously, the located datum must be the same type as the one used by the array and its scan key (it has to be the input opclass type if it's taken from an index tuple). I think that it's a bit silly that nbtree generally bends over backwards to find a way to execute a scan, given an incomplete opfamily; in a green field situation it would make sense to just throw an error instead. Even still, skip scan works in a way that is maximally forgiving when incomplete opfamilies are used. Admittedly, it is just about possible to come up with a scenario where we'll now throw an error for a query that would have worked on Postgres 17. But that's no different to what would happen if the query had an explicit "= any( )" non-cross-type array instead of an implicit non-cross-type skip array. The real problem in these scenarios is the lack of a suitable cross-type ORDER proc (for a cross-type-operator query) within _bt_first -- not the lack of cross-type operators. This issue with missing ORDER procs just doesn't seem worth worrying about, since, as I said, even slightly different queries (that don't use skip scan) are bound to throw the same errors either way. > Peter asked me to look at the costing, and I think it looks generally > sensible. I'm glad that you think that I basically have the right idea here. Hard to know how to approach something like this, which doesn't have any kind of precedent to draw on. > We don't really have a lot of information to base the costing > on in the first place - the whole point of skipscan is about multicolumn > indexes, but none of the existing extended statistic seems very useful. > We'd need some cross-column correlation info, or something like that. Maybe, but that would just mean that we'd sometimes be more optimistic about skip scan helping than we are with the current approach of pessimistically assuming that there is no correlation at all. Not clear that being pessimistic in this sense isn't the right thing to do, despite the fact that it's clearly less accurate on average. > There's one thing that I don't quite understand, and that's how > btcost_correlation() adjusts correlation for multicolumn indexes: > > if (index->nkeycolumns > 1) > indexCorrelation = varCorrelation * 0.75; > > That seems fine for a two-column index, I guess. But shouldn't it > compound for indexes with more keys? I mean, 0.75 * 0.75 for third > column, etc? I don't think btcostestimate() does that, it just remembers > whatever btcost_correlation() returns. I don't know either. In general I'm out of my comfort zone here. > The only alternative approach I can think of is not to adjust the > costing for the index scan at all, and only use this to enable (or not > enable) the skipscan internally. That would mean the overall plan > remains the same, and maybe sometimes we would think an index scan would > be too expensive and use something else. Not great, but it doesn't have > the risk of regressions - IIUC we can disable the skipscan at runtime, > if we realize it's not really helpful. In general I would greatly prefer to not have a distinct kind of index path for scans that use skip scan. I'm quite keen on a design that allows the scan to adapt to unpredictable conditions at runtime. Of course, that doesn't preclude passing the index scan a hint about what's likely to work at runtime, based on information figured out when costing the scan. Perhaps that will prove necessary to avoid regressing index scans that are naturally quite cheap already -- scans where we really need to have the right general idea from the start to avoid any regressions. I'm not opposed to that, provided the index scan has the ability to change its mind when (for whatever reason) the guidance from the optimizer turns out to be wrong. > As usual, I wrote a bash script to do a bit of stress testing. It > generates tables with random data, and then runs random queries with > random predicates on them, while mutating a couple parameters (like > number of workers) to trigger different plans. It does that on 16, > master and with the skipscan patch (with the fix for parallel scans). I wonder if some of the regressions you see can be tied to the use of an LWLock in place of the existing use of a spin lock. I did that because I sometimes need to allocate memory to deserialize the array keys, with the exclusive lock held. It might be the case that a lot of these regressions are tied to that, or something else that is far from obvious...have to investigate. In general, I haven't done much on parallel index scans here (I only added support for them very recently), whereas your testing places a lot of emphasis on parallel scans. Nothing wrong with that emphasis (it caught that 17 bug), but just want to put it in context. > I've uploaded the script and results from the last run here: > > https://github.com/tvondra/pg-skip-scan-tests > > There's the "run-mdam.sh" script that generates tables/queries, runs > them, collects all kinds of info about the query, and produces files > with explain plans, CSV with timings, etc. It'll take me a while to investigate all this data. > Anyway, I ran a couple thousand such queries, and I haven't found any > incorrect results (the script compares that between versions too). So > that's good ;-) That's good! > But my main goal was to see how this affects performance. The tables > were pretty small (just 1M rows, maybe ~80MB), but with restarts and > dropping caches, large enough to test this. The really compelling cases all tend to involve fairly selective index scans. Obviously, skip scan can only save work by navigating the index structure more efficiently (unlike loose index scan). So if the main cost is inherently bound to be the cost of heap accesses, then we shouldn't expect a big speed up. > For example, one of the slowed down queries is query 702 (top of page 8 > in the PDF). The query is pretty simple: > > explain (analyze, timing off, buffers off) > select id1,id2 from t_1000000_1000_1_2 > where NOT (id1 in (:list)) AND (id2 = :value); > > and it was executed on a table with random data in two columns, each > with 1000 distinct values. This is perfectly random data, so a great > match for the assumptions in costing etc. > > But with uncached data, this runs in ~50 ms on master, but takes almost > 200 ms with skipscan (these timings are from my laptop, but similar to > the results). I'll need to investigate this specifically. That does seem odd. FWIW, it's a pity that the patch doesn't know how to push down the NOT IN () here. The MDAM paper contemplates such a scheme. We see the use of filter quals here, when in principle this could work by using a skip array that doesn't generate elements that appear in the NOT IN() list (it'd generate every possible indexable value *except* the given list/array values). The only reason that I haven't implemented this yet is because I'm not at all sure how to make it work on the optimizer side. The nbtree side of the implementation will probably be quite straightforward, since it's really just a slight variant of a skip array, that excludes certain values. > -- with skipscan > Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on > t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16) > (actual rows=811 loops=1) > Index Cond: (id2 = 997) > Index Searches: 1007 > Filter: (id1 <> ALL ('{983,...,640}'::bigint[])) > Rows Removed by Filter: 163 > Heap Fetches: 0 > Planning Time: 3.730 ms > Execution Time: 238.554 ms > (8 rows) > > I haven't looked into why this is happening, but this seems like a > pretty good match for skipscan (on the first column). And for the > costing too - it's perfectly random data, no correllation, etc. I wonder what "Buffers: N" shows? That's usually the first thing I look at (that and "Index Searches", which looks like what you said it should look like here). But, yeah, let me get back to you on this. Thanks again! -- Peter Geoghegan
On 9/18/24 00:14, Peter Geoghegan wrote: > On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas@vondra.me> wrote: >> I've been looking at this patch over the couple last days, mostly doing >> some stress testing / benchmarking (hence the earlier report) and basic >> review. > > Thanks for taking a look! Very helpful. > >> I do have some initial review comments, and the testing produced >> some interesting regressions (not sure if those are the cases where >> skipscan can't really help, that Peter mentioned he needs to look into). > > The one type of query that's clearly regressed in a way that's just > not acceptable are queries where we waste CPU cycles during scans > where it's truly hopeless. For example, I see a big regression on one > of the best cases for the Postgres 17 work, described here: > > https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement > > Notably, these cases access exactly the same buffers/pages as before, > so this really isn't a matter of "doing too much skipping". The number > of buffers hit exactly matches what you'll see on Postgres 17. It's > just that we waste too many CPU cycles in code such as > _bt_advance_array_keys, to uselessly maintain skip arrays. > > I'm not suggesting that there won't be any gray area with these > regressions -- nothing like this will ever be that simple. But it > seems to me like I should go fix these obviously-not-okay cases next, > and then see where that leaves everything else, regressions-wise. That > seems likely to be the most efficient way of dealing with the > regressions. So I'll start there. > > That said, I *would* be surprised if you found a regression in any > query that simply didn't receive any new scan key transformations in > new preprocessing code in places like _bt_decide_skipatts and > _bt_skip_preproc_shrink. I see that many of the queries that you're > using for your stress-tests "aren't really testing skip scan", in this > sense. But I'm hardly about to tell you that you shouldn't spend time > on such queries -- that approach just discovered a bug affecting > Postgres 17 (that was also surprising, but it still happened!). My > point is that it's worth being aware of which test queries actually > use skip arrays in the first place -- it might help you with your > testing. There are essentially no changes to _bt_advance_array_keys > that'll affect traditional SAOP arrays (with the sole exception of > changes made by > v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which > affect every kind of array in the same way). > Makes sense. I started with the testing before before even looking at the code, so it's mostly a "black box" approach. I did read the 1995 paper before that, and the script generates queries with clauses inspired by that paper, in particular: - col = $value - col IN ($values) - col BETWEEN $value AND $value - NOT (clause) - clause [AND|OR] clause There certainly may be gaps and interesting cases the script does not cover. Something to improve. >> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch >> >> - I find the places that increment "nsearches" a bit random. Each AM >> does it in entirely different place (at least it seems like that to me). >> Is there a way make this a bit more consistent? > > From a mechanical perspective there is nothing at all random about it: > we do this at precisely the same point that we currently call > pgstat_count_index_scan, which in each index AM maps to one descent of > the index. It is at least consistent. Whenever a B-Tree index scan > shows "Index Scans: N", you'll see precisely the same number by > swapping it with an equivalent contrib/btree_gist-based GiST index and > running the same query again (assuming that index tuples that match > the array keys are spread apart in both the B-Tree and GiST indexes). > > (Though I see problems with the precise place that nbtree calls > pgstat_count_index_scan right now, at least in certain edge-cases, > which I discuss below in response to your questions about that.) > OK, understood. FWIW I'm not saying these places are "wrong", just that it feels each AM does that in a very different place. >> uint64 btps_nsearches; /* instrumentation */ >> >> Instrumentation what? What's the counter for? > > Will fix. > > In case you missed it, there is another thread + CF Entry dedicated to > discussing this instrumentation patch: > > https://commitfest.postgresql.org/49/5183/ > https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com > Thanks, I wasn't aware of that. >> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that >> mean we skip it if the earlier code does "goto readcomplete"? Shouldn't >> that still count as an index scan? > > In my opinion, no, it should not. > > We're counting the number of times we'll have descended the tree using > _bt_search (or using _bt_endpoint, perhaps), which is a precisely > defined physical cost. A little like counting the number of buffers > accessed. I actually think that this aspect of how we call > pgstat_count_index_scan is a bug that should be fixed, with the fix > backpatched to Postgres 17. Right now, we see completely different > counts for a parallel index scan, compared to an equivalent serial > index scan -- differences that cannot be explained as minor > differences caused by parallel scan implementation details. I think > that it's just wrong right now, on master, since we're simply not > counting the thing that we're supposed to be counting (not reliably, > not if it's a parallel index scan). > OK, understood. If it's essentially an independent issue (perhaps even counts as a bug?) what about correcting it on master first? Doesn't sound like something we'd backpatch, I guess. >> - show_indexscan_nsearches does this: >> >> if (scanDesc && scanDesc->nsearches > 0) >> ExplainPropertyUInteger("Index Searches", NULL, >> scanDesc->nsearches, es); >> >> But shouldn't it divide the count by nloops, similar to (for example) >> show_instrumentation_count? > > I can see arguments for and against doing it that way. It's > ambiguous/subjective, but on balance I favor not dividing by nloops. > You can make a similar argument for doing this with "Buffers: ", and > yet we don't divide by nloops there, either. > > Honestly, I just want to find a way to do this that everybody can live > with. Better documentation could help here. > Seems like a bit of a mess. IMHO we should either divide everything by nloops (so that everything is "per loop", or not divide anything. My vote would be to divide, but that's mostly my "learned assumption" from the other fields. But having a 50:50 split is confusing for everyone. >> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch >> >> - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation >> rather weird, I'd just call it "oppositeDirCheck". > > Will fix. > >> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch >> >> - nothing > > Great. I think that I should be able to commit this one soon, since > it's independently useful work. > +1 >> 4) v6-0004-Add-skip-scan-to-nbtree.patch >> >> - indices.sgml seems to hahve typo "Intevening" -> "Intervening" >> >> - It doesn't seem like a good idea to remove the paragraph about >> multicolumn indexes and replace it with just: >> >> Multicolumn indexes should be used judiciously. >> >> I mean, what does judiciously even mean? what should the user consider >> to be judicious? Seems rather unclear to me. Admittedly, the old text >> was not much helpful, but at least it gave some advice. > > Yeah, this definitely needs more work. > >> But maybe more importantly, doesn't skipscan apply only to a rather >> limited subset of data types (that support increment/decrement)? Doesn't >> the new wording mostly ignore that, implying skipscan applies to all >> btree indexes? I don't think it mentions datatypes anywhere, but there >> are many indexes on data types like text, UUID and so on. > > Actually, no, skip scan works in almost the same way with all data > types. Earlier versions of the patch didn't support every data type > (perhaps I should have waited for that before posting my v1), but the > version of the patch you looked at has no restrictions on any data > type. > > You must be thinking of whether or not an opclass has skip support. > That's just an extra optimization, which can be used for a small > handful of discrete data types such as integer and date (hard to > imagine how skip support could ever be implemented for types like > numeric and text). There is a temporary testing GUC that will allow > you to get a sense of how much skip support can help: try "set > skipscan_skipsupport_enabled=off" with (say) my original MDAM test > query to get a sense of that. You'll see more buffer hits needed for > "next key probes", though not dramatically more. > > It's worth having skip support (the idea comes from the MDAM paper), > but it's not essential. Whether or not an opclass has skip support > isn't accounted for by the cost model, but I doubt that it's worth > addressing (the cost model is already pessimistic). > I admit I'm a bit confused. I probably need to reread the paper, but my impression was that the increment/decrement is required for skipscan to work. If we can't do that, how would it generate the intermediate values to search for? I imagine it would be possible to "step through" the index, but I thought the point of skip scan is to not do that. Anyway, probably a good idea for extending the stress testing script. Right now it tests with "bigint" columns only. >> - Very subjective nitpicking, but I find it a bit strange when a comment >> about a block is nested in the block, like in _bt_first() for the >> array->null_elem check. > > Will fix. > >> - assignProcTypes() claims providing skipscan for cross-type scenarios >> doesn't make sense. Why is that? I'm not saying the claim is wrong, but >> it's not clear to me why would that be the case. > > It is just talking about the support function that skip scan can > optionally use, where it makes sense (skip support functions). The > relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is > largely copied from the existing nearby "else if (member->number == > BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In > both stanzas we're talking about a capability that maps to a > particular "input opclass", which means the opclass that maps to the > datums that are stored on disk, in index tuples. > > There are no restrictions on the use of skip scan with queries that > happen to involve the use of cross-type operators. It doesn't even > matter if we happen to be using an incomplete opfamily, since range > skip arrays never need to *directly* take the current array element > from a lower/upper bound inequality scan key's argument. It all > happens indirectly: code in places like _bt_first and _bt_checkkeys > can use inequalities (which are stored in BTArrayKeyInfo.low_compare > and BTArrayKeyInfo.high_compare) to locate the next matching on-disk > index tuple that satisfies the inequality in question. Obviously, the > located datum must be the same type as the one used by the array and > its scan key (it has to be the input opclass type if it's taken from > an index tuple). > > I think that it's a bit silly that nbtree generally bends over > backwards to find a way to execute a scan, given an incomplete > opfamily; in a green field situation it would make sense to just throw > an error instead. Even still, skip scan works in a way that is > maximally forgiving when incomplete opfamilies are used. Admittedly, > it is just about possible to come up with a scenario where we'll now > throw an error for a query that would have worked on Postgres 17. But > that's no different to what would happen if the query had an explicit > "= any( )" non-cross-type array instead of an implicit non-cross-type > skip array. The real problem in these scenarios is the lack of a > suitable cross-type ORDER proc (for a cross-type-operator query) > within _bt_first -- not the lack of cross-type operators. This issue > with missing ORDER procs just doesn't seem worth worrying about, > since, as I said, even slightly different queries (that don't use skip > scan) are bound to throw the same errors either way. > OK. Thanks for the explanation. I'll think about maybe testing such queries too (with cross-type clauses). >> Peter asked me to look at the costing, and I think it looks generally >> sensible. > > I'm glad that you think that I basically have the right idea here. > Hard to know how to approach something like this, which doesn't have > any kind of precedent to draw on. > >> We don't really have a lot of information to base the costing >> on in the first place - the whole point of skipscan is about multicolumn >> indexes, but none of the existing extended statistic seems very useful. >> We'd need some cross-column correlation info, or something like that. > > Maybe, but that would just mean that we'd sometimes be more optimistic > about skip scan helping than we are with the current approach of > pessimistically assuming that there is no correlation at all. Not > clear that being pessimistic in this sense isn't the right thing to > do, despite the fact that it's clearly less accurate on average. > Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming no correlation means pessimistic skipscan costing) in a comment before btcostestimate, or somewhere close. >> There's one thing that I don't quite understand, and that's how >> btcost_correlation() adjusts correlation for multicolumn indexes: >> >> if (index->nkeycolumns > 1) >> indexCorrelation = varCorrelation * 0.75; >> >> That seems fine for a two-column index, I guess. But shouldn't it >> compound for indexes with more keys? I mean, 0.75 * 0.75 for third >> column, etc? I don't think btcostestimate() does that, it just remembers >> whatever btcost_correlation() returns. > > I don't know either. In general I'm out of my comfort zone here. > Don't we do something similar elsewhere? For example, IIRC we do some adjustments when estimating grouping in estimate_num_groups(), and incremental sort had to deal with something similar too. Maybe we could learn something from those places ... (both from the good and bad experiences). >> The only alternative approach I can think of is not to adjust the >> costing for the index scan at all, and only use this to enable (or not >> enable) the skipscan internally. That would mean the overall plan >> remains the same, and maybe sometimes we would think an index scan would >> be too expensive and use something else. Not great, but it doesn't have >> the risk of regressions - IIUC we can disable the skipscan at runtime, >> if we realize it's not really helpful. > > In general I would greatly prefer to not have a distinct kind of index > path for scans that use skip scan. I'm quite keen on a design that > allows the scan to adapt to unpredictable conditions at runtime. > Right. I don't think I've been suggesting having a separate path, I 100% agree it's better to have this as an option for index scan paths. > Of course, that doesn't preclude passing the index scan a hint about > what's likely to work at runtime, based on information figured out > when costing the scan. Perhaps that will prove necessary to avoid > regressing index scans that are naturally quite cheap already -- scans > where we really need to have the right general idea from the start to > avoid any regressions. I'm not opposed to that, provided the index > scan has the ability to change its mind when (for whatever reason) the > guidance from the optimizer turns out to be wrong. > +1 (assuming it's feasible, given the amount of available information) >> As usual, I wrote a bash script to do a bit of stress testing. It >> generates tables with random data, and then runs random queries with >> random predicates on them, while mutating a couple parameters (like >> number of workers) to trigger different plans. It does that on 16, >> master and with the skipscan patch (with the fix for parallel scans). > > I wonder if some of the regressions you see can be tied to the use of > an LWLock in place of the existing use of a spin lock. I did that > because I sometimes need to allocate memory to deserialize the array > keys, with the exclusive lock held. It might be the case that a lot of > these regressions are tied to that, or something else that is far from > obvious...have to investigate. > > In general, I haven't done much on parallel index scans here (I only > added support for them very recently), whereas your testing places a > lot of emphasis on parallel scans. Nothing wrong with that emphasis > (it caught that 17 bug), but just want to put it in context. > Sure. With this kind of testing I don't know what I'm looking for, so I try to cover very wide range of cases. Inevitably, some of the cases will not test the exact subject of the patch. I think it's fine. >> I've uploaded the script and results from the last run here: >> >> https://github.com/tvondra/pg-skip-scan-tests >> >> There's the "run-mdam.sh" script that generates tables/queries, runs >> them, collects all kinds of info about the query, and produces files >> with explain plans, CSV with timings, etc. > > It'll take me a while to investigate all this data. > I think it'd help if I go through the results and try to prepare some reproducers, to make it easier for you. After all, it's my script and you'd have to reverse engineer some of it. >> Anyway, I ran a couple thousand such queries, and I haven't found any >> incorrect results (the script compares that between versions too). So >> that's good ;-) > > That's good! > >> But my main goal was to see how this affects performance. The tables >> were pretty small (just 1M rows, maybe ~80MB), but with restarts and >> dropping caches, large enough to test this. > > The really compelling cases all tend to involve fairly selective index > scans. Obviously, skip scan can only save work by navigating the index > structure more efficiently (unlike loose index scan). So if the main > cost is inherently bound to be the cost of heap accesses, then we > shouldn't expect a big speed up. > >> For example, one of the slowed down queries is query 702 (top of page 8 >> in the PDF). The query is pretty simple: >> >> explain (analyze, timing off, buffers off) >> select id1,id2 from t_1000000_1000_1_2 >> where NOT (id1 in (:list)) AND (id2 = :value); >> >> and it was executed on a table with random data in two columns, each >> with 1000 distinct values. This is perfectly random data, so a great >> match for the assumptions in costing etc. >> >> But with uncached data, this runs in ~50 ms on master, but takes almost >> 200 ms with skipscan (these timings are from my laptop, but similar to >> the results). > > I'll need to investigate this specifically. That does seem odd. > > FWIW, it's a pity that the patch doesn't know how to push down the NOT > IN () here. The MDAM paper contemplates such a scheme. We see the use > of filter quals here, when in principle this could work by using a > skip array that doesn't generate elements that appear in the NOT IN() > list (it'd generate every possible indexable value *except* the given > list/array values). The only reason that I haven't implemented this > yet is because I'm not at all sure how to make it work on the > optimizer side. The nbtree side of the implementation will probably be > quite straightforward, since it's really just a slight variant of a > skip array, that excludes certain values. > >> -- with skipscan >> Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on >> t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16) >> (actual rows=811 loops=1) >> Index Cond: (id2 = 997) >> Index Searches: 1007 >> Filter: (id1 <> ALL ('{983,...,640}'::bigint[])) >> Rows Removed by Filter: 163 >> Heap Fetches: 0 >> Planning Time: 3.730 ms >> Execution Time: 238.554 ms >> (8 rows) >> >> I haven't looked into why this is happening, but this seems like a >> pretty good match for skipscan (on the first column). And for the >> costing too - it's perfectly random data, no correllation, etc. > > I wonder what "Buffers: N" shows? That's usually the first thing I > look at (that and "Index Searches", which looks like what you said it > should look like here). But, yeah, let me get back to you on this. > Yeah, I forgot to get that from my reproducer. But the logs in the github repo with results has BUFFERS - for master (SEQ 12621), the plan looks like this: Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on t_1000000_1000_1_2 (cost=0.96..12179.41 rows=785 width=16) (actual rows=785 loops=1) Index Cond: (id2 = 997) Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[])) Rows Removed by Filter: 181 Heap Fetches: 0 Buffers: shared read=3094 Planning: Buffers: shared hit=93 read=27 Planning Time: 9.962 ms Execution Time: 38.007 ms (10 rows) and with the patch (SEQ 12623) it's this: Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on t_1000000_1000_1_2 (cost=0.96..1745.27 rows=784 width=16) (actual rows=785 loops=1) Index Cond: (id2 = 997) Index Searches: 1002 Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[])) Rows Removed by Filter: 181 Heap Fetches: 0 Buffers: shared hit=1993 read=1029 Planning: Buffers: shared hit=93 read=27 Planning Time: 9.506 ms Execution Time: 179.048 ms (11 rows) This is on exactly the same data, after dropping caches and restarting the instance. So there should be no caching effects. Yet, there's a pretty clear difference - the total number of buffers is the same, but the patched version has many more hits. Yet it's slower. Weird, right? regards -- Tomas Vondra
On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas@vondra.me> wrote: > Makes sense. I started with the testing before before even looking at > the code, so it's mostly a "black box" approach. I did read the 1995 > paper before that, and the script generates queries with clauses > inspired by that paper, in particular: I think that this approach with black box testing is helpful, but also something to refine over time. Gray box testing might work best. > OK, understood. If it's essentially an independent issue (perhaps even > counts as a bug?) what about correcting it on master first? Doesn't > sound like something we'd backpatch, I guess. What about backpatching it to 17? As things stand, you can get quite contradictory counts of the number of index scans due to irrelevant implementation details from parallel index scan. It just looks wrong, particularly on 17, where it is reasonable to expect near exact consistency between parallel and serial scans of the same index. > Seems like a bit of a mess. IMHO we should either divide everything by > nloops (so that everything is "per loop", or not divide anything. My > vote would be to divide, but that's mostly my "learned assumption" from > the other fields. But having a 50:50 split is confusing for everyone. My idea was that it made most sense to follow the example of "Buffers:", since both describe physical costs. Honestly, I'm more than ready to take whatever the path of least resistance is. If dividing by nloops is what people want, I have no objections. > > It's worth having skip support (the idea comes from the MDAM paper), > > but it's not essential. Whether or not an opclass has skip support > > isn't accounted for by the cost model, but I doubt that it's worth > > addressing (the cost model is already pessimistic). > > > > I admit I'm a bit confused. I probably need to reread the paper, but my > impression was that the increment/decrement is required for skipscan to > work. If we can't do that, how would it generate the intermediate values > to search for? I imagine it would be possible to "step through" the > index, but I thought the point of skip scan is to not do that. I think that you're probably still a bit confused because the terminology in this area is a little confusing. There are two ways of explaining the situation with types like text and numeric (types that lack skip support). The two explanations might seem to be contradictory, but they're really not, if you think about it. The first way of explaining it, which focuses on how the scan moves through the index: For a text index column "a", and an int index column "b", skip scan will work like this for a query with a qual "WHERE b = 55": 1. Find the first/lowest sorting "a" value in the index. Let's say that it's "Aardvark". 2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly returning some matches. 3. Find the next value after "Aardvark" in the index using a probe like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say that it turns out to be "Abacus". 4. Look for matches "WHERE a = 'Abacus' and b = 55"... ... (repeat these steps until we've exhaustively processed every existing "a" value in the index)... The second way of explaining it, which focuses on how the skip arrays advance. Same query (and really the same behavior) as in the first explanation: 1. Skip array's initial value is the sentinel -inf, which cannot possibly match any real index tuple, but can still guide the search. So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't include the "b = 55" part, since it is unnecessary, but conceptually it's a part of what we search for within _bt_first). 2. Find that the index has no "a" values matching -inf (it inevitably cannot have any matches for -inf), but we do locate the next highest match. The closest matching value is "Aardvark". The skip array on "a" therefore advances from -inf to "Aardvark". 3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly returning some matches. 4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b = 55", which will cause us to advance the array on "a" incrementally inside _bt_advance_array_keys (just like it would if there was a standard SAOP array on "a" instead). The skip array on "a" therefore advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use sentinel values for this text column, which lacks skip support). 5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55", which cannot possibly find matches, but, again, can reposition the scan as needed. We can't find an exact match, of course, but we do locate the next closest match -- which is "Abacus", again. So the skip array now advances from "Aardvark" +infinitesimal to "Abacus". The sentinel values are made up values, but that doesn't change anything. (And, again, we don't include the "b = 55" part here, for the same reason as before.) 6. Look for matches "WHERE a = 'Abacus' and b = 55"... ...(repeat these steps as many times as required)... In summary: Even index columns that lack skip support get to "increment" (or "decrement") their arrays by using sentinel values that represent -inf (or +inf for backwards scans), as well as sentinels that represent concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal for backwards scans, say). This scheme sounds contradictory, because in one sense it allows every skip array to be incremented, but in another sense it makes it okay that we don't have a type-specific way to increment values for many individual types/opclasses. Inventing these sentinel values allows _bt_advance_array_keys to reason about arrays without really having to care about which kinds of arrays are involved, their order relative to each other, etc. In a certain sense, we don't really need explicit "next key" probes of the kind that the MDAM paper contemplates, though we do still require the same index accesses as a design with explicit accesses. Does that make sense? Obviously, if we did add skip support for text, it would be very unlikely to help performance. Sure, one can imagine incrementing from "Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but that isn't very helpful. You're almost certain to end up accessing the same pages with such a scheme, anyway. What are the chances of an index with a leading text column actually containing tuples matching (say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically zero. Whereas if the column "a" happens to use a discrete type such as integer or date, then skip support is likely to help: there's a decent chance that a value generated by incrementing the last value (and I mean incrementing it for real) will find a real match when combined with the user-supplied "b" predicate. It might be possible to add skip support for text, but there wouldn't be much point. > Anyway, probably a good idea for extending the stress testing script. > Right now it tests with "bigint" columns only. Good idea. > Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming > no correlation means pessimistic skipscan costing) in a comment before > btcostestimate, or somewhere close. Will do. > Don't we do something similar elsewhere? For example, IIRC we do some > adjustments when estimating grouping in estimate_num_groups(), and > incremental sort had to deal with something similar too. Maybe we could > learn something from those places ... (both from the good and bad > experiences). I'll make a note of that. Gonna focus on regressions for now. > Right. I don't think I've been suggesting having a separate path, I 100% > agree it's better to have this as an option for index scan paths. Cool. > Sure. With this kind of testing I don't know what I'm looking for, so I > try to cover very wide range of cases. Inevitably, some of the cases > will not test the exact subject of the patch. I think it's fine. I agree. Just wanted to make sure that we were on the same page. > I think it'd help if I go through the results and try to prepare some > reproducers, to make it easier for you. After all, it's my script and > you'd have to reverse engineer some of it. Yes, that would be helpful. I'll probably memorialize the problem by writing my own minimal test case for it. I'm using the same TDD approach for this project as was used for the related Postgres 17 project. > This is on exactly the same data, after dropping caches and restarting > the instance. So there should be no caching effects. Yet, there's a > pretty clear difference - the total number of buffers is the same, but > the patched version has many more hits. Yet it's slower. Weird, right? Yes, it's weird. It seems likely that you've found an unambiguous bug, not just a "regular" performance regression. The regressions that I already know about aren't nearly this bad. So it seems like you have the right general idea about what to expect, and it seems like your approach to testing the patch is effective. -- Peter Geoghegan
On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas@vondra.me> wrote: > For example, one of the slowed down queries is query 702 (top of page 8 > in the PDF). The query is pretty simple: > > explain (analyze, timing off, buffers off) > select id1,id2 from t_1000000_1000_1_2 > where NOT (id1 in (:list)) AND (id2 = :value); > > and it was executed on a table with random data in two columns, each > with 1000 distinct values. I cannot recreate this problem using the q702.sql repro you provided. Feels like I'm missing a step, because I find that skip scan wins nicely here. > This is perfectly random data, so a great > match for the assumptions in costing etc. FWIW, I wouldn't say that this is a particularly sympathetic case for skip scan. It's definitely still a win, but less than other cases I can imagine. This is due to the relatively large number of rows returned by the scan. Plus 1000 distinct leading values for a skip array isn't all that low, so we end up scanning over 1/3 of all of the leaf pages in the index. BTW, be careful to distinguish between leaf pages and internal pages when interpreting "Buffers:" output with the patch. Generally speaking, the patch repeats many internal page accesses, which needs to be taken into account when compare "Buffers:" counts against master. It's not uncommon for 3/4 or even 4/5 of all index page hits to be for internal pages with the patch. Whereas on master the number of internal page hits is usually tiny. This is one reason why the additional context provided by "Index Searches:" can be helpful. > But with uncached data, this runs in ~50 ms on master, but takes almost > 200 ms with skipscan (these timings are from my laptop, but similar to > the results). Even 50ms seems really slow for your test case -- with or without my patch applied. Are you sure that this wasn't an assert-enabled build? There's lots of extra assertions for the code paths used by skip scan for this, which could explain the apparent regression. I find that this same query takes only ~2.056 ms with the patch. When I disabled skip scan locally via "set skipscan_prefix_cols = 0" (which should give me behavior that's pretty well representative of master), it takes ~12.039 ms. That's exactly what I'd expect for this query: a solid improvement, though not the really enormous ones that you'll see when skip scan is able to avoid reading many of the index pages that master reads. -- Peter Geoghegan
On 9/19/24 21:22, Peter Geoghegan wrote: > On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas@vondra.me> wrote: >> For example, one of the slowed down queries is query 702 (top of page 8 >> in the PDF). The query is pretty simple: >> >> explain (analyze, timing off, buffers off) >> select id1,id2 from t_1000000_1000_1_2 >> where NOT (id1 in (:list)) AND (id2 = :value); >> >> and it was executed on a table with random data in two columns, each >> with 1000 distinct values. > > I cannot recreate this problem using the q702.sql repro you provided. > Feels like I'm missing a step, because I find that skip scan wins > nicely here. > I don't know, I can reproduce it just fine. I just tried with v7. What I do is this: 1) build master and patched versions: ./configure --enable-depend --prefix=/mnt/data/builds/$(build}/ make -s clean make -s -j4 install 2) create a new cluster (default config), create DB, generate the data 3) restart cluster, drop caches 4) run the query from the SQL script I suspect you don't do (3). I didn't mention this explicitly, my message only said "with uncached data", so maybe that's the problem? >> This is perfectly random data, so a great >> match for the assumptions in costing etc. > > FWIW, I wouldn't say that this is a particularly sympathetic case for > skip scan. It's definitely still a win, but less than other cases I > can imagine. This is due to the relatively large number of rows > returned by the scan. Plus 1000 distinct leading values for a skip > array isn't all that low, so we end up scanning over 1/3 of all of the > leaf pages in the index. > I wasn't suggesting it's a sympathetic case for skipscan. My point is that it perfectly matches the costing assumptions, i.e. columns are independent etc. But if it's not sympathetic, maybe the cost shouldn't be 1/5 of cost from master? > BTW, be careful to distinguish between leaf pages and internal pages > when interpreting "Buffers:" output with the patch. Generally > speaking, the patch repeats many internal page accesses, which needs > to be taken into account when compare "Buffers:" counts against > master. It's not uncommon for 3/4 or even 4/5 of all index page hits > to be for internal pages with the patch. Whereas on master the number > of internal page hits is usually tiny. This is one reason why the > additional context provided by "Index Searches:" can be helpful. > Yeah, I recall there's an issue with that. >> But with uncached data, this runs in ~50 ms on master, but takes almost >> 200 ms with skipscan (these timings are from my laptop, but similar to >> the results). > > Even 50ms seems really slow for your test case -- with or without my > patch applied. > > Are you sure that this wasn't an assert-enabled build? There's lots of > extra assertions for the code paths used by skip scan for this, which > could explain the apparent regression. > > I find that this same query takes only ~2.056 ms with the patch. When > I disabled skip scan locally via "set skipscan_prefix_cols = 0" (which > should give me behavior that's pretty well representative of master), > it takes ~12.039 ms. That's exactly what I'd expect for this query: a > solid improvement, though not the really enormous ones that you'll see > when skip scan is able to avoid reading many of the index pages that > master reads. > I'm pretty sure you're doing this on cached data, because 2ms is exactly the timing I see in that case. regards -- Tomas Vondra
On 9/18/24 20:52, Peter Geoghegan wrote: > On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas@vondra.me> wrote: >> Makes sense. I started with the testing before before even looking at >> the code, so it's mostly a "black box" approach. I did read the 1995 >> paper before that, and the script generates queries with clauses >> inspired by that paper, in particular: > > I think that this approach with black box testing is helpful, but also > something to refine over time. Gray box testing might work best. > >> OK, understood. If it's essentially an independent issue (perhaps even >> counts as a bug?) what about correcting it on master first? Doesn't >> sound like something we'd backpatch, I guess. > > What about backpatching it to 17? > > As things stand, you can get quite contradictory counts of the number > of index scans due to irrelevant implementation details from parallel > index scan. It just looks wrong, particularly on 17, where it is > reasonable to expect near exact consistency between parallel and > serial scans of the same index. > Yes, I think backpatching to 17 would be fine. I'd be worried about maybe disrupting some monitoring in production systems, but for 17 that shouldn't be a problem yet. So fine with me. FWIW I wonder how likely is it that someone has some sort of alerting tied to this counter. I'd bet few people do. It's probably more about a couple people looking at explain plans, but they'll be confused even if we change that only starting with 17. >> Seems like a bit of a mess. IMHO we should either divide everything by >> nloops (so that everything is "per loop", or not divide anything. My >> vote would be to divide, but that's mostly my "learned assumption" from >> the other fields. But having a 50:50 split is confusing for everyone. > > My idea was that it made most sense to follow the example of > "Buffers:", since both describe physical costs. > > Honestly, I'm more than ready to take whatever the path of least > resistance is. If dividing by nloops is what people want, I have no > objections. > I don't have a strong opinion on this. I just know I'd be confused by half the counters being total and half /loop, but chances are other people would disagree. >>> It's worth having skip support (the idea comes from the MDAM paper), >>> but it's not essential. Whether or not an opclass has skip support >>> isn't accounted for by the cost model, but I doubt that it's worth >>> addressing (the cost model is already pessimistic). >>> >> >> I admit I'm a bit confused. I probably need to reread the paper, but my >> impression was that the increment/decrement is required for skipscan to >> work. If we can't do that, how would it generate the intermediate values >> to search for? I imagine it would be possible to "step through" the >> index, but I thought the point of skip scan is to not do that. > > I think that you're probably still a bit confused because the > terminology in this area is a little confusing. There are two ways of > explaining the situation with types like text and numeric (types that > lack skip support). The two explanations might seem to be > contradictory, but they're really not, if you think about it. > > The first way of explaining it, which focuses on how the scan moves > through the index: > > For a text index column "a", and an int index column "b", skip scan > will work like this for a query with a qual "WHERE b = 55": > > 1. Find the first/lowest sorting "a" value in the index. Let's say > that it's "Aardvark". > > 2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly > returning some matches. > > 3. Find the next value after "Aardvark" in the index using a probe > like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say > that it turns out to be "Abacus". > > 4. Look for matches "WHERE a = 'Abacus' and b = 55"... > > ... (repeat these steps until we've exhaustively processed every > existing "a" value in the index)... Ah, OK. So we do probe the index like this. I was under the impression we don't do that. But yeah, this makes sense. > > The second way of explaining it, which focuses on how the skip arrays > advance. Same query (and really the same behavior) as in the first > explanation: > > 1. Skip array's initial value is the sentinel -inf, which cannot > possibly match any real index tuple, but can still guide the search. > So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't > include the "b = 55" part, since it is unnecessary, but conceptually > it's a part of what we search for within _bt_first). > > 2. Find that the index has no "a" values matching -inf (it inevitably > cannot have any matches for -inf), but we do locate the next highest > match. The closest matching value is "Aardvark". The skip array on "a" > therefore advances from -inf to "Aardvark". > > 3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly > returning some matches. > > 4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b = > 55", which will cause us to advance the array on "a" incrementally > inside _bt_advance_array_keys (just like it would if there was a > standard SAOP array on "a" instead). The skip array on "a" therefore > advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use > sentinel values for this text column, which lacks skip support). > > 5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55", > which cannot possibly find matches, but, again, can reposition the > scan as needed. We can't find an exact match, of course, but we do > locate the next closest match -- which is "Abacus", again. So the skip > array now advances from "Aardvark" +infinitesimal to "Abacus". The > sentinel values are made up values, but that doesn't change anything. > (And, again, we don't include the "b = 55" part here, for the same > reason as before.) > > 6. Look for matches "WHERE a = 'Abacus' and b = 55"... > > ...(repeat these steps as many times as required)... > Yeah, this makes more sense. Thanks. > In summary: > > Even index columns that lack skip support get to "increment" (or > "decrement") their arrays by using sentinel values that represent -inf > (or +inf for backwards scans), as well as sentinels that represent > concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal > for backwards scans, say). This scheme sounds contradictory, because > in one sense it allows every skip array to be incremented, but in > another sense it makes it okay that we don't have a type-specific way > to increment values for many individual types/opclasses. > > Inventing these sentinel values allows _bt_advance_array_keys to reason about > arrays without really having to care about which kinds of arrays are > involved, their order relative to each other, etc. In a certain sense, > we don't really need explicit "next key" probes of the kind that the > MDAM paper contemplates, though we do still require the same index > accesses as a design with explicit accesses. > > Does that make sense? > Yes, it does. Most of my confusion was caused by my belief that we can't probe the index for the next value without "incrementing" the current value, but that was a silly idea. > Obviously, if we did add skip support for text, it would be very > unlikely to help performance. Sure, one can imagine incrementing from > "Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but > that isn't very helpful. You're almost certain to end up accessing the > same pages with such a scheme, anyway. What are the chances of an > index with a leading text column actually containing tuples matching > (say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically > zero. Whereas if the column "a" happens to use a discrete type such as > integer or date, then skip support is likely to help: there's a decent > chance that a value generated by incrementing the last value > (and I mean incrementing it for real) will find a real match when > combined with the user-supplied "b" predicate. > > It might be possible to add skip support for text, but there wouldn't > be much point. > Stupid question - so why does it make sense for types like int? There can also be a lot of values between the current and the next value, so why would that be very different from "incrementing" a text value? > >> I think it'd help if I go through the results and try to prepare some >> reproducers, to make it easier for you. After all, it's my script and >> you'd have to reverse engineer some of it. > > Yes, that would be helpful. > > I'll probably memorialize the problem by writing my own minimal test > case for it. I'm using the same TDD approach for this project as was > used for the related Postgres 17 project. > Sure. Still, giving you a reproducer should make it easier . >> This is on exactly the same data, after dropping caches and restarting >> the instance. So there should be no caching effects. Yet, there's a >> pretty clear difference - the total number of buffers is the same, but >> the patched version has many more hits. Yet it's slower. Weird, right? > > Yes, it's weird. It seems likely that you've found an unambiguous bug, > not just a "regular" performance regression. The regressions that I > already know about aren't nearly this bad. So it seems like you have > the right general idea about what to expect, and it seems like your > approach to testing the patch is effective. > Yeah, it's funny. It's not the first time I start stress testing a patch only to stumble over some pre-existing issues ... ;-) regards -- Tomas Vondra
On Fri, Sep 20, 2024 at 9:45 AM Tomas Vondra <tomas@vondra.me> wrote: > 3) restart cluster, drop caches > > 4) run the query from the SQL script > > I suspect you don't do (3). I didn't mention this explicitly, my message > only said "with uncached data", so maybe that's the problem? You're right that I didn't do step 3 here. I'm generally in the habit of using fully cached data when testing this kind of work. The only explanation I can think of is that (at least on your hardware) OS readahead helps the master branch more than skipping helps the patch. That's surprising, but I guess it's possible here because skip scan only needs to access about every third page. And because this particular index was generated by CREATE INDEX, and so happens to have a strong correlation between key space order and physical block order. And probably because this is an index-only scan. > I wasn't suggesting it's a sympathetic case for skipscan. My point is > that it perfectly matches the costing assumptions, i.e. columns are > independent etc. But if it's not sympathetic, maybe the cost shouldn't > be 1/5 of cost from master? The costing is pretty accurate if we assume cached data, though -- which is what the planner will actually assume. In any case, is that really the only problem you see here? That the costing might be inaccurate because it fails to account for some underlying effect, such as the influence of OS readhead? Let's assume for a moment that the regression is indeed due to readahead effects, and that we deem it to be unacceptable. What can be done about it? I have a really hard time thinking of a fix, since by most conventional measures skip scan is indeed much faster here. -- Peter Geoghegan
On 9/20/24 16:21, Peter Geoghegan wrote: > On Fri, Sep 20, 2024 at 9:45 AM Tomas Vondra <tomas@vondra.me> wrote: >> 3) restart cluster, drop caches >> >> 4) run the query from the SQL script >> >> I suspect you don't do (3). I didn't mention this explicitly, my message >> only said "with uncached data", so maybe that's the problem? > > You're right that I didn't do step 3 here. I'm generally in the habit > of using fully cached data when testing this kind of work. > > The only explanation I can think of is that (at least on your > hardware) OS readahead helps the master branch more than skipping > helps the patch. That's surprising, but I guess it's possible here > because skip scan only needs to access about every third page. And > because this particular index was generated by CREATE INDEX, and so > happens to have a strong correlation between key space order and > physical block order. And probably because this is an index-only scan. > Good idea. Yes, it does seem to be due to readahead - if I disable that, the query takes ~320ms on master and ~280ms with the patch. >> I wasn't suggesting it's a sympathetic case for skipscan. My point is >> that it perfectly matches the costing assumptions, i.e. columns are >> independent etc. But if it's not sympathetic, maybe the cost shouldn't >> be 1/5 of cost from master? > > The costing is pretty accurate if we assume cached data, though -- > which is what the planner will actually assume. In any case, is that > really the only problem you see here? That the costing might be > inaccurate because it fails to account for some underlying effect, > such as the influence of OS readhead? > > Let's assume for a moment that the regression is indeed due to > readahead effects, and that we deem it to be unacceptable. What can be > done about it? I have a really hard time thinking of a fix, since by > most conventional measures skip scan is indeed much faster here. > It does seem to be due to readahead, and the costing not accounting for these effects. And I don't think it's unacceptable - I don't think we consider readahead elsewhere, and it certainly is not something I'd expect this patch to fix. So I think it's fine. Ultimately, I think this should be "fixed" by explicitly prefetching pages. My index prefetching patch won't really help, because AFAIK this is about index pages. And I don't know how feasible it is. regards -- Tomas Vondra
On Fri, Sep 20, 2024 at 10:07 AM Tomas Vondra <tomas@vondra.me> wrote: > Yes, I think backpatching to 17 would be fine. I'd be worried about > maybe disrupting some monitoring in production systems, but for 17 that > shouldn't be a problem yet. So fine with me. I'll commit minimal changes to _bt_first that at least make the counters consistent, then. I'll do so soon. > FWIW I wonder how likely is it that someone has some sort of alerting > tied to this counter. I'd bet few people do. It's probably more about a > couple people looking at explain plans, but they'll be confused even if > we change that only starting with 17. On 17 the behavior in this area is totally different, either way. > Ah, OK. So we do probe the index like this. I was under the impression > we don't do that. But yeah, this makes sense. Well, we don't have *explicit* next-key probes. If you think of values like "Aardvark" + infinitesimal as just another array value (albeit one that requires a little special handling in _bt_first), then there are no explicit probes. There are no true special cases required. Maybe this sounds like a very academic point. I don't think that it is, though. Bear in mind that even when _bt_first searches for index tuples matching a value like "Aardvark" + infinitesimal, there's some chance that _bt_search will return a leaf page with tuples that the index scan ultimately returns. And so there really is no "separate explicit probe" of the kind the MDAM paper contemplates. When this happens, we won't get any exact matches for the sentinel search value, but there could still be matches for (say) "WHERE a = 'Abacus' AND b = 55" on that same leaf page. In general, repositioning the scan to later "within" the 'Abacus' index tuples might not be required -- our initial position (based on the sentinel search key) could be "close enough". This outcome is more likely to happen if the query happened to be written "WHERE b = 1", rather than "WHERE b = 55". > Yes, it does. Most of my confusion was caused by my belief that we can't > probe the index for the next value without "incrementing" the current > value, but that was a silly idea. It's not a silly idea, I think. Technically that understanding is fairly accurate -- we often *do* have to "increment" to get to the next value (though reading the next value from an index tuple and then repositioning using it with later/less significant scan keys is the other possibility). Incrementing is always possible, even without skip support, because we can always fall back on +infinitesimal style sentinel values (AKA SK_BT_NEXTPRIOR values). That's the definitional sleight-of-hand that allows _bt_advance_array_keys to not have to think about skip arrays as a special case, regardless of whether or not they happen to have skip support. > > It might be possible to add skip support for text, but there wouldn't > > be much point. > > > > Stupid question - so why does it make sense for types like int? There > can also be a lot of values between the current and the next value, so > why would that be very different from "incrementing" a text value? Not a stupid question at all. You're right; it'd be the same. Obviously, there are at least some workloads (probably most) where any int columns will contain values that are more or less fully contiguous. I also expect there to be some workloads where int columns appear in B-Tree indexes that contain values with large gaps between neighboring values (e.g., because the integers are hash values). We'll always use skip support for any omitted prefix int column (same with any opclass that offers skip support), but we can only expect to see a benefit in the former "dense" cases -- never in the latter "sparse" cases. The MDAM paper talks about an adaptive strategy for dense columns and sparse columns. I don't see any point in that, and assume that it's down to some kind of implementation deficiencies in NonStop SQL back in the 1990s. I can just always use skip support in the hope that integer column data will turn out to be "sparse" because there's no downside to being optimistic about it. The access patterns are exactly the same as they'd be with skip support disabled. My "academic point" about not having *explicit* next-key probes might make more sense now. This is the thing that makes it okay to always be optimistic about types with skip support containing "dense" data. FWIW I actually have skip support for the UUID opclass. I implemented it to have test coverage for pass-by-reference types in certain code paths, but it's otherwise I don't expect it to be useful -- in practice all UUID columns contain "sparse" data. There's still no real downside to it, though. (I wouldn't try to do it with text because it'd be much harder to implement skip support correctly, especially with collated text.) -- Peter Geoghegan
On Fri, Sep 20, 2024 at 10:59 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Sep 20, 2024 at 10:07 AM Tomas Vondra <tomas@vondra.me> wrote: > > Yes, I think backpatching to 17 would be fine. I'd be worried about > > maybe disrupting some monitoring in production systems, but for 17 that > > shouldn't be a problem yet. So fine with me. > > I'll commit minimal changes to _bt_first that at least make the > counters consistent, then. I'll do so soon. Pushed, thanks -- Peter Geoghegan
On Mon, Nov 4, 2024 at 4:58 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > This is a review on v11, not the latest v13. I suspect most comments > still apply, but I haven't verified this. v11 is indeed quite similar to v13, so this shouldn't really matter. > I'm a bit concerned about the additional operations that are being > added to the scan. Before this patch, the amount of work in the > "horizontal" portion of the scan was limited to user-supplied > scankeys, so O(1) even when the index condition is only (f < 7). But, > with this patch, we're adding work for (a=, b=, c=, etc.) for every > tuple in the scan. There's no question that there are still some cases where this cannot possibly pay for itself. And that just isn't acceptable -- no arguments here. > As these new "skip array" keys are primarily useful for inter-page > coordination (by determining if we want to start a primitive scan to > skip to a different page and which value range that primitive scan > would search for, or continue on to the next sibling), can't we only > apply the "skip array" portion of the code at the final tuple we > access on this page? I plan on doing something like that. I'll need to. AFAICT I only need to avoid wasting CPU cycles here -- there are no notable regressions from performing excessive amounts of index descents, as far as I can tell. And so I plan on doing this without fundamentally changing anything about the current design. In particular, I want the performance to remain robust in cases where the best strategy varies significantly as the scan progresses -- even when we need to strongly favor skipping at first, and then strongly favor staying/not skipping later on (all during the same individual index scan). I really like that the current design gets that part right. > While this section already defines some things about index scans which > seem btree-specific, I don't think we should add more references to > btree scan internals in a section about bitmaps and bitmap index > scans. This section of the docs discusses the trade-off between multiple single column indexes, and fewer multi-column indexes. How could skip scan not be relevant to such a discussion? One of the main benefits of skip scan is that it'll allow users to get by with fewer indexes. > > +++ b/src/backend/access/nbtree/nbtree.c > [...] > > - slock_t btps_mutex; /* protects above variables, btps_arrElems */ > > + LWLock btps_lock; /* protects above variables, btps_arrElems */ > > Why is this changed to LWLock, when it's only ever acquired exclusively? In general one should never do more than an extremely small, tightly controlled amount of work with a spinlock held. It's now possible that we'll allocate memory with the lock held -- doing that with a spinlock held is an obvious no-no. We really need to use an LWLock for this stuff now, on general principle. > > +btestimateparallelscan(Relation rel, int nkeys, int norderbys) > > I notice you're using DatumSerialize. Are there reasons why we > wouldn't want to use heap_fill_tuple, which generally produces much > more compact output? heap_fill_tuple doesn't support the notion of -inf and +inf scan key sentinel values. Plus I'm inclined to use DatumSerialize because it's more or less designed for this kind of problem. > Also, I think you can limit the space usage to BLCKSZ in total, > because a full index tuple can't be larger than 1/3rd of a block; and > for skip scans we'll only have known equality bounds for a prefix of > attributes available in the index tuples, and a single (?) > index-produced dynamic attribute we want to skip ahead of. So, IIUC, > at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ. > Right? Possibly, but who wants to take a chance? The scheme you're describing only saves memory when there's 3 skip arrays, which is fairly unlikely in general. I think that the approach taken to serializing the array keys should be as conservative as possible. It's not particularly likely that we'll want to do a parallel skip scan. It's rather hard to test those code paths. > I needed to look up what this 'cons up' thing is, as it wasn't > something that I'd seen before. It also seems used exclusively in > btree code, and only after the array keys patch, so I think it'd be > better in general to use 'construct' here. FWIW I wasn't the first person to use the term in the nbtree code. I think you're right, though. It is a needlessly obscure term that is only known to Lisp hackers. I'll fix it. > > +++ b/src/backend/access/nbtree/nbtcompare.c > > The changes here are essentially 6x the same code, but for different > types. What do you think about the attached > 0001-Deduplicate[...].patch.txt, which has the same effect but with > only 1 copy of the code checked in? Reminds me of the approach taken by this extension: https://github.com/petere/pguint I do find the need to write so much boilerplate code for B-Tree opclasses annoying. I also find it annoying that the nbtree code insists on being as forgiving as possible with incomplete opfamilies. But those problems seem out of scope here -- not like I'm really making it any worse. > > +++b/src/backend/access/nbtree/nbtutils.c > [...] > > +_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts) > > Why does this stop processing keys after hitting a row compare? Why not? It's not ideal, but there are a number of things about RowCompare scan keys that are already less than ideal. We don't even try to do any kind of preprocessing that involves RowCompares -- they're already a slightly awkward special case to the nbtree code. > Doesn't skip scan still apply to any subsequent normal keys? E.g. > "c=1" creates a scan "a=skip, b=skip, c=1", so "(a, b)>(1, 2), c=1" > should IMO still allow a skip scan for a=skip, b=1 to be constructed - > it shouldn't be that we get a much less specific (and potentially, > performant) scan just by adding a rowcompare scankey on early > attributes. It's just awkward to get it to work as expected, while still preserving all of the useful properties of the design. Your "(a, b)>(1,2)" scan key returns rows matching: WHERE (a = 1 AND b > 2) OR (a > 1) And so your complete "(a, b)>(1,2) AND c = 1" qual returns rows matching: WHERE ((a = 1 AND b > 2) OR (a > 1)) AND c = 1 It is difficult to imagine how the existing design for skip arrays can be extended to support this kind of qual, though. I guess we'd still need skip arrays on both "a" and "b" here, though. Right? The "b" skip array would be restricted to a range of values between 3 and +inf inclusive, if and only if we were still on the "a" skip array's first array element (i.e. iff "a = 1" the "b" has a valid low_compare). Otherwise (i.e. when "a > 1"), the "b" skip array wouldn't be constrained by any low_compare inequality. So low_compare/high_compare only apply conditionally, in a world where we support these kinds of RowCompare quals. Right now I can avoid the problem by refusing to allow "c = 1" to ever be marked required (by never creating any skip array on an index attribute >= an attribute with a RowCompare key on input). Obviously, the current design of skip arrays involves arrays that are always constrained by the same range/low_compare and high_compare inequalities, independent of any other factor/any wider context. It's not impossible to make something like your RowCompare case work, but it'd be very significantly more complicated than the existing design. Though not all that much more useful. Doing something like this might make sense in the context of a project that adds support for the MDAM paper's "General OR Optimization" transformations -- RowCompare support would only be a bonus. I don't see any opportunities to target RowCompare as independent work -- seems as if RowCompare quals aren't significantly simpler than what is required to support very general MDAM OR optimizations. > > _bt_preprocess_array_keys > > - output_ikey = 0; > > + numArrayKeyData, > > + numSkipArrayKeys; > > I don't think numArrayKeyData/arrayKeyData are good names here, as it > confused me many times reviewing this function's changes. The code in this area actually did change recently, though I didn't announce it. > On a scankey > of a=1,b=2 we won't have any array keys, yet this variable is set to > 2. Something like numOutputKeys is probably more accurate. That's not accurate. _bt_preprocess_array_keys() sets *new_numberOfKeys on output. When there are no array keys (of either type) to be output, _bt_preprocess_array_keys will just return NULL -- it won't have changed the *new_numberOfKeys passed by its _bt_preprocess_keys caller when it returns NULL. In other words, when _bt_preprocess_array_keys determines that there are no array keys to process (neither SAOP arrays nor skip arrays), it won't modify anything, and won't return an alternative input-to-_bt_preprocess_keys scan key array. _bt_preprocess_keys will work directly off of the scan->keyData[] input scan keys passed by the executor proper. > > + /* Create a skip array and scan key where indicated by skipatts */ > > + while (numSkipArrayKeys && > > + attno_skip <= scan->keyData[input_ikey].sk_attno) > > + { > > + Oid opcintype = rel->rd_opcintype[attno_skip - 1]; > > + Oid collation = rel->rd_indcollation[attno_skip - 1]; > > + Oid eq_op = skipatts[attno_skip - 1].eq_op; > > + RegProcedure cmp_proc; > > + > > + if (!OidIsValid(eq_op)) > > + { > > + /* won't skip using this attribute */ > > + attno_skip++; > > Isn't this branch impossible, given that numSkipArrayKeys is output > from _bt_decide_skipatts, whose output won't contain skipped > attributes which have eq_op=InvalidOid? I'd replace this with > Assert(OidIsValid(eq_op)). It's very possible to hit this code path -- we'll hit this code path every time an explicit "=" input scan key appears before an attribute that requires a skip array (which happens whenever there's a third, later column that we'll be able to mark required thanks to the skip array). Note that BTSkipPreproc.eq_op is the "= op to be used to add array, if any". And so when we see !OidIsValid(eq_op) in this loop, it means "this is for an index attribute that we don't want to add a skip array to" (though, as I said, a later attribute is expected to get a skip array when this happens). > > _bt_rewind_nonrequired_arrays > > What types of scan keys can still generate non-required array keys? Right now it's: 1. As you said, certain cases involving RowCompare scan keys. 2. Cases involving B-Tree operator classes that don't even have a "=" operator (yes, technically those are supported!). Also seems possible that I'll end up relying on our support for non-required arrays when I go fix those regressions. Seems possible that I'll get rid of the explicit SK_BT_REQFWD/SK_BT_REQBKWD markings, and go back to treating scan keys as required based on context (a little like how things were prior to commit 7ccaf13a06). > It seems to me those are now mostly impossible, as this patch generates > required skip arrays for all attributes that don't yet have an > equality key and which are ahead of any (in)equality keys, except the > case with row compare keys which I already commented on above. I agree that non-required arrays become an obscure edge case with the patch, having been reasonably common before now (well, they were common in Postgres 17). I don't think that that provides us with any opportunities to get rid of unneeded code. > > utils/skipsupport.[ch] > I'm not sure why this is included in utils - isn't this exclusively > used in access/nbtree/*? The location of skip support is based on (though slightly different to) the location of sort support. In general many B-Tree opclasses are implemented in or around src/utils/adt/*. The exception is all of the stuff in nbtcompare.c, though I always found that weird (I wouldn't mind getting rid of nbtcompare.c by relocating its code places like int.c and int8.c). > > +++ b/src/include/access/nbtree.h > BTArrayKeyInfo explodes in size, from 24B to 88B. I think some of that > is necessary, but should it really be that large? I'm disinclined to do anything about it right now. I'll make a note of it, and review when the most important performance problems are fixed. Thanks for the review -- Peter Geoghegan