From: Peter Geoghegan <pg@bowt.ie>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-WzkEyBU9UQM-5GWPcB=WEShAUKcJdvgFuqVHuPuO-iYW0Q@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > MDAM seems to require exponential storage for "scan key operations"
> > for conditions on N columns (to be precise, the product of the number
> > of distinct conditions on each column); e.g. an index on mytable
> > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
> > AND h IN (1, 2)" would require 2^8 entries.
>
> What you describe is a problem in theory, but I doubt that it's a
> problem in practice. You don't actually have to materialize the
> predicates up-front, or at all. Plus you can skip over them using the
> next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using the "next index tuple" approach (using binary searches to find array keys using index tuple values). This approach was necessary for fairly mundane reasons (it limits the amount of work required while holding a buffer lock), but it also solves quite a few other problems that I find far more interesting.

It's easy to imagine the state machine from v2 of the patch being extended for skip scan. My approach "abstracts away" the arrays. For skip scan, it would more or less behave as if the user had written a query "WHERE a IN (<every possible value for this column>) AND b = 5 ..." -- without actually knowing what the so-called array keys for the high-order skipped column are (not up front, at least). We'd only need to track the current "array key" for the scan key on the skipped column, "a". The state machine would notice when the scan had reached the next-greatest "a" value in the index (whatever that might be), and then make that the current value. Finally, the state machine would effectively instruct its caller to consider repositioning the scan via a new descent of the index. In other words, almost everything for skip scan would work just like regular SAOPs -- and any differences would be well encapsulated.

But it's not just skip scan. This approach also enables thinking of SAOP index scans (using nbtree) as just another type of indexable clause, without any special restrictions (compared to true indexable operators such as "=", say). Particularly in the planner. That was always the general thrust of teaching nbtree about SAOPs, from the start. But it's something that should be totally embraced IMV. That's just what the patch proposes to do.
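To give a feel for the "binary searches using index tuple values" part, here is a toy model of forward-scan array key advancement, written as a standalone C program. To be clear, this is only an illustration of the technique -- it isn't code from the patch, and every name in it is invented:

/*
 * Toy model of binary-search-driven array key advancement (forward
 * scan case). Illustration only -- not code from the patch.
 */
#include <stdbool.h>
#include <stdio.h>

/*
 * Find the first array element >= tupval. Returns the new current
 * array key's index, or nelems when the tuple value is past the final
 * array key (at which point the scan can simply terminate).
 */
static int
advance_array_key(const int *elems, int nelems, int tupval, bool *exact)
{
    int     low = 0,
            high = nelems;

    while (low < high)
    {
        int     mid = low + (high - low) / 2;

        if (elems[mid] < tupval)
            low = mid + 1;
        else
            high = mid;
    }
    *exact = (low < nelems && elems[low] == tupval);
    return low;
}

int
main(void)
{
    /* sorted and deduplicated array keys, as if from "a IN (2, 5, 9)" */
    int     elems[] = {2, 5, 9};
    /* column "a" values taken from successive index tuples */
    int     tupvals[] = {1, 2, 3, 7, 9, 11};
    bool    exact;

    for (int i = 0; i < 6; i++)
    {
        int     newidx = advance_array_key(elems, 3, tupvals[i], &exact);

        if (newidx == 3)
            printf("tuple value %d: past final array key, end scan\n",
                   tupvals[i]);
        else
            printf("tuple value %d: current array key is now %d (%s)\n",
                   tupvals[i], elems[newidx],
                   exact ? "tuple matches" : "tuple doesn't match");
    }
    return 0;
}

One search answers two questions at once: which array key becomes the current one, and whether the tuple at hand already matches it. Keeping advancement down to a binary search per tuple is what bounds the work that happens while a buffer lock is held.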
In more concrete terms, the patch now:

1. Entirely removes the long-standing restriction on generating path keys for index paths with SAOPs, even when there are inequalities on a high-order column present. You can mix SAOPs together with other clause types, arbitrarily, and everything still works and works efficiently.

For example, the regression test expected output for this query/test (from bugfix commit 807a40c5) is updated by the patch, as shown here:

explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;

-                                    QUERY PLAN
---------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+                                    QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column here (thousand) uses an inequality, a particularly tricky case. Now it's an index scan, much like any other, with no particular restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys, which is built on the same state machine, with only minor differences to deal with column values "appearing out of key space order".

3. Simplifies the optimizer side of things by consistently avoiding filter quals (except when truly unavoidable). The optimizer doesn't even consider alternative index paths with filter quals for lower-order SAOP columns, because they have no possible advantage anymore. On the other hand, as we saw already, upthread, filter quals have huge disadvantages. By always using true index quals, we automatically avoid any question of getting excessive amounts of heap page accesses just to eliminate non-matching rows. AFAICT we don't need to make a trade-off here.

The first version of the patch added some crufty code to the optimizer, to account for various restrictions on sort order. This revised version actually removes existing cruft from the same place (indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown for item 1. Bugfix commit 807a40c5 (which added the test query in question) dealt with an oversight in the then-recent original nbtree SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index scans with an inequality on their leading column, we cannot be sure that the output will appear in the same order that one big continuous index scan would return rows in. We can only expect to maintain the illusion that we're doing one continuous index scan when individual primitive index scans access earlier columns via the equality strategy -- we need "equality constraints". In practice, the optimizer (indxpath.c) is very conservative (more conservative than it really needs to be) about trusting the index scan to output rows in index order in the presence of SAOPs.

All of that now seems totally unnecessary. Again, I don't see a need to make a trade-off here. My observation about this query (and others like it) is: why not literally perform one continuous index scan instead (not multiple primitive index scans)? That is strictly better, given all the specifics here. Once we have a way to do that (which the nbtree executor work listed under item 2 provides), it becomes safe to assume that the tuples will be output in index order -- there is no illusion left to preserve. Who needs an illusion that isn't actually helping us? We actually do less I/O by using this strategy, for the usual reasons (we can avoid repeating index page accesses).
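To sketch the shape of that decision, here is another toy standalone C program. Again, this is only an illustration -- it isn't the patch's code, and deciding by comparing against the leaf page high key is a deliberate simplification:

/*
 * Toy model of "keep scanning this leaf page" vs. "start another
 * primitive index scan". Illustration only -- not code from the patch.
 */
#include <stdbool.h>
#include <stdio.h>

/*
 * After the arrays advance to new_key partway through a leaf page, a
 * forward scan can either keep reading the same page (behaving as one
 * big continuous index scan) or reposition itself with a fresh descent
 * of the index. Simplification: if matches for new_key cannot appear
 * before this page's upper bound (its high key), a new descent can
 * skip over any number of useless leaf pages.
 */
static bool
start_new_primitive_scan(int new_key, int page_high_key)
{
    return new_key > page_high_key;
}

int
main(void)
{
    int     page_high_key = 500;    /* this leaf page's upper bound */
    int     new_keys[] = {350, 9000};

    for (int i = 0; i < 2; i++)
        printf("new array key %d: %s\n", new_keys[i],
               start_new_primitive_scan(new_keys[i], page_high_key) ?
               "reposition via a new descent" : "keep scanning this page");
    return 0;
}

When the arrays only advance a short distance, the answer will usually be to keep reading the same page -- which is what makes it safe (and cheap) to behave as one continuous index scan in cases like the ORDER BY query under item 1.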
A more concrete benefit of the non-required-scankeys stuff can be seen by running Benoit Tigeot's test case [1] with v2. He had a query like this:

SELECT *
FROM docs
WHERE status IN ('draft', 'sent')
  AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST
LIMIT 20;

And his test case had an index on "sent_at DESC NULLS LAST, sender_reference, status". This variant was a weak spot for v1. v2 of the patch is vastly more efficient here, since we don't have to go to the heap to eliminate non-matching tuples -- that can happen in the index AM instead. This can easily be 2x-3x faster on a warm cache, and have *hundreds* of times fewer buffer accesses (which Benoit verified with an early version of this v2). All because we now require vastly less heap access -- the quals are fairly selective here, and we have to scan hundreds of leaf pages before the scan can terminate. Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one of my 3 items. The immediate problem that the query presents us with on the master branch is the problem of filter quals that require heap accesses to do visibility checks (a problem that index quals can never have). That makes it tempting to credit my item 3. But you can't really have item 3 without also having items 1 and 2. Taken together, they eliminate all possible downsides from using index quals. That high-level direction (try to have one good choice for the optimizer) seems important to me, both for this project and in general.

Other changes in v2:

* Improved costing, which takes advantage of the fact that nbtree now promises not to repeat any leaf page accesses (unless the scan is restarted or the direction of the scan changes). This makes the worst case far more predictable, and more closely tied to selectivity estimation -- you can't scan more pages than there are in the whole index. Just like with every other sort of index scan. (See the sketch after my signature for the general idea.)

* Support for parallel index scans. The existing approach to array keys for parallel index scans has been adapted to work with individual primitive index scans, not individual array keys. I haven't tested this very thoroughly just yet, but it seems to work well enough already. I think that it's important to not have very much variation between parallel and serial index scans, which I seem to have mostly avoided.

[1] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491

--
Peter Geoghegan
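P.S. Here is the costing idea in miniature, as one last toy standalone C sketch. As before, this is only an illustration -- it isn't the patch's actual costing code, and all of the names and numbers are invented:

/*
 * Toy sketch of clamping worst-case leaf page accesses. Illustration
 * only -- not code from the patch.
 */
#include <stdio.h>

/*
 * Once nbtree promises never to revisit a leaf page within a single
 * scan, the worst-case number of leaf pages read can be capped by the
 * total number of leaf pages in the index, no matter how many
 * primitive index scans the array keys might produce.
 */
static double
clamped_leaf_page_estimate(double pages_per_descent, double num_descents,
                           double total_leaf_pages)
{
    double  raw = pages_per_descent * num_descents;

    return (raw > total_leaf_pages) ? total_leaf_pages : raw;
}

int
main(void)
{
    /* 500 array keys at ~3 leaf pages per descent, 1000-leaf-page index */
    printf("worst case: %.0f leaf pages\n",
           clamped_leaf_page_estimate(3.0, 500.0, 1000.0));
    return 0;
}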