Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-WzmQUhfpSZrvcDmyk1OtydYv=TWRYJdp75m6=9hUaeHbFA@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, Mar 21, 2024 at 8:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v16. This has some of the changes that you asked for. But
> my main focus has been on fixing regressions that were discovered
> during recent performance validation.

Attached is v17. This revision deals with problems with parallel index scans that use array keys.

This patch is now *very* close to being committable. My only outstanding concern is the costing/selfuncs.c aspect of the patch, which I haven't thought about in many months. I'd estimate that I'm now less than a week away from pushing this patch. (Plus I need to decide which tests from my massive test suite should be included in the committed patch as standard regression tests.)

Back to v17. Earlier versions dealt with parallel index scans in a way that could lead to very unbalanced utilization of worker processes (at least with the wrong sort of query/index), which was clearly unacceptable. The underlying issue was the continued use of the BTParallelScanDescData.btps_arrayKeyCount field in shared memory, which was still used to stop workers from moving on to the next primitive scan. That old approach made little sense with this new high-level design. (I had been aware of the problem for months, but didn't get around to fixing it until now.)

I wasn't specifically trying to win any benchmarks here -- I really just wanted to make the design for parallel index scans with SAOPs a coherent part of the wider design, without introducing any regressions. I applied my usual charter for the patch when working on v17: I more or less removed any nbtree.c parallel index scan code that had direct awareness of array keys as distinct primitive index scans. This fixed the problem of unbalanced worker utilization, as expected, but it also seems to have benefits particular to parallel index scans with array keys -- which was unexpected. We now use an approach that's almost identical to the approach taken by every other type of parallel scan.
My new approach naturally reduces contention among backends that want to advance the scan. (Plus parallel index scans get all of the same benefits as serial index scans, of course.)

v17 replaces _bt_parallel_advance_array_keys with a new function, _bt_parallel_primscan_advance, which is now called from _bt_advance_array_keys. The new function doesn't need to increment btscan->btps_arrayKeyCount (nor does it increment a local copy in BTScanOpaqueData.arrayKeyCount). It is called at the specific point where array key advancement *tries* to start another primitive index scan. The difference (relative to the serial case) is that it might not succeed in doing so. It might not be possible to start another primitive index scan, since some other backend (one page ahead, or even one page behind) may already have done it for everybody. At that point it's easy for the backend that failed to start the next primitive scan to have _bt_advance_array_keys back out of everything. The backend then just goes back to consuming pages by seizing the scan.

This approach preserves the ability of parallel scans to have _bt_readpage release the next page before it even starts examining tuples from the current page. It's possible that 2 or 3 backends (each of which scans its own leaf pages from a group of adjacent pages) will "independently" decide that it's time to start another primitive scan -- but at most one backend can be granted the right to do so. It's even possible that one such backend among several (all of which might be expected to try to start the next primitive scan in unison) will *not* want to do so -- it could know better than the rest. This can happen when that backend lands one page ahead of the rest and sees no reason not to just continue the scan on the leaf level for now. Such a backend can effectively veto the idea of starting another primitive index scan for the whole top-level parallel scan.
This probably doesn't really matter much in and of itself (skipping ahead by only one or two pages is suboptimal, but not really a problem), but that's beside the point. The point is that having very flexible rules like this works out to be both simpler and better performing than a more rigid, pessimistic approach would be. In particular, keeping the time that the scan is seized by any one backend to an absolute minimum is important -- it may be better to detect and recover from contention than to try to prevent it altogether.

Just to be clear, all of this only comes up during parts of the scan where primitive index scans actually look like a good idea in the first place. It's perfectly possible for the scan to process thousands of array keys without any backend ever considering starting another primitive index scan. Much of the time, every worker process can independently advance its own private array keys, without requiring any sort of special coordination (nothing beyond the coordination required by *all* parallel index scans, including those without any SAOP arrays). As I've said many times already, the high-level charter of this project is to make scans that have arrays (whether serial or parallel) as similar as possible to any other type of scan.

--
Peter Geoghegan