Re: Adding skip scan (including MDAM style range skip scan) to nbtree - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Msg-id: 9fddb4f3-22c9-401e-adcc-be46519f58cd@vondra.me
In response to: Re: Adding skip scan (including MDAM style range skip scan) to nbtree (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
List: pgsql-hackers
On 9/18/24 00:14, Peter Geoghegan wrote:
> On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas@vondra.me> wrote:
>> I've been looking at this patch over the last couple of days, mostly
>> doing some stress testing / benchmarking (hence the earlier report)
>> and basic review.
>
> Thanks for taking a look! Very helpful.
>
>> I do have some initial review comments, and the testing produced
>> some interesting regressions (not sure if those are the cases where
>> skipscan can't really help, that Peter mentioned he needs to look
>> into).
>
> The one type of query that's clearly regressed in a way that's just
> not acceptable are queries where we waste CPU cycles during scans
> where it's truly hopeless. For example, I see a big regression on one
> of the best cases for the Postgres 17 work, described here:
>
> https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement
>
> Notably, these cases access exactly the same buffers/pages as before,
> so this really isn't a matter of "doing too much skipping". The number
> of buffers hit exactly matches what you'll see on Postgres 17. It's
> just that we waste too many CPU cycles in code such as
> _bt_advance_array_keys, to uselessly maintain skip arrays.
>
> I'm not suggesting that there won't be any gray area with these
> regressions -- nothing like this will ever be that simple. But it
> seems to me like I should go fix these obviously-not-okay cases next,
> and then see where that leaves everything else, regressions-wise. That
> seems likely to be the most efficient way of dealing with the
> regressions. So I'll start there.
>
> That said, I *would* be surprised if you found a regression in any
> query that simply didn't receive any new scan key transformations in
> new preprocessing code in places like _bt_decide_skipatts and
> _bt_skip_preproc_shrink. I see that many of the queries that you're
> using for your stress-tests "aren't really testing skip scan", in this
> sense. But I'm hardly about to tell you that you shouldn't spend time
> on such queries -- that approach just discovered a bug affecting
> Postgres 17 (that was also surprising, but it still happened!). My
> point is that it's worth being aware of which test queries actually
> use skip arrays in the first place -- it might help you with your
> testing. There are essentially no changes to _bt_advance_array_keys
> that'll affect traditional SAOP arrays (with the sole exception of
> changes made by
> v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which
> affect every kind of array in the same way).
>

Makes sense. I started with the testing before even looking at the
code, so it's mostly a "black box" approach. I did read the 1995 paper
before that, and the script generates queries with clauses inspired by
that paper, in particular:

- col = $value
- col IN ($values)
- col BETWEEN $value AND $value
- NOT (clause)
- clause [AND|OR] clause

There certainly may be gaps and interesting cases the script does not
cover. Something to improve.
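For illustration, the generated queries combine those clause shapes
roughly like this (made-up tables/values, not verbatim output of the
script):

    -- sample shapes for a table t with an index on (id1, id2)
    SELECT id1, id2 FROM t WHERE id1 = 42;
    SELECT id1, id2 FROM t WHERE id1 IN (3, 17, 42) AND id2 = 7;
    SELECT id1, id2 FROM t WHERE id1 BETWEEN 10 AND 20 AND id2 = 7;
    SELECT id1, id2 FROM t WHERE NOT (id1 IN (3, 17, 42)) AND id2 = 7;
    SELECT id1, id2 FROM t WHERE (id1 = 42 OR id1 BETWEEN 100 AND 200)
                             AND id2 = 7;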
>> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>>
>> - I find the places that increment "nsearches" a bit random. Each AM
>> does it in an entirely different place (at least it seems like that
>> to me). Is there a way to make this a bit more consistent?
>
> From a mechanical perspective there is nothing at all random about it:
> we do this at precisely the same point that we currently call
> pgstat_count_index_scan, which in each index AM maps to one descent of
> the index. It is at least consistent. Whenever a B-Tree index scan
> shows "Index Searches: N", you'll see precisely the same number by
> swapping it with an equivalent contrib/btree_gist-based GiST index and
> running the same query again (assuming that index tuples that match
> the array keys are spread apart in both the B-Tree and GiST indexes).
>
> (Though I see problems with the precise place that nbtree calls
> pgstat_count_index_scan right now, at least in certain edge-cases,
> which I discuss below in response to your questions about that.)
>

OK, understood. FWIW I'm not saying these places are "wrong", just that
it feels like each AM does that in a very different place.

>>     uint64      btps_nsearches; /* instrumentation */
>>
>> Instrumentation what? What's the counter for?
>
> Will fix.
>
> In case you missed it, there is another thread + CF Entry dedicated to
> discussing this instrumentation patch:
>
> https://commitfest.postgresql.org/49/5183/
> https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
>

Thanks, I wasn't aware of that.

>> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that
>> mean we skip it if the earlier code does "goto readcomplete"?
>> Shouldn't that still count as an index scan?
>
> In my opinion, no, it should not.
>
> We're counting the number of times we'll have descended the tree using
> _bt_search (or using _bt_endpoint, perhaps), which is a precisely
> defined physical cost. A little like counting the number of buffers
> accessed. I actually think that this aspect of how we call
> pgstat_count_index_scan is a bug that should be fixed, with the fix
> backpatched to Postgres 17. Right now, we see completely different
> counts for a parallel index scan, compared to an equivalent serial
> index scan -- differences that cannot be explained as minor
> differences caused by parallel scan implementation details. I think
> that it's just wrong right now, on master, since we're simply not
> counting the thing that we're supposed to be counting (not reliably,
> not if it's a parallel index scan).
>

OK, understood. If it's essentially an independent issue (perhaps even
counts as a bug?) what about correcting it on master first? Doesn't
sound like something we'd backpatch, I guess.

>> - show_indexscan_nsearches does this:
>>
>>     if (scanDesc && scanDesc->nsearches > 0)
>>         ExplainPropertyUInteger("Index Searches", NULL,
>>                                 scanDesc->nsearches, es);
>>
>> But shouldn't it divide the count by nloops, similar to (for example)
>> show_instrumentation_count?
>
> I can see arguments for and against doing it that way. It's
> ambiguous/subjective, but on balance I favor not dividing by nloops.
> You can make a similar argument for doing this with "Buffers: ", and
> yet we don't divide by nloops there, either.
>
> Honestly, I just want to find a way to do this that everybody can live
> with. Better documentation could help here.
>

Seems like a bit of a mess. IMHO we should either divide everything by
nloops (so that everything is "per loop"), or not divide anything. My
vote would be to divide, but that's mostly my "learned assumption" from
the other fields. But having a 50:50 split is confusing for everyone.
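To illustrate, the "divide" option I have in mind would look roughly
like this (a sketch only, modeled on show_instrumentation_count; it
assumes planstate and es are in scope and planstate->instrument was
already checked, as explain.c does elsewhere; not tested):

    /* per-loop variant: average the searches over nloops */
    double      nloops = planstate->instrument->nloops;

    if (scanDesc && scanDesc->nsearches > 0 && nloops > 0)
        ExplainPropertyFloat("Index Searches", NULL,
                             scanDesc->nsearches / nloops, 0, es);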
>> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
>>
>> - Admittedly very subjective, but I find the "oppoDirCheck"
>> abbreviation rather weird, I'd just call it "oppositeDirCheck".
>
> Will fix.
>
>> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
>>
>> - nothing
>
> Great. I think that I should be able to commit this one soon, since
> it's independently useful work.
>

+1

>> 4) v6-0004-Add-skip-scan-to-nbtree.patch
>>
>> - indices.sgml seems to have a typo "Intevening" -> "Intervening"
>>
>> - It doesn't seem like a good idea to remove the paragraph about
>> multicolumn indexes and replace it with just:
>>
>>     Multicolumn indexes should be used judiciously.
>>
>> I mean, what does judiciously even mean? What should the user
>> consider to be judicious? Seems rather unclear to me. Admittedly, the
>> old text was not much help, but at least it gave some advice.
>
> Yeah, this definitely needs more work.
>
>> But maybe more importantly, doesn't skipscan apply only to a rather
>> limited subset of data types (that support increment/decrement)?
>> Doesn't the new wording mostly ignore that, implying skipscan applies
>> to all btree indexes? I don't think it mentions datatypes anywhere,
>> but there are many indexes on data types like text, UUID and so on.
>
> Actually, no, skip scan works in almost the same way with all data
> types. Earlier versions of the patch didn't support every data type
> (perhaps I should have waited for that before posting my v1), but the
> version of the patch you looked at has no restrictions on any data
> type.
>
> You must be thinking of whether or not an opclass has skip support.
> That's just an extra optimization, which can be used for a small
> handful of discrete data types such as integer and date (hard to
> imagine how skip support could ever be implemented for types like
> numeric and text). There is a temporary testing GUC that will allow
> you to get a sense of how much skip support can help: try "set
> skipscan_skipsupport_enabled=off" with (say) my original MDAM test
> query to get a sense of that. You'll see more buffer hits needed for
> "next key probes", though not dramatically more.
>
> It's worth having skip support (the idea comes from the MDAM paper),
> but it's not essential. Whether or not an opclass has skip support
> isn't accounted for by the cost model, but I doubt that it's worth
> addressing (the cost model is already pessimistic).
>

I admit I'm a bit confused. I probably need to reread the paper, but my
impression was that the increment/decrement is required for skipscan to
work. If we can't do that, how would it generate the intermediate
values to search for? I imagine it would be possible to "step through"
the index, but I thought the point of skip scan is to not do that.

Anyway, probably a good idea for extending the stress testing script.
Right now it tests with "bigint" columns only.
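As for the skip support comparison, I'd expect the test to be
something like this (assuming the patch's testing GUC; the query is a
made-up stand-in, not the original MDAM test query):

    -- compare "Index Searches" / buffer hits with and without
    -- skip support, for a table t with an index on (id1, id2)
    SET skipscan_skipsupport_enabled = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT id1, id2 FROM t WHERE id2 = 7;

    SET skipscan_skipsupport_enabled = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT id1, id2 FROM t WHERE id2 = 7;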
>> - Very subjective nitpicking, but I find it a bit strange when a
>> comment about a block is nested in the block, like in _bt_first() for
>> the array->null_elem check.
>
> Will fix.
>
>> - assignProcTypes() claims providing skipscan for cross-type
>> scenarios doesn't make sense. Why is that? I'm not saying the claim
>> is wrong, but it's not clear to me why that would be the case.
>
> It is just talking about the support function that skip scan can
> optionally use, where it makes sense (skip support functions). The
> relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is
> largely copied from the existing nearby "else if (member->number ==
> BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In
> both stanzas we're talking about a capability that maps to a
> particular "input opclass", which means the opclass that maps to the
> datums that are stored on disk, in index tuples.
>
> There are no restrictions on the use of skip scan with queries that
> happen to involve the use of cross-type operators. It doesn't even
> matter if we happen to be using an incomplete opfamily, since range
> skip arrays never need to *directly* take the current array element
> from a lower/upper bound inequality scan key's argument. It all
> happens indirectly: code in places like _bt_first and _bt_checkkeys
> can use inequalities (which are stored in BTArrayKeyInfo.low_compare
> and BTArrayKeyInfo.high_compare) to locate the next matching on-disk
> index tuple that satisfies the inequality in question. Obviously, the
> located datum must be the same type as the one used by the array and
> its scan key (it has to be the input opclass type if it's taken from
> an index tuple).
>
> I think that it's a bit silly that nbtree generally bends over
> backwards to find a way to execute a scan, given an incomplete
> opfamily; in a green field situation it would make sense to just throw
> an error instead. Even still, skip scan works in a way that is
> maximally forgiving when incomplete opfamilies are used. Admittedly,
> it is just about possible to come up with a scenario where we'll now
> throw an error for a query that would have worked on Postgres 17. But
> that's no different to what would happen if the query had an explicit
> "= any( )" non-cross-type array instead of an implicit non-cross-type
> skip array. The real problem in these scenarios is the lack of a
> suitable cross-type ORDER proc (for a cross-type-operator query)
> within _bt_first -- not the lack of cross-type operators. This issue
> with missing ORDER procs just doesn't seem worth worrying about,
> since, as I said, even slightly different queries (that don't use skip
> scan) are bound to throw the same errors either way.
>

OK. Thanks for the explanation. I'll think about maybe testing such
queries too (with cross-type clauses).
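Something like this is what I have in mind for the cross-type clauses
(a hypothetical example; since the test columns are bigint, the
operators here resolve to cross-type members of the integer_ops
opfamily):

    -- cross-type comparisons against bigint columns
    SELECT id1, id2 FROM t_1000000_1000_1_2
     WHERE id1 = 42::int4
       AND id2 BETWEEN 10::int2 AND 20::int2;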
>> Peter asked me to look at the costing, and I think it looks generally
>> sensible.
>
> I'm glad that you think that I basically have the right idea here.
> Hard to know how to approach something like this, which doesn't have
> any kind of precedent to draw on.
>
>> We don't really have a lot of information to base the costing on in
>> the first place - the whole point of skipscan is about multicolumn
>> indexes, but none of the existing extended statistics seem very
>> useful. We'd need some cross-column correlation info, or something
>> like that.
>
> Maybe, but that would just mean that we'd sometimes be more optimistic
> about skip scan helping than we are with the current approach of
> pessimistically assuming that there is no correlation at all. Not
> clear that being pessimistic in this sense isn't the right thing to
> do, despite the fact that it's clearly less accurate on average.
>

Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
no correlation means pessimistic skipscan costing) in a comment before
btcostestimate, or somewhere close.

>> There's one thing that I don't quite understand, and that's how
>> btcost_correlation() adjusts correlation for multicolumn indexes:
>>
>>     if (index->nkeycolumns > 1)
>>         indexCorrelation = varCorrelation * 0.75;
>>
>> That seems fine for a two-column index, I guess. But shouldn't it
>> compound for indexes with more keys? I mean, 0.75 * 0.75 for third
>> column, etc? I don't think btcostestimate() does that, it just
>> remembers whatever btcost_correlation() returns.
>
> I don't know either. In general I'm out of my comfort zone here.
>

Don't we do something similar elsewhere? For example, IIRC we do some
adjustments when estimating grouping in estimate_num_groups(), and
incremental sort had to deal with something similar too. Maybe we could
learn something from those places ... (both from the good and bad
experiences).
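Just to spell out the compounding I mean (a sketch against the snippet
quoted above, not tested code):

    /* apply the 0.75 damping once per additional key column, so a
     * third column compounds to 0.75 * 0.75, and so on */
    double      indexCorrelation = varCorrelation;

    for (int i = 1; i < index->nkeycolumns; i++)
        indexCorrelation *= 0.75;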
>> The only alternative approach I can think of is not to adjust the
>> costing for the index scan at all, and only use this to enable (or
>> not enable) the skipscan internally. That would mean the overall plan
>> remains the same, and maybe sometimes we would think an index scan
>> would be too expensive and use something else. Not great, but it
>> doesn't have the risk of regressions - IIUC we can disable the
>> skipscan at runtime, if we realize it's not really helpful.
>
> In general I would greatly prefer to not have a distinct kind of index
> path for scans that use skip scan. I'm quite keen on a design that
> allows the scan to adapt to unpredictable conditions at runtime.
>

Right. I don't think I've been suggesting having a separate path, I
100% agree it's better to have this as an option for index scan paths.

> Of course, that doesn't preclude passing the index scan a hint about
> what's likely to work at runtime, based on information figured out
> when costing the scan. Perhaps that will prove necessary to avoid
> regressing index scans that are naturally quite cheap already -- scans
> where we really need to have the right general idea from the start to
> avoid any regressions. I'm not opposed to that, provided the index
> scan has the ability to change its mind when (for whatever reason) the
> guidance from the optimizer turns out to be wrong.
>

+1 (assuming it's feasible, given the amount of available information)

>> As usual, I wrote a bash script to do a bit of stress testing. It
>> generates tables with random data, and then runs random queries with
>> random predicates on them, while mutating a couple of parameters
>> (like the number of workers) to trigger different plans. It does that
>> on 16, master and with the skipscan patch (with the fix for parallel
>> scans).
>
> I wonder if some of the regressions you see can be tied to the use of
> an LWLock in place of the existing use of a spin lock. I did that
> because I sometimes need to allocate memory to deserialize the array
> keys, with the exclusive lock held. It might be the case that a lot of
> these regressions are tied to that, or something else that is far from
> obvious...have to investigate.
>
> In general, I haven't done much on parallel index scans here (I only
> added support for them very recently), whereas your testing places a
> lot of emphasis on parallel scans. Nothing wrong with that emphasis
> (it caught that 17 bug), but just want to put it in context.
>

Sure. With this kind of testing I don't know what I'm looking for, so I
try to cover a very wide range of cases. Inevitably, some of the cases
will not test the exact subject of the patch. I think it's fine.

>> I've uploaded the script and results from the last run here:
>>
>> https://github.com/tvondra/pg-skip-scan-tests
>>
>> There's the "run-mdam.sh" script that generates tables/queries, runs
>> them, collects all kinds of info about the query, and produces files
>> with explain plans, CSV with timings, etc.
>
> It'll take me a while to investigate all this data.
>

I think it'd help if I go through the results and try to prepare some
reproducers, to make it easier for you. After all, it's my script and
you'd have to reverse engineer some of it.

>> Anyway, I ran a couple thousand such queries, and I haven't found any
>> incorrect results (the script compares that between versions too). So
>> that's good ;-)
>
> That's good!
>
>> But my main goal was to see how this affects performance. The tables
>> were pretty small (just 1M rows, maybe ~80MB), but with restarts and
>> dropping caches, large enough to test this.
>
> The really compelling cases all tend to involve fairly selective index
> scans. Obviously, skip scan can only save work by navigating the index
> structure more efficiently (unlike loose index scan). So if the main
> cost is inherently bound to be the cost of heap accesses, then we
> shouldn't expect a big speed up.
>
>> For example, one of the slowed down queries is query 702 (top of page
>> 8 in the PDF). The query is pretty simple:
>>
>>     explain (analyze, timing off, buffers off)
>>     select id1, id2 from t_1000000_1000_1_2
>>      where NOT (id1 in (:list)) AND (id2 = :value);
>>
>> and it was executed on a table with random data in two columns, each
>> with 1000 distinct values. This is perfectly random data, so a great
>> match for the assumptions in costing etc.
>>
>> But with uncached data, this runs in ~50 ms on master, but takes
>> almost 200 ms with skipscan (these timings are from my laptop, but
>> similar to the results).
>
> I'll need to investigate this specifically. That does seem odd.
>
> FWIW, it's a pity that the patch doesn't know how to push down the NOT
> IN () here. The MDAM paper contemplates such a scheme. We see the use
> of filter quals here, when in principle this could work by using a
> skip array that doesn't generate elements that appear in the NOT IN()
> list (it'd generate every possible indexable value *except* the given
> list/array values). The only reason that I haven't implemented this
> yet is because I'm not at all sure how to make it work on the
> optimizer side. The nbtree side of the implementation will probably be
> quite straightforward, since it's really just a slight variant of a
> skip array, that excludes certain values.
>
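(To spell out what that pushdown would mean for a query of this shape,
written out by hand as ranges - purely illustrative, with made-up
values:

    -- MDAM-style rewrite of NOT (id1 IN (10, 20)), as ranges the
    -- skip array could generate instead of a filter qual
    SELECT id1, id2 FROM t_1000000_1000_1_2
     WHERE (id1 < 10 OR (id1 > 10 AND id1 < 20) OR id1 > 20)
       AND id2 = 997;

i.e. the skip array would enumerate the intervals between the excluded
values, rather than filtering them out after the fact.)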
>> -- with skipscan
>> Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
>> t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
>> (actual rows=811 loops=1)
>>   Index Cond: (id2 = 997)
>>   Index Searches: 1007
>>   Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
>>   Rows Removed by Filter: 163
>>   Heap Fetches: 0
>> Planning Time: 3.730 ms
>> Execution Time: 238.554 ms
>> (8 rows)
>>
>> I haven't looked into why this is happening, but this seems like a
>> pretty good match for skipscan (on the first column). And for the
>> costing too - it's perfectly random data, no correlation, etc.
>
> I wonder what "Buffers: N" shows? That's usually the first thing I
> look at (that and "Index Searches", which looks like what you said it
> should look like here). But, yeah, let me get back to you on this.
>

Yeah, I forgot to get that from my reproducer. But the logs in the
github repo with the results do have BUFFERS - for master (SEQ 12621),
the plan looks like this:

  Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
  t_1000000_1000_1_2 (cost=0.96..12179.41 rows=785 width=16)
  (actual rows=785 loops=1)
    Index Cond: (id2 = 997)
    Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[]))
    Rows Removed by Filter: 181
    Heap Fetches: 0
    Buffers: shared read=3094
  Planning:
    Buffers: shared hit=93 read=27
  Planning Time: 9.962 ms
  Execution Time: 38.007 ms
  (10 rows)

and with the patch (SEQ 12623) it's this:

  Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
  t_1000000_1000_1_2 (cost=0.96..1745.27 rows=784 width=16)
  (actual rows=785 loops=1)
    Index Cond: (id2 = 997)
    Index Searches: 1002
    Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[]))
    Rows Removed by Filter: 181
    Heap Fetches: 0
    Buffers: shared hit=1993 read=1029
  Planning:
    Buffers: shared hit=93 read=27
  Planning Time: 9.506 ms
  Execution Time: 179.048 ms
  (11 rows)

This is on exactly the same data, after dropping caches and restarting
the instance. So there should be no caching effects. Yet, there's a
pretty clear difference - the total number of buffers is the same, but
the patched version has many more hits. Yet it's slower. Weird, right?

regards

-- 
Tomas Vondra