Re: Adding skip scan (including MDAM style range skip scan) to nbtree - Mailing list pgsql-hackers
From | Peter Geoghegan
Subject | Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date |
Msg-id | CAH2-Wz=KtVs1zBAnxvVWPq5CyTpO_f+eskOBP88OsF4sOE4M_A@mail.gmail.com
In response to | RE: Adding skip scan (including MDAM style range skip scan) to nbtree (<Masahiro.Ikeda@nttdata.com>)
Responses | Re: Adding skip scan (including MDAM style range skip scan) to nbtree
          | Re: Adding skip scan (including MDAM style range skip scan) to nbtree
List | pgsql-hackers
On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote:
> Since I'd like to understand the skip scan to improve the EXPLAIN output
> for multicolumn B-Tree Index[1], I began to try the skip scan with some
> queries and look into the source code.

Thanks for the review! Attached is v3, which generalizes skip scan, allowing it to work with opclasses/types that lack a skip support routine. In other words, v3 makes skip scan work for all types, including continuous types, where it's impractical or infeasible to add skip support. So now important types like text and numeric also get the skip scan optimization (it's not just discrete types like integer and date, as in previous versions).

I feel very strongly that everything should be implemented as part of the new skip array abstraction; the patch should only add the concept of skip arrays, which should work just like SAOP arrays. We should avoid introducing any special cases. In short, _bt_advance_array_keys should work in exactly the same way as it does as of Postgres 17 (except for a few representational differences for skip arrays). This seems essential because _bt_advance_array_keys inherently needs to be able to trigger moving on to the next skip array value when it reaches the end of a SAOP array (and vice versa). And so it just makes sense to abstract away the differences, hiding them in lower-level code.

I have described the new _bt_first behavior in this new v3 of the patch as "adding explicit next key probes". While v3 does make new changes to _bt_first, it's not really a special kind of index probe; v3 invents new sentinel values instead. The use of sentinels avoids inventing true special cases: the values -inf and +inf, as well as variants of = that use a real datum value but match on the next key in the index. These new = variants can be thought of as "+infinitesimal" values. So when _bt_advance_array_keys has to "increment" the numeric value 5.0, it sets the scan key to the value "5.0 +infinitesimal". There can never be any matching tuples in the index (just like with -inf sentinel values), but that doesn't matter.

So the changes v3 makes to _bt_first don't change the basic conceptual model. The added complexity is kept to a manageable level, particularly within _bt_advance_array_keys, which is already very complicated.
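To make the sentinel idea a little more concrete, here is a throwaway sketch of how a "+infinitesimal" element behaves under comparison (the names and representation are invented for this email; this is not code from the patch). The property that matters is that no real index key ever compares equal to such an element, so positioning naturally lands on the first key after the current value:

/*
 * Toy illustration of the sentinel idea (names invented for this email,
 * not taken from the patch).  A skip array element is either a real datum,
 * -inf, +inf, or a real datum marked "+infinitesimal", meaning "just after
 * this datum".  No index key ever compares equal to a sentinel, so search
 * positioning lands on the next distinct value without needing a special
 * kind of index probe.
 */
#include <stdio.h>

typedef enum
{
    ELEM_MINUS_INF,             /* sorts before every real key */
    ELEM_VALUE,                 /* an ordinary datum */
    ELEM_VALUE_PLUS_EPSILON,    /* "datum +infinitesimal" */
    ELEM_PLUS_INF               /* sorts after every real key */
} ToyElemKind;

typedef struct
{
    ToyElemKind kind;
    int         value;          /* used by the two VALUE kinds only */
} ToySkipElement;

/* Returns <0, 0, or >0 as indexkey sorts before, equal to, or after elem */
static int
toy_compare(int indexkey, ToySkipElement elem)
{
    switch (elem.kind)
    {
        case ELEM_MINUS_INF:
            return 1;
        case ELEM_PLUS_INF:
            return -1;
        case ELEM_VALUE:
            return (indexkey > elem.value) - (indexkey < elem.value);
        case ELEM_VALUE_PLUS_EPSILON:
            /* equal keys sort *before* the sentinel: nothing ever matches */
            return (indexkey <= elem.value) ? -1 : 1;
    }
    return 0;                   /* unreachable */
}

int
main(void)
{
    ToySkipElement five_plus = {ELEM_VALUE_PLUS_EPSILON, 5};

    /* 5 sorts before "5 +infinitesimal"; 6 sorts after it */
    printf("%d %d\n", toy_compare(5, five_plus), toy_compare(6, five_plus));
    return 0;
}

The real thing is of course expressed in terms of insertion scan keys and per-opclass comparisons; this is just the shape of the trick.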
To help with testing and review, I've added another temporary testing GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to avoid using skip support, even where available. The GUC makes it easy to measure how skip support routines can help performance (with discrete types like integer and date).

> I found the cost is estimated to be much higher if the number of skipped attributes
> is more than two. Is it expected behavior?

Yes and no. Honestly, the current costing is just placeholder code. It is totally inadequate. I'm not surprised that you found problems with it. I just didn't put much work into it, because I didn't really know what to do.

> # Test result. The attached file is the detail of tests.
>
> -- Index Scan
> -- The actual time is low since the skip scan works well
> -- But the cost is higher than one of seqscan
> EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
>                                                                QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_id1_id2_id3 on public.test  (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
>    Output: id1, id2, id3, value
>    Index Cond: (test.id3 = 101)
>    Buffers: shared hit=4402
>  Planning:
>    Buffers: shared hit=7
>  Planning Time: 0.234 ms
>  Execution Time: 15.711 ms
> (8 rows)

This is a useful example, because it shows the difficulty with the costing. I ran this query using my own custom instrumentation of the scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf pages at a time, but we still come out ahead. As you pointed out, it's ~7.5x faster than the sequential scan, though not very different from the equivalent full index scan -- at least not very different in terms of leaf page accesses.

Why should we win by this much, for what seems like a marginal case for skip scan? Even cases where "skipping" doesn't manage to skip any leaf pages can still benefit from skipping *index tuples* -- there is more than one kind of skipping to consider. That is, the patch helps a lot with some (though not all) cases where I didn't really expect that to happen: the Postgres 17 SAOP tuple skipping code (the code in _bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps quite a bit in "marginal" skip scan cases, even though it wasn't really designed for that purpose (it was added to avoid regressions in SAOP array scans for the Postgres 17 work).

I find that some queries using my original example test case are about twice as fast as an equivalent full index scan, even when only the fourth and final index column is used in the query predicate. The scan can't even skip a single leaf page at a time, and yet we still win by a nice amount. We win, though it is almost by mistake!
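To give a feel for that second kind of skipping, here is a deliberately crude cartoon of look-ahead within a single leaf page (this is not the real _bt_checkkeys_look_ahead/_bt_readpage code; the names and the stride policy are made up for this email). The point is just that when a probed tuple still fails the key check, everything before it can be skipped without being evaluated:

/*
 * Cartoon of skipping *index tuples* within one leaf page.  This is not the
 * real _bt_checkkeys_look_ahead/_bt_readpage logic; the stride policy and
 * names are made up.  The idea: probe ahead instead of evaluating the keys
 * against every tuple, and skip everything up to a probe that still fails.
 */
#include <stdio.h>

/* Returns the offset of the first key >= needed, or nitems if there is none */
static int
toy_look_ahead(const int *page_keys, int nitems, int needed)
{
    int         pos = 0;

    while (pos < nitems && page_keys[pos] < needed)
    {
        int         stride = 1;

        /* Gallop forward while the probed tuple still fails the check */
        while (pos + stride < nitems && page_keys[pos + stride] < needed)
        {
            pos += stride;
            stride *= 2;        /* get progressively more aggressive */
        }
        pos++;                  /* step past the last known-failing tuple */
    }
    return pos;
}

int
main(void)
{
    int         page[] = {1, 1, 2, 2, 3, 3, 4, 5, 7, 7, 8, 9};
    int         nitems = (int) (sizeof(page) / sizeof(page[0]));

    /* Finds the first key >= 7 without evaluating most earlier tuples */
    printf("first offset with key >= 7: %d\n", toy_look_ahead(page, nitems, 7));
    return 0;
}

The real code is much more careful about when it looks ahead at all, but this is roughly how a scan that never skips a whole leaf page can still avoid evaluating most of the tuples on that page.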
This is mostly a good thing. Both for the obvious reason (fast is better than slow), and because it justifies being so aggressive in assuming that skip scan might work out during planning (being wrong without really losing is nice). But there is also a downside: it makes it even harder to model the costs from within the optimizer. If I look at the actual costs other than runtime (e.g., buffer accesses), I'm not sure that the optimizer is wrong to think that the parallel sequential scan is faster. It looks approximately correct. It is only when we look at runtime that the optimizer's choice looks wrong. Which is...awkward.

In general, I have very little idea about how to improve the costing within btcostestimate. I am hoping that somebody has better ideas about it. btcostestimate is definitely the area where the patch is weakest right now.

> I look at btcostestimate() to find the reason and found the bound quals
> and cost.num_sa_scans are different from my expectation.
>
> My assumption is
> * bound quals is id3=XXX (and id1 and id2 are skipped attributes)
> * cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
>   per skipped attribute)
>
> But it's wrong. The above index scan result is
> * bound quals is NULL
> * cost.num_sa_scans = 1

The logic with cost.num_sa_scans was definitely not what I intended. That's fixed in v3, at least. But the code in btcostestimate is still essentially the same as in earlier versions -- it needs to be completely redesigned (or, uh, designed for the first time).

> As I know you said the below, but I'd like to know whether the above is expected or not.
> Currently, there is an assumption that "there will be 10 primitive index scans
> per skipped attribute". Is there any chance to use pg_stats.n_distinct?

It probably makes sense to use pg_stats.n_distinct here. But how? If the problem is that we're too pessimistic, then I think that this will usually (though not always) make us more pessimistic. Isn't that the wrong direction to go in? (We're probably also too optimistic in some cases, but being too pessimistic is a bigger problem in practice.) For example, your test case involved 11 distinct values in each column. The current approach of hard-coding 10 (which is just a temporary hack) should actually make the scan look a bit cheaper than it would if we used the true ndistinct.

Another underlying problem is that the existing SAOP costing really isn't very accurate even without skip scan -- that's a big source of the pessimism with arrays/skipping. Why should we be able to get the true number of primitive index scans just by multiplying together each omitted prefix column's ndistinct? That approach is good for getting the worst case, which is probably relevant -- but it's probably not a very good assumption for the average case. (Though at least we can cap the total number of primitive index scans to 1/3 of the total number of pages in the index in btcostestimate, since we have guarantees about the worst case as of Postgres 17.)
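Just to make that arithmetic concrete (purely illustrative -- none of this code is in the patch, and the index page count below is invented), multiplying the per-column ndistincts gives the worst-case bound, and the index's page count gives the cap:

/*
 * Illustrative arithmetic only -- not from the patch.  Worst case, every
 * distinct combination of the skipped prefix columns costs one primitive
 * index scan, but the Postgres 17 guarantees mean we can cap the total at
 * about 1/3 of the index's pages.  The index size used here is made up.
 */
#include <stdio.h>

static double
worst_case_primscans(const double *ndistinct, int nskipped, double index_pages)
{
    double      estimate = 1.0;

    for (int i = 0; i < nskipped; i++)
        estimate *= ndistinct[i];       /* product of per-column ndistincts */

    if (estimate > index_pages / 3.0)
        estimate = index_pages / 3.0;   /* worst-case cap */

    return estimate;
}

int
main(void)
{
    /* your test case: id1 and id2 skipped, 11 distinct values in each */
    double      ndistinct[] = {11.0, 11.0};

    printf("%.0f\n", worst_case_primscans(ndistinct, 2, 3000.0)); /* -> 121 */
    return 0;
}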
--
Peter Geoghegan