Thread: MDAM techniques and Index Skip Scan patch
I returned to the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] as part of the process of reviewing v39 of the skip scan patch, which was posted back in May. It's a great paper, and anybody involved in the skip scan effort should read it thoroughly (if they haven't already). It's easy to see why people get excited about skip scan [2]. But there is a bigger picture here. I don't necessarily expect to come away from this discussion with a much better high level architecture for the patch, or any kind of deeper insight, or even a frame of reference for further discussion. I just think that we ought to *try* to impose some order on this stuff. Like many difficult patches, the skip scan patch is not so much troubled by problems with the implementation as it is troubled by *ambiguity* about the design. Particularly concerning how skip scan meshes with existing designs, as well as future designs -- particularly designs for other MDAM techniques. I've started this thread to have a big picture conversation about how to think about these things. Many other MDAM techniques also seem highly appealing. Much of the MDAM stuff is for data warehousing use-cases, while skip scan/loose index scan is seen as more of an OLTP thing. But they are still related, clearly. I'd like to also talk about another patch, that ISTM had that same quality -- it was also held back by high level design uncertainty. Back in 2018, Tom abandoned a patch that transformed a star-schema style query with left outer joins on dimension tables with OR conditions, into an equivalent query that UNIONs together 2 distinct queries [3][4]. Believe it or not, I am now reminded of that patch by the example of "IN() Lists", from page 5 of the paper. We see this example SQL query: SELECT date, item_class, store, sum(total_sales) FROM sales WHERE date between '06/01/95' and '06/30/95' and item_class IN (20,35,50) and store IN (200,250) GROUP BY dept, date, item_class, store; Granted, this SQL might not seem directly relevant to Tom's patch at first -- there is no join for the optimizer to even try to eliminate, which was the whole basis of Jim Nasby's original complaint, which is what spurred Tom to write the patch in the first place. But hear me out: there is still a fact table (the sales table) with some dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the table (and this SQL query) drives discussion of an optimization involving transforming a predicate with many ORs (which is explicitly said to be logically/semantically equivalent to the IN() lists from the query). They transform the query into a bunch of disjunct clauses that can easily be independently executed, and combined at the end (see also "General OR Optimization" on page 6 of the paper). Also...I'm not entirely sure that the intended underlying "physical plan" is truly free of join-like scans. If you squint just right, you might see something that you could think of as a "physical join" (at least very informally). The whole point of this particular "IN() Lists" example is that we get to the following, for each distinct "dept" and "date" in the table: dept=1, date='06/04/95', item_class=20, store=200 dept=1, date='06/04/95', item_class=20, store=250 dept=1, date='06/04/95', item_class=35, store=200 dept=1, date='06/04/95', item_class=35, store=250 dept=1, date='06/04/95', item_class=50, store=200 dept=1, date='06/04/95', item_class=50, store=250 There are 2400 such accesses in total after transformation -- imagine additional lines like these, for every distinct combination of dept and date (only for those dates that actually had sales, which they enumerate up-front), for store 200 and 250, and item_class 20, 35, and 50. This adds up to 2400 lines in total. Even 2400 index probes will be much faster than a full table scan, given that this is a large fact table. The "sales" table is a clustered index whose keys are on the columns "(dept, date, item_class, store)", per note at the top of page 4. The whole point is to avoid having any secondary indexes on this fact table, without getting a full scan. We can just probe the primary key 2400 times instead, following this transformation. No need for secondary indexes. The plan can be thought of as a DAG, at least informally. This is also somewhat similar to what Tom was thinking about back in 2018. Tom had to deduplicate rows during execution (IIRC using a UNION style ad-hoc approach that sorted on TIDs), whereas I think that they can get away with skipping that extra step. Page 7 says "MDAM removes duplicates before reading the data, so it does not have to do any post read operations to accomplish duplicate elimination (a common problem with OR optimization)". My general concern is that the skip scan patch may currently be structured in a way that paints us into a corner, MDAM-wise. Note that the MDAM paper treats skipping a prefix of columns as a case where the prefix is handled by pretending that there is a clause that looks like this: "WHERE date between -inf AND +inf" -- which is not so different from the original sales SQL query example that I have highlighted. We don't tend to think of queries like this (like my sales query) as in any way related to skip-scan, because we don't imagine that there is any skipping going on. But maybe we should recognize the similarities. BTW, these imaginary -inf/+inf values seem to me to be just like the sentinel values already used inside nbtree, for pivot tuples -- they have explicit -inf values for truncated suffix key columns, and you can think of a rightmost page as having a +inf high key, per the L&Y paper. Wearing my B-Tree hat, I don't see much difference between imaginary -inf/+inf values, and values from the BETWEEN "date" range from the example SQL query. I have in the past wondered if _bt_get_endpoint() should have been implemented that way -- we could go through _bt_search() instead, and get rid of that code. All we need is insertion scan keys that can explicitly contain the same -inf/+inf sentinel values. Maybe this also allows us to get rid of BTScanInsertData.nextkey semantics (not sure offhand). Another more concrete concern about the patch series comes from the backwards scan stuff. This is added by a later patch in the patch series, "v39-0004-Extend-amskip-implementation-for-Btree.patch". It strikes me as a bad thing that we cannot just do leaf-page-at-a-time processing, without usually needing to hold a pin on the leaf page. After all, ordinary backwards scans manage to avoid that today, albeit by using trickery inside _bt_walk_left(). MDAM-style "Maintenance of Index Order" (as described on page 8) seems like a good goal for us here. I don't like the idea of doing ad-hoc duplicate TID elimination inside nbtree, across calls made from the executor (whether it's during backwards skip scans, or at any other time). Not because it seems to go against the approach taken by the MDAM paper (though it does); just because it seems kludgy. (I think that Tom felt the same way about the TID deduplication stuff in his own patch back in 2018, too.) Open question: What does all of this MDAM business mean for ScalarArrayOpExpr, if anything? I freely admit that I could easily be worrying over nothing here. But if I am, I'd really like to know *why* that's the case. [1] http://vldb.org/conf/1995/P710.PDF [2] https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ [3] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com [4] https://www.postgresql.org/message-id/flat/14593.1517581614%40sss.pgh.pa.us#caf373b36332f25acb7673bff331c95e -- Peter Geoghegan
Great to see some interest in the skip scan patch series again! > Like many difficult patches, the skip scan patch is not so much troubled by > problems with the implementation as it is troubled by > *ambiguity* about the design. Particularly concerning how skip scan meshes > with existing designs, as well as future designs -- particularly designs for > other MDAM techniques. I've started this thread to have a big picture > conversation about how to think about these things. Many other MDAM > techniques also seem highly appealing. I think it is good to have this discussion. In my opinion, Postgres could make really good use of some of the described MDAMtechniques. > Much of the MDAM stuff is for data warehousing use-cases, while skip > scan/loose index scan is seen as more of an OLTP thing. But they are still > related, clearly. FWIW I think skip scan is very much data warehousing use-case related - hence why the TimescaleDb people in your [2] referenceimplemented a simple form of it already for their extension. Skip scan is a really useful feature for large datasets. However, I agree it is only one part of the bigger MDAM picture. > > My general concern is that the skip scan patch may currently be structured in > a way that paints us into a corner, MDAM-wise. > One of the concerns I raised before was that the patch may be thinking too simplistically on some things, that would makeit difficult to adopt more complex optimizations in the future. One concrete example can be illustrated by a differentquery on the sales table of the paper's example: SELECT DISTINCT dept, date WHERE item_class = 100 This should skip with prefix of (dept, date). Suppose we're at (dept, date) = (1, 2021-01-01) and it's skipping to the nextprefix, the patch just implements what the MDAM paper describes as the 'probing' step. It finds the beginning of thenext prefix. This could be for example (dept, date, item_class) = (1, 2021-01-02, 1). From there onwards, it would justscan the index until it finds item_class=100. What it should do however, is to first 'probe' for the next prefix valueand then skip directly to (1, 2021-01-02, 100) (skipping item_class 1-99 altogether). The problem if it doesn't supportthis is that skip scan could have a quite unpredictable performance, because sometimes it'll end up going throughmost of the index where it should be skipping. A while ago, I spent quite some time trying to come up with an implementation that would work in this more general case.The nice thing is that with such a more generic implementation, you get almost all the features from the MDAM paperalmost for free. I focused on the executor code, not on the planner code - the planner code is for the DISTINCT skippart very similar to the original patch and I hacked in a way to make it choose a 'skip scan' also for non-DISTINCT queriesfor testing purposes. For this discussion about MDAM, the planner part is less relevant though. There's still a lotof discussion and work on the planner-side too, but I think that is completely independent from each other. The more generic patch I originally posted in [1], together with some technical considerations. That was quite a while agoso it obviously doesn't apply anymore on master. Therefore, I've attached a rebased version. Unfortunately, it's stillbased on an older version of the UniqueKeys patch - but since that patch is all planner machinery as well, it doesn'tmatter so much for the discussion about the executor code either. I believe if we want something that fits better with future MDAM use cases, we should take a closer look at the executorcode of this patch to drive this discussion. The logic is definitely more complex than the original patch, howeverI believe it is also more flexible and future-proof. > > Another more concrete concern about the patch series comes from the > backwards scan stuff. This is added by a later patch in the patch series, "v39- > 0004-Extend-amskip-implementation-for-Btree.patch". It strikes me as a bad > thing that we cannot just do leaf-page-at-a-time processing, without usually > needing to hold a pin on the leaf page. > After all, ordinary backwards scans manage to avoid that today, albeit by > using trickery inside _bt_walk_left(). MDAM-style "Maintenance of Index > Order" (as described on page 8) seems like a good goal for us here. I don't > like the idea of doing ad-hoc duplicate TID elimination inside nbtree, across > calls made from the executor (whether it's during backwards skip scans, or at > any other time). Not because it seems to go against the approach taken by > the MDAM paper (though it does); just because it seems kludgy. (I think that > Tom felt the same way about the TID deduplication stuff in his own patch > back in 2018, > too.) It's good to mention that the patch I attached does proper 'leaf-page-at-a-time' processing, so it avoids the problem youdescribe with v39. It is implemented instead in the same way as a "regular" index scan - we process the full leaf pageand store the matched tuples in the local state. If a DISTINCT scan wants to do a skip, we check our local state firstif that skipping would be possible with the matched tuples from the current page. Then we avoid double work and alsothe need to look at the same page again. > Open question: What does all of this MDAM business mean for > ScalarArrayOpExpr, if anything? > This is a really interesting combination actually. I think, ideally, you'd probably get rid of it and provide full supportfor that with the 'skip' based approach (essentially the ScalarArrayOpExpr seems to do some form of skipping already- it transforms x IN (1,2,3) into 3 separate index scans for x). However, even without doing any work on it, it actually interacts nicely with the skip based approach. As an example, here's some queries based on the 'sales' table of the paper with some data in it (18M rows total, see sales_query.sqlattachment for the full example): -- terminology from paper: "intervening range predicate" select date, sum(total_sales) from sales where dept between 2 and 3 and date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250 group by dept, date ; Patch: Execution Time: 0.368 ms Master: Execution Time: 416.792 ms -- terminology from paper: "missing key predicate" select date, sum(total_sales) from sales where date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250 group by dept, date ; Patch: Execution Time: 0.667 ms Master: Execution Time: 654.684 ms -- terminology from paper: "IN lists" -- this is similar to the query from your example with an IN list -- in the current patch, this query is done with a skip scan with skip prefix (dept, date) and then the ScalarOpArray foritem_class=(20,30,50) select date, sum(total_sales) from sales where date between '2021-01-05' and '2021-01-10' and item_class in (20, 30, 50) and store=250 group by dept, date ; Patch: Execution Time: 1.767 ms Master: Execution Time: 629.792 ms The other mentioned MDAM optimizations in the paper (NOT =, general OR) are not implemented but don't seem to be conflictingat all with the current implementation - they seem to be just a matter of transforming the expressions into theright form. From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series, which focuses on the planner. v2-0001 is Dmitry'spatch that extends it to make it possible to use UniqueKeys for the skip scan. Both of these are unfortunately stillolder patches, but because they are planner machinery they are less relevant to the discussion about the executor here. Patch v2-0002 contains most of my work and introduces all the executor logic for the skip scan and hooks up the planner forDISTINCT queries to use the skip scan. Patch v2-0003 is a planner hack that makes the planner pick a skip scan on virtually every possibility. This also enablesthe skip scan on the queries above that don't have a DISTINCT but where it could be useful. -Floris [1] https://www.postgresql.org/message-id/c5c5c974714a47f1b226c857699e8680@opammb0561.comp.optiver.com
Attachment
> On Thu, Oct 21, 2021 at 07:16:00PM -0700, Peter Geoghegan wrote: > > My general concern is that the skip scan patch may currently be > structured in a way that paints us into a corner, MDAM-wise. > > Note that the MDAM paper treats skipping a prefix of columns as a case > where the prefix is handled by pretending that there is a clause that > looks like this: "WHERE date between -inf AND +inf" -- which is not so > different from the original sales SQL query example that I have > highlighted. We don't tend to think of queries like this (like my > sales query) as in any way related to skip-scan, because we don't > imagine that there is any skipping going on. But maybe we should > recognize the similarities. To avoid potential problems with extensibility in this sense, the implementation needs to explicitly work with sets of disjoint intervals of values instead of simple prefix size, one set of intervals per scan key. An interesting idea, doesn't seem to be a big change in terms of the patch itself.
Hi Peter, On 10/21/21 22:16, Peter Geoghegan wrote: > I returned to the 1995 paper "Efficient Search of Multidimensional > B-Trees" [1] as part of the process of reviewing v39 of the skip scan > patch, which was posted back in May. It's a great paper, and anybody > involved in the skip scan effort should read it thoroughly (if they > haven't already). It's easy to see why people get excited about skip > scan [2]. But there is a bigger picture here. > Thanks for starting this thread ! The Index Skip Scan patch could affect a lot of areas, so keeping MDAM in mind is definitely important. However, I think the key part to progress on the "core" functionality (B-tree related changes) is to get the planner functionality in place first. Hopefully we can make progress on that during the November CommitFest based on Andy's patch. Best regards, Jesper
Hi, On Sat, Oct 23, 2021 at 07:30:47PM +0000, Floris Van Nee wrote: > > From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series, > which focuses on the planner. v2-0001 is Dmitry's patch that extends it to > make it possible to use UniqueKeys for the skip scan. Both of these are > unfortunately still older patches, but because they are planner machinery > they are less relevant to the discussion about the executor here. Patch > v2-0002 contains most of my work and introduces all the executor logic for > the skip scan and hooks up the planner for DISTINCT queries to use the skip > scan. Patch v2-0003 is a planner hack that makes the planner pick a skip > scan on virtually every possibility. This also enables the skip scan on the > queries above that don't have a DISTINCT but where it could be useful. The patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_1741.log === Applying patches on top of PostgreSQL commit ID a18b6d2dc288dfa6e7905ede1d4462edd6a8af47 === === applying patch ./v2-0001-Extend-UniqueKeys.patch [...] patching file src/include/optimizer/paths.h Hunk #2 FAILED at 299. 1 out of 2 hunks FAILED -- saving rejects to file src/include/optimizer/paths.h.rej Could you send a rebased version? In the meantime I will change the status on the cf app to Waiting on Author.
> > Could you send a rebased version? In the meantime I will change the status > on the cf app to Waiting on Author. Attached a rebased version.
Attachment
> On Thu, Jan 13, 2022 at 03:27:08PM +0000, Floris Van Nee wrote: > > > > Could you send a rebased version? In the meantime I will change the status > > on the cf app to Waiting on Author. > > Attached a rebased version. FYI, I've attached this thread to the CF item as an informational one, but as there are some patches posted here, folks may get confused. For those who have landed here with no context, I feel obliged to mention that now there are two alternative patch series posted under the same CF item: * the original one lives in [1], waiting for reviews since the last May * an alternative one posted here from Floris [1]: https://www.postgresql.org/message-id/flat/20200609102247.jdlatmfyeecg52fi@localhost
Hi, On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > FYI, I've attached this thread to the CF item as an informational one, > but as there are some patches posted here, folks may get confused. For > those who have landed here with no context, I feel obliged to mention > that now there are two alternative patch series posted under the same > CF item: > > * the original one lives in [1], waiting for reviews since the last May > * an alternative one posted here from Floris Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. It's nice to have the alternative approach threads linkied in the commit fest, but it seems that the cfbot will use the most recent attachments as the only patchset, thus leaving the "original" one untested. I'm not sure of what's the best approach in such situation. Maybe creating a different CF entry for each alternative, and link the other cf entry on the cf app using the "Add annotations" or "Links" feature rather than attaching threads?
> On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote: > Hi, > > On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > > > FYI, I've attached this thread to the CF item as an informational one, > > but as there are some patches posted here, folks may get confused. For > > those who have landed here with no context, I feel obliged to mention > > that now there are two alternative patch series posted under the same > > CF item: > > > > * the original one lives in [1], waiting for reviews since the last May > > * an alternative one posted here from Floris > > Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. > It's nice to have the alternative approach threads linkied in the commit fest, > but it seems that the cfbot will use the most recent attachments as the only > patchset, thus leaving the "original" one untested. > > I'm not sure of what's the best approach in such situation. Maybe creating a > different CF entry for each alternative, and link the other cf entry on the cf > app using the "Add annotations" or "Links" feature rather than attaching > threads? I don't mind having all of the alternatives under the same CF item, only one being tested seems to be only a small limitation of cfbot.
Hi, On 2022-01-22 22:37:19 +0100, Dmitry Dolgov wrote: > > On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote: > > Hi, > > > > On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > > > > > FYI, I've attached this thread to the CF item as an informational one, > > > but as there are some patches posted here, folks may get confused. For > > > those who have landed here with no context, I feel obliged to mention > > > that now there are two alternative patch series posted under the same > > > CF item: > > > > > > * the original one lives in [1], waiting for reviews since the last May > > > * an alternative one posted here from Floris > > > > Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. > > It's nice to have the alternative approach threads linkied in the commit fest, > > but it seems that the cfbot will use the most recent attachments as the only > > patchset, thus leaving the "original" one untested. > > > > I'm not sure of what's the best approach in such situation. Maybe creating a > > different CF entry for each alternative, and link the other cf entry on the cf > > app using the "Add annotations" or "Links" feature rather than attaching > > threads? > > I don't mind having all of the alternatives under the same CF item, only > one being tested seems to be only a small limitation of cfbot. IMO it's pretty clear that having "duelling" patches below one CF entry is a bad idea. I think they should be split, with inactive approaches marked as returned with feeback or whatnot. Either way, currently this patch fails on cfbot due to a new GUC: https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs https://cirrus-ci.com/task/5134905372835840 Greetings, Andres Freund
> On Mon, Mar 21, 2022 at 06:34:09PM -0700, Andres Freund wrote: > > > I don't mind having all of the alternatives under the same CF item, only > > one being tested seems to be only a small limitation of cfbot. > > IMO it's pretty clear that having "duelling" patches below one CF entry is a > bad idea. I think they should be split, with inactive approaches marked as > returned with feeback or whatnot. On the other hand even for patches with dependencies (i.e. the patch A depends on the patch B) different CF items cause a lot of confusion for reviewers. I guess for various flavours of the same patch it would be even worse. But I don't have a strong opinion here. > Either way, currently this patch fails on cfbot due to a new GUC: > https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs > https://cirrus-ci.com/task/5134905372835840 This seems to be easy to solve.
Attachment
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund <andres@anarazel.de> wrote: > IMO it's pretty clear that having "duelling" patches below one CF entry is a > bad idea. I think they should be split, with inactive approaches marked as > returned with feeback or whatnot. I have the impression that this thread is getting some value from having a CF entry, as a multi-person collaboration where people are trading ideas and also making progress that no one wants to mark as returned, but it's vexing for people managing the CF because it's not really proposed for 15. Perhaps what we lack is a new status, "Work In Progress" or something?
Peter Geoghegan <pg@bowt.ie> writes: > Like many difficult patches, the skip scan patch is not so much > troubled by problems with the implementation as it is troubled by > *ambiguity* about the design. Particularly concerning how skip scan > meshes with existing designs, as well as future designs -- > particularly designs for other MDAM techniques. I've started this > thread to have a big picture conversation about how to think about > these things. Peter asked me off-list to spend some time thinking about the overall direction we ought to be pursuing here. I have done that, and here are a few modest suggestions. 1. Usually I'm in favor of doing this sort of thing in an index AM agnostic way, but here I don't see much point. All of the ideas at stake rely fundamentally on having a lexicographically-ordered multi column index; but we don't have any of those except btree, nor do I think we're likely to get any soon. This motivates the general tenor of my remarks below, which is "do it in access/nbtree/ not in the planner". 2. The MDAM paper Peter cited is really interesting. You can see fragments of those ideas in our existing btree code, particularly in the scan setup stuff that detects redundant or contradictory keys and determines a scan start strategy. The special handling we implemented awhile ago for ScalarArrayOp index quals is also a subset of what they were talking about. It seems to me that if we wanted to implement more of those ideas, the relevant work should almost all be done in nbtree proper. The planner would need only minor adjustments: btcostestimate would have to be fixed to understand the improvements, and there are some rules in indxpath.c that prevent us from passing "too complicated" sets of indexquals to the AM, which would need to be relaxed or removed altogether. 3. "Loose" indexscan (i.e., sometimes re-descend from the tree root to find the next index entry) is again something that seems like it's mainly nbtree's internal problem. Loose scan is interesting if we have index quals for columns that are after the first column that lacks an equality qual, otherwise not. I've worried in the past that we'd need planner/statistical support to figure out whether a loose scan is likely to be useful compared to just plowing ahead in the index. However, that seems to be rendered moot by the idea used in the current patchsets, ie scan till we find that we'll have to step off the current page, and re-descend at that point. (When and if we find that that heuristic is inadequate, we could work on passing some statistical data forward. But we don't need any in the v1 patch.) Again, we need some work in btcostestimate to understand how the index scan cost will be affected, but I still don't see any pressing need for major planner changes or plan tree contents changes. 4. I find each of the above ideas to be far more attractive than optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really understand why the current patchsets seem to be driven largely by that single use-case. I wouldn't even bother with that for the initial patch. When we do get around to it, it still doesn't need major planner support, I think --- again fixing the cost estimation is the bulk of the work. Munro's original 2014 patch showed that we don't really need all that much to get the planner to build such a plan; the problem is to convince it that that plan will be cheap. In short: I would throw out just about all the planner infrastructure that's been proposed so far. It looks bulky, expensive, and drastically undercommented, and I don't think it's buying us anything of commensurate value. The part of the planner that actually needs serious thought is btcostestimate, which has been woefully neglected in both of the current patchsets. BTW, I've had a bee in my bonnet for a long time about whether some of nbtree's scan setup work could be done once during planning, rather than over again during each indexscan start. This issue might become more pressing if the work becomes significantly more complicated/expensive, which these ideas might cause. But it's a refinement that could be left for later --- and in any case, the responsibility would still fundamentally be nbtree's. I don't think the planner would do more than call some AM routine that could add decoration to an IndexScan plan node. Now ... where did I put my flameproof vest? regards, tom lane
Hi, On 2022-03-22 16:55:49 -0400, Tom Lane wrote: > 4. I find each of the above ideas to be far more attractive than > optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really > understand why the current patchsets seem to be driven largely > by that single use-case. It's something causing plenty pain in production environments... Obviously it'd be even better if the optimization also triggered in cases like SELECT some_indexed_col FROM blarg GROUP BY some_indexed_col which seems to be what ORMs like to generate. > BTW, I've had a bee in my bonnet for a long time about whether some of > nbtree's scan setup work could be done once during planning, rather than > over again during each indexscan start. It does show up in simple-index-lookup heavy workloads. Not as a major thing, but it's there. And it's just architecturally displeasing :) Are you thinking of just moving the setup stuff in nbtree (presumably parts of _bt_first() / _bt_preprocess_keys()) or also stuff in ExecIndexBuildScanKeys()? The latter does show up a bit more heavily in profiles than nbtree specific setup, and given that it's generic executor type stuff, seems even more amenable to being moved to plan time. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2022-03-22 16:55:49 -0400, Tom Lane wrote: >> BTW, I've had a bee in my bonnet for a long time about whether some of >> nbtree's scan setup work could be done once during planning, rather than >> over again during each indexscan start. > It does show up in simple-index-lookup heavy workloads. Not as a major thing, > but it's there. And it's just architecturally displeasing :) > Are you thinking of just moving the setup stuff in nbtree (presumably parts of > _bt_first() / _bt_preprocess_keys()) or also stuff in > ExecIndexBuildScanKeys()? Didn't really have specifics in mind. The key stumbling block is that some (not all) of the work depends on knowing the specific values of the indexqual comparison keys, so while you could do that work in advance for constant keys, you'd still have to be prepared to do work at scan start for non-constant keys. I don't have a clear idea about how to factorize that effectively. A couple of other random ideas in this space: * I suspect that a lot of this work overlaps with the efforts that btcostestimate makes along the way to getting a cost estimate. So it's interesting to wonder whether we could refactor so that btcostestimate is integrated with this hypothetical plan-time key preprocessing and doesn't duplicate work. * I think that we run through most or all of that preprocessing logic even for internal catalog accesses, where we know darn well how the keys are set up. We ought to think harder about how we could short-circuit pointless work in those code paths. I don't think any of this is an essential prerequisite to getting something done for loose index scans, which ISTM ought to be the first point of attack for v16. Loose index scans per se shouldn't add much to the key preprocessing costs. But these ideas likely would be useful to look into before anyone starts on the more complicated preprocessing that would be needed for the ideas in the MDAM paper. regards, tom lane
> On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > Like many difficult patches, the skip scan patch is not so much > > troubled by problems with the implementation as it is troubled by > > *ambiguity* about the design. Particularly concerning how skip scan > > meshes with existing designs, as well as future designs -- > > particularly designs for other MDAM techniques. I've started this > > thread to have a big picture conversation about how to think about > > these things. > > Peter asked me off-list to spend some time thinking about the overall > direction we ought to be pursuing here. I have done that, and here > are a few modest suggestions. Thanks. To make sure I understand your proposal better, I have a couple of questions: > In short: I would throw out just about all the planner infrastructure > that's been proposed so far. It looks bulky, expensive, and > drastically undercommented, and I don't think it's buying us anything > of commensurate value. Broadly speaking planner related changes proposed in the patch so far are: UniqueKey, taken from the neighbour thread about select distinct; list of uniquekeys to actually pass information about the specified loose scan prefix into nbtree; some verification logic to prevent applying skipping when it's not supported. I can imagine taking out UniqueKeys and passing loose scan prefix in some other form (the other parts seems to be essential) -- is that what you mean?
Dmitry Dolgov <9erthalion6@gmail.com> writes: > On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: >> In short: I would throw out just about all the planner infrastructure >> that's been proposed so far. It looks bulky, expensive, and >> drastically undercommented, and I don't think it's buying us anything >> of commensurate value. > Broadly speaking planner related changes proposed in the patch so far > are: UniqueKey, taken from the neighbour thread about select distinct; > list of uniquekeys to actually pass information about the specified > loose scan prefix into nbtree; some verification logic to prevent > applying skipping when it's not supported. I can imagine taking out > UniqueKeys and passing loose scan prefix in some other form (the other > parts seems to be essential) -- is that what you mean? My point is that for pure loose scans --- that is, just optimizing a scan, not doing AM-based duplicate-row-elimination --- you do not need to pass any new data to btree at all. It can infer what to do on the basis of the set of index quals it's handed. The bigger picture here is that I think the reason this patch series has failed to progress is that it's too scattershot. You need to pick a minimum committable feature and get that done, and then you can move on to the next part. I think the minimum committable feature is loose scans, which will require a fair amount of work in access/nbtree/ but very little new planner code, and will be highly useful in their own right even if we never do anything more. In general I feel that the UniqueKey code is a solution looking for a problem, and that treating it as the core of the patchset is a mistake. We should be driving this work off of what nbtree needs to make progress, and not building more infrastructure elsewhere than we have to. Maybe we'll end up with something that looks like UniqueKeys, but I'm far from convinced of that. regards, tom lane
> On Wed, Mar 23, 2022 at 05:32:46PM -0400, Tom Lane wrote: > Dmitry Dolgov <9erthalion6@gmail.com> writes: > > On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: > >> In short: I would throw out just about all the planner infrastructure > >> that's been proposed so far. It looks bulky, expensive, and > >> drastically undercommented, and I don't think it's buying us anything > >> of commensurate value. > > > Broadly speaking planner related changes proposed in the patch so far > > are: UniqueKey, taken from the neighbour thread about select distinct; > > list of uniquekeys to actually pass information about the specified > > loose scan prefix into nbtree; some verification logic to prevent > > applying skipping when it's not supported. I can imagine taking out > > UniqueKeys and passing loose scan prefix in some other form (the other > > parts seems to be essential) -- is that what you mean? > > My point is that for pure loose scans --- that is, just optimizing a scan, > not doing AM-based duplicate-row-elimination --- you do not need to pass > any new data to btree at all. It can infer what to do on the basis of the > set of index quals it's handed. > > The bigger picture here is that I think the reason this patch series has > failed to progress is that it's too scattershot. You need to pick a > minimum committable feature and get that done, and then you can move on > to the next part. I think the minimum committable feature is loose scans, > which will require a fair amount of work in access/nbtree/ but very little > new planner code, and will be highly useful in their own right even if we > never do anything more. > > In general I feel that the UniqueKey code is a solution looking for a > problem, and that treating it as the core of the patchset is a mistake. > We should be driving this work off of what nbtree needs to make progress, > and not building more infrastructure elsewhere than we have to. Maybe > we'll end up with something that looks like UniqueKeys, but I'm far from > convinced of that. I see. I'll need some thinking time about how it may look like (will probably return with more questions). The CF item could be set to RwF, what would you say, Jesper?
On 3/23/22 18:22, Dmitry Dolgov wrote: > > The CF item could be set to RwF, what would you say, Jesper? > We want to thank the community for the feedback that we have received over the years for this feature. Hopefully a future implementation can use Tom's suggestions to get closer to a committable solution. Here is the last CommitFest entry [1] for the archives. RwF [1] https://commitfest.postgresql.org/37/1741/ Best regards, Dmitry & Jesper
On Tue, Mar 22, 2022 at 4:06 PM Andres Freund <andres@anarazel.de> wrote: > Are you thinking of just moving the setup stuff in nbtree (presumably parts of > _bt_first() / _bt_preprocess_keys()) or also stuff in > ExecIndexBuildScanKeys()? > > The latter does show up a bit more heavily in profiles than nbtree specific > setup, and given that it's generic executor type stuff, seems even more > amenable to being moved to plan time. When I was working on the patch series that became the nbtree Postgres 12 work, this came up. At one point I discovered that using palloc0() for the insertion scankey in _bt_first() was a big problem with nested loop joins -- it became a really noticeable bottleneck with one of my test cases. I independently discovered what Tom must have figured out back in 2005, when he committed d961a56899. This led to my making the new insertion scan key structure (BTScanInsertData) not use dynamic allocation. So _bt_first() is definitely performance critical for certain types of queries. We could get rid of dynamic allocations for BTStackData in _bt_first(), perhaps. The problem is that there is no simple, reasonable proof of the maximum height on a B-tree, even though a B-Tree with more than 7 or 8 levels seems extraordinarily unlikely. You could also invent a slow path (maybe do what we do in _bt_insert_parent() in the event of a concurrent root page split/NULL stack), but that runs into the problem of being awkward to test, and pretty ugly. It's doable, but I wouldn't do it unless there was a pretty noticeable payoff. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > We could get rid of dynamic allocations for BTStackData in > _bt_first(), perhaps. The problem is that there is no simple, > reasonable proof of the maximum height on a B-tree, even though a > B-Tree with more than 7 or 8 levels seems extraordinarily unlikely. Start with a few entries preallocated, and switch to dynamically allocated space if there turn out to be more levels than that, perhaps? Not sure if it's worth the trouble. In any case, what I was on about is _bt_preprocess_keys() and adjacent code. I'm surprised that those aren't more expensive than one palloc in _bt_first. Maybe that logic falls through very quickly in simple cases, though. regards, tom lane
On Tue, Mar 22, 2022 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter asked me off-list to spend some time thinking about the overall > direction we ought to be pursuing here. Thanks for taking a look! "5.5 Exploiting Key Prefixes" and "5.6 Ordered Retrieval" from "Modern B-Tree Techniques" are also good, BTW. The terminology in this area is a mess. MySQL calls SELECT-DISTINCT-that-matches-an-index "loose index scans". I think that you're talking about skip scan when you say "loose index scan". Skip scan is where there is an omitted prefix of columns in the SQL query -- omitted columns after the first column that lack an equality qual. Pretty sure that MySQL/InnoDB can't do that -- it can only "skip" to the extent required to make SELECT-DISTINCT-that-matches-an-index perform well, but that's about it. It might be useful for somebody to go write a "taxonomy of MDAM techniques", or a glossary. The existing "Loose indexscan" Postgres wiki page doesn't seem like enough. Something very high level and explicit, with examples, just so we don't end up talking at cross purposes too much. > 1. Usually I'm in favor of doing this sort of thing in an index AM > agnostic way, but here I don't see much point. All of the ideas at > stake rely fundamentally on having a lexicographically-ordered multi > column index; but we don't have any of those except btree, nor do > I think we're likely to get any soon. This motivates the general > tenor of my remarks below, which is "do it in access/nbtree/ not in > the planner". That was my intuition all along, but I didn't quite have the courage to say so -- sounds too much like something that an optimizer dilettante like me would be expected to say. :-) Seems like one of those things where lots of high level details intrinsically need to be figured out on-the-fly, at execution time, rather than during planning. Perhaps it'll be easier to correctly determine that a skip scan plan is the cheapest in practice than to accurately cost skip scan plans. If the only alternative is a sequential scan, then perhaps a very approximate cost model will work well enough. It's probably way too early to tell right now, though. > I've worried in the past that we'd > need planner/statistical support to figure out whether a loose scan > is likely to be useful compared to just plowing ahead in the index. I don't expect to be able to come up with a structure that leaves no unanswered questions about future MDAM work -- it's not realistic to expect everything to just fall into place. But that's okay. Just having everybody agree on roughly the right conceptual model is the really important thing. That now seems quite close, which I count as real progress. > 4. I find each of the above ideas to be far more attractive than > optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really > understand why the current patchsets seem to be driven largely > by that single use-case. I wouldn't even bother with that for the > initial patch. I absolutely agree. I wondered about that myself in the past. My best guess is that a certain segment of users are familiar with SELECT-DISTINCT-that-matches-an-index from MySQL. And so to some extent application frameworks evolved in a world where that capability existed. IIRC Jesper once said that Hibernate relied on this capability. It's probably a lot easier to implement SELECT-DISTINCT-that-matches-an-index if you have the MySQL storage engine model, with concurrency control that's typically based on two-phase locking. I think that MySQL does some amount of deduplication in its executor here -- and *not* in what they call the storage engine. This is exactly what I'd like to avoid in Postgres; as I said "Maintenance of Index Order" (as the paper calls it) seems important, and not something to be added later on. Optimizer and executor layers that each barely know the difference between a skip scan and a full index scan seems like something we might actually want to aim for, rather than avoid. Teaching nbtree to transform quals into ranges sounds odd at first, but it seems like the right approach now, on balance -- that's the only *good* way to maintain index order. (Maintaining index order is needed to avoid needing or relying on deduplication in the executor proper, which is even inappropriate in an implementation of SELECT-DISTINCT-that-matches-an-index IMO.) -- Peter Geoghegan
On Mon, Mar 28, 2022 at 5:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > In any case, what I was on about is _bt_preprocess_keys() and > adjacent code. I'm surprised that those aren't more expensive > than one palloc in _bt_first. Maybe that logic falls through very > quickly in simple cases, though. I assume that it doesn't really appear in very simple cases (also common cases). But delaying the scan setup work until execution time does seem ugly. That's probably a good enough reason to refactor. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > The terminology in this area is a mess. MySQL calls > SELECT-DISTINCT-that-matches-an-index "loose index scans". I think > that you're talking about skip scan when you say "loose index scan". > Skip scan is where there is an omitted prefix of columns in the SQL > query -- omitted columns after the first column that lack an equality > qual. Right, that's the case I had in mind --- apologies if my terminology was faulty. btree can actually handle such a case now, but what it fails to do is re-descend from the tree root instead of plowing forward in the index to find the next matching entry. > It might be useful for somebody to go write a "taxonomy of MDAM > techniques", or a glossary. +1. We at least need to be sure we all are using these terms the same way. regards, tom lane
On Mon, Mar 28, 2022 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Right, that's the case I had in mind --- apologies if my terminology > was faulty. btree can actually handle such a case now, but what it > fails to do is re-descend from the tree root instead of plowing > forward in the index to find the next matching entry. KNNGIST seems vaguely related to what we'd build for nbtree skip scan, though. GiST index scans are "inherently loose", though. KNNGIST uses a pairing heap/priority queue, which seems like the kind of thing nbtree skip scan can avoid. > +1. We at least need to be sure we all are using these terms > the same way. Yeah, there are *endless* opportunities for confusion here. -- Peter Geoghegan