Thread: Index Skip Scan
Hi all,

I would like to start a discussion on Index Skip Scan, referred to as Loose Index Scan in the wiki [1]. My use-case is the simplest form of Index Skip Scan (B-tree only), namely going from

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;

 HashAggregate  (cost=169247.71..169247.74 rows=3 width=4) (actual time=4104.099..4104.099 rows=3 loops=1)
   Output: b
   Group Key: t1.b
   Buffers: shared hit=44248
   ->  Seq Scan on public.t1  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.059..1050.376 rows=10000000 loops=1)
         Output: a, b
         Buffers: shared hit=44248
 Planning Time: 0.157 ms
 Execution Time: 4104.155 ms
(9 rows)

to

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 10000000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;

 Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4) (actual time=0.061..0.137 rows=3 loops=1)
   Output: b
   Heap Fetches: 3
   Buffers: shared hit=13
 Planning Time: 0.155 ms
 Execution Time: 0.170 ms
(6 rows)

I took Thomas Munro's previous patch [2] on the subject, added a GUC, a test case, documentation hooks, minor code cleanups, and made the patch pass an --enable-cassert make check-world run. So, the overall design is the same. However, as Robert Haas noted in that thread, there are issues with the patch as is, especially in relation to the amcanbackward functionality.

A couple of questions to begin with:

Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should a new node (T_IndexSkipScan) be created? If the latter, then there will likely be functionality that needs to be refactored into shared code between the nodes.

Which is the best way to deal with the amcanbackward functionality? Do people see another alternative to Robert's idea of adding a flag to the scan?

I wasn't planning on making this a patch submission for the July CommitFest due to the reasons mentioned above, but can do so if people think it is best. The patch is based on master/4c8156.

Any feedback, suggestions, design ideas, and help with the patch in general is greatly appreciated. Thanks in advance!

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Best regards,
Jesper
Attachment
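The difference between the two plans above can be sketched with a toy model (Python here; `bisect` stands in for a fresh B-tree descent, and none of this is actual patch code):

```python
import bisect

def distinct_via_scan(keys):
    """Baseline: visit every tuple, like the Seq Scan + HashAggregate plan."""
    seen = set()
    for k in keys:
        seen.add(k)
    return sorted(seen)

def distinct_via_skip(sorted_keys):
    """Skip scan over a sorted index: after emitting a value, search for the
    first entry greater than it (as a fresh B-tree descent would), instead
    of stepping over every duplicate."""
    result = []
    pos, n = 0, len(sorted_keys)
    while pos < n:
        value = sorted_keys[pos]
        result.append(value)
        pos = bisect.bisect_right(sorted_keys, value, pos)  # jump the group
    return result

# mirrors the i % 3 test table above, at a smaller scale
index_keys = sorted(i % 3 for i in range(100_000))
```

The skip variant touches O(ndistinct * log n) index entries rather than all n, which is where the 44248-buffer versus 13-buffer difference in the plans above comes from.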
Hi!

On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

Great, I'm glad to see you working on this!

> However, as Robert Haas noted in the thread there are issues with the
> patch as is, especially in relationship to the amcanbackward functionality.
>
> A couple of questions to begin with.
>
> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
> a new node (T_IndexSkipScan) be created ? If latter, then there likely
> will be functionality that needs to be refactored into shared code
> between the nodes.

Is a skip scan only possible for an index-only scan? I guess not. We could also make a plain index scan behave like a skip scan, and it should be useful for accelerating the DISTINCT ON clause. Thus, we might have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan, IndexOnlySkipScan. So, I don't think I like index scan nodes multiplying this way; it would probably be better to keep the number of nodes smaller. But I don't insist on that, and I would like to hear other opinions too.

> Which is the best way to deal with the amcanbackward functionality ? Do
> people see another alternative to Robert's idea of adding a flag to the
> scan.

Supporting amcanbackward seems to be basically possible, but rather complicated and not very efficient, so it seems not worth implementing, at least in the initial version. Then the question is how an index access method should report that it supports both skip scan and backward scan, but not both together. What about turning amcanbackward into a function which takes a (bool skipscan) argument? This function would then return whether backward scan is supported depending on whether skip scan is used.

> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best. The patch is based on master/4c8156.

Please register it on the commitfest. Even if there isn't enough time for this patch in the July commitfest, it's no problem to move it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> Hi all,
>
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

awesome

> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best.

New large features are not appropriate for the July CF. September should be your goal.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> > I wasn't planning on making this a patch submission for the July
> > CommitFest due to the reasons mentioned above, but can do so if people
> > thinks it is best.
>
> New large features are not appropriate for the July CF. September should
> be your goal.

Assuming this, should we have the possibility to register a patch for the September CF from now?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
> Assuming this, should we have the possibility to register a patch for
> the September CF from now?

There cannot be two commit fests marked as open at the same time, as Magnus mentions here:
https://www.postgresql.org/message-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com

There is no issue for admins in registering new patches in future ones, but normal users cannot, right? In this case, could you wait until the next CF is marked as in progress and the September one is opened? You could also add it to the July one if you are not willing to wait, and it will get bumped by one of the CFMs, but that makes the whole process unnecessarily noisy.
--
Michael
Attachment
> On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.

In one of the patches I'm working on I had a similar situation: I wanted to split one node into two similar nodes (before, I had just extended it), and logically it made perfect sense. But it turned out to be quite useless and the advantage I got wasn't worth it; and just to mention, those nodes had more differences than in this patch. So I agree that it would probably be better to keep using IndexOnlyScan.

> On 19 June 2018 at 03:40, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
>> Assuming this, should we have possibility to register patch to
>> September CF from now?
>
> There cannot be two commit fests marked as open at the same time as
> Magnus mentions here:
> https://www.postgresql.org/message-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com
>
> In this case, could you wait that the next CF is marked as in progress and
> that the one of September is opened?

Yep, since the next CF will start shortly that's the easiest thing to do.
Hi Alexander,

On 06/18/2018 01:31 PM, Alexander Korotkov wrote:
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.

Yes, there are likely other use-cases for Index Skip Scan apart from the simplest form, which sort of suggests that having dedicated nodes would be better in the long run. My goal is to cover the simplest form, which can be handled by extending the T_IndexOnlyScan node, or by having common functions that both use. We can always improve the functionality with future patches.

>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
>
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient. So, it seems to not worth
> implementing, at least in the initial version. And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together? What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.

The feedback from Robert Haas seems to suggest that it was a requirement for the patch to be considered.

>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
>
> Please, register it on commitfest. If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.

Based on the feedback from Andrew and Michael I won't register this thread until the September CF.

Thanks for your feedback!

Best regards,
Jesper
Hi Dmitry,

On 06/19/2018 06:01 AM, Dmitry Dolgov wrote:
>> On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> Is skip scan only possible for index-only scan? I guess, that no. We
>> could also make plain index scan to behave like a skip scan. And it
>> should be useful for accelerating DISTINCT ON clause. Thus, we might
>> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
>> IndexOnlySkipScan. So, I don't think I like index scan nodes to
>> multiply this way, and it would be probably better to keep number
>> nodes smaller. But I don't insist on that, and I would like to hear
>> other opinions too.
>
> In one of patches I'm working on I had similar situation, when I wanted to
> split one node into two similar nodes (before I just extended it) and logically
> it made perfect sense. But it turned out to be quite useless and the advantage
> I've got wasn't worth it - and just to mention, those nodes had more differences
> than in this patch. So I agree that probably it would be better to keep using
> IndexOnlyScan.

I looked at this today, and creating a new node (T_IndexOnlySkipScan) would make the patch more complex. The question is whether the patch should create such a node now, so that future patches wouldn't have to deal with refactoring to a new node to cover additional functionality.

Thanks for your feedback!

Best regards,
Jesper
Hello Jesper,

I was reviewing the index-skip patch example and have a comment on it. The example query "select distinct b from t1" is equivalent to "select b from t1 group by b". When I tried the 2nd form of the query it came up with a different plan; is it possible that index skip scan can address it as well?

postgres=# explain verbose select b from t1 group by b;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Group  (cost=97331.29..97332.01 rows=3 width=4)
   Output: b
   Group Key: t1.b
   ->  Gather Merge  (cost=97331.29..97331.99 rows=6 width=4)
         Output: b
         Workers Planned: 2
         ->  Sort  (cost=96331.27..96331.27 rows=3 width=4)
               Output: b
               Sort Key: t1.b
               ->  Partial HashAggregate  (cost=96331.21..96331.24 rows=3 width=4)
                     Output: b
                     Group Key: t1.b
                     ->  Parallel Seq Scan on public.t1  (cost=0.00..85914.57 rows=4166657 width=4)
                           Output: a, b
(14 rows)

Time: 1.167 ms

And here is the original example:

postgres=# explain verbose SELECT DISTINCT b FROM t1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4)
   Output: b
(2 rows)

Time: 0.987 ms

> On Jun 18, 2018, at 10:31 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> Hi!
>
> On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> I would like to start a discussion on Index Skip Scan referred to as
>> Loose Index Scan in the wiki [1].
>
> Great, I glad to see you working in this!
>
>> However, as Robert Haas noted in the thread there are issues with the
>> patch as is, especially in relationship to the amcanbackward functionality.
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan? I guess, that no. We
> could also make plain index scan to behave like a skip scan. And it
> should be useful for accelerating DISTINCT ON clause. Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan. So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller. But I don't insist on that, and I would like to hear
> other opinions too.
>
>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
>
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient. So, it seems to not worth
> implementing, at least in the initial version. And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together? What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.
>
>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
>
> Please, register it on commitfest. If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
On Thu, Aug 16, 2018 at 5:44 PM, Bhushan Uparkar <bhushan.uparkar@gmail.com> wrote:
> I was reviewing index-skip patch example and have a comment on it. Example
> query "select distinct b from t1" is equivalent to "select b from t1 group
> by b". When I tried the 2nd form of query it came up with different plan,
> is it possible that index skip scan can address it as well?

Yeah, there are a few tricks you can do with "index skip scans" (the Oracle name; IBM calls them "index jump scans"... I was slightly tempted to suggest we call ours "index hop scans"...). For example:

* groups and certain aggregates (MIN() and MAX() of suffix index columns within each group)
* index scans where the scan key doesn't include the leading columns (but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)

You're right that a very simple GROUP BY can be equivalent to a DISTINCT query, but I'm not sure if it's worth recognising that directly or trying to implement the more general grouping trick that can handle MIN/MAX, and whether that should be the same executor node... The idea of starting with DISTINCT was just that it's comparatively easy. We should certainly try to look ahead and bear those features in mind when figuring out the interfaces, though. Would the indexam skip(scan, direction, prefix_size) operation I proposed be sufficient? Is there a better way?

I'm glad to see this topic come back!

--
Thomas Munro
http://www.enterprisedb.com
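The proposed skip(scan, direction, prefix_size) operation could be modeled roughly as follows (a forward-only Python sketch; the class name and shape are invented for illustration, tuples stand in for sorted composite index keys, and the backward direction, the hard amcanbackward case, is deliberately omitted):

```python
class ToySkipScan:
    """Toy model of an indexam skip(scan, direction, prefix_size) operation:
    reposition the scan on the next tuple whose first prefix_size columns
    differ from the current one's. Tuples are sorted composite index keys;
    only the forward direction is modeled here."""

    def __init__(self, tuples):
        self.tuples = sorted(tuples)
        self.pos = -1

    def skip(self, prefix_size):
        if self.pos < 0:  # not yet positioned: return the first tuple
            self.pos = 0
            return self.tuples[0] if self.tuples else None
        prefix = self.tuples[self.pos][:prefix_size]
        # binary search for the first tuple strictly beyond the current
        # prefix; a real AM would re-descend the B-tree with a scan key
        lo, hi = self.pos + 1, len(self.tuples)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.tuples[mid][:prefix_size] <= prefix:
                lo = mid + 1
            else:
                hi = mid
        self.pos = lo
        return self.tuples[lo] if lo < len(self.tuples) else None
```

For a two-column index, repeatedly calling skip(1) yields the first tuple of each distinct leading-column group, which is exactly what the DISTINCT case needs; with prefix_size equal to the full key width, the operation degenerates to a plain next-tuple step.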
Hi Bhushan,

On 08/16/2018 01:44 AM, Bhushan Uparkar wrote:
> I was reviewing the index-skip patch example and have a comment on it.

Thanks for your interest in, and feedback on, this patch!

> Example query "select distinct b from t1" is equivalent to "select b from
> t1 group by b". When I tried the 2nd form of the query it came up with a
> different plan; is it possible that index skip scan can address it as well?

Like Thomas commented down-thread, my goal is to keep this contribution as simple as possible in order to get to something that can be committed. Improvements can follow in future CommitFests, which may end up in the same release. However, as stated in my original mail, my goal is the simplest form of Index Skip Scan (or whatever we call it).

I welcome any help with the patch.

Best regards,
Jesper
Hi Thomas,

On 08/16/2018 02:22 AM, Thomas Munro wrote:
> The idea of starting with DISTINCT was just that it's
> comparatively easy. We should certainly try to look ahead and bear
> those features in mind when figuring out the interfaces though. Would
> the indexam skip(scan, direction, prefix_size) operation I proposed be
> sufficient? Is there a better way?

Yeah, I'm hoping that a Committer can provide some feedback on the direction that this patch needs to take.

One thing to consider is the pluggable storage patch, which is a lot more important than this patch. I don't want this patch to get in the way of that work, so it may have to wait a bit in order to see any new potential requirements.

> I'm glad to see this topic come back!

You did the work, and yes, hopefully we can get closer to this subject in 12 :)

Best regards,
Jesper
Greetings,

* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
> On 08/16/2018 02:22 AM, Thomas Munro wrote:
> > The idea of starting with DISTINCT was just that it's
> > comparatively easy. We should certainly try to look ahead and bear
> > those features in mind when figuring out the interfaces though. Would
> > the indexam skip(scan, direction, prefix_size) operation I proposed be
> > sufficient? Is there a better way?
>
> Yeah, I'm hoping that a Committer can provide some feedback on the direction
> that this patch needs to take.

Thomas is one these days. :)

At least on first glance, that indexam seems to make sense to me, but I've not spent a lot of time thinking about it. Might be interesting to ask Peter G about it though.

> One thing to consider is the pluggable storage patch, which is a lot more
> important than this patch. I don't want this patch to get in the way of that
> work, so it may have to wait a bit in order to see any new potential
> requirements.

Not sure where this came from, but I don't think it's particularly good to be suggesting that one feature is more important than another, or that we need to have one wait for another as this seems to imply. I'd certainly really like to see PG finally have skipping scans, for one thing, and it seems like with some effort that might be able to happen for v12. I'm not convinced that we're going to get pluggable storage to happen in v12, and I don't really agree with recommending that people hold off on making improvements to things because it's coming.

Thanks!

Stephen
Attachment
Hi Stephen,

On 08/16/2018 02:36 PM, Stephen Frost wrote:
>> Yeah, I'm hoping that a Committer can provide some feedback on the direction
>> that this patch needs to take.
>
> Thomas is one these days. :)

I know :) However, there are some open questions from Thomas' original submission that still need to be ironed out.

> At least on first glance, that indexam seems to make sense to me, but
> I've not spent a lot of time thinking about it. Might be interesting to
> ask Peter G about it though.

Yes, or Anastasia, who has also done a lot of work on nbtree/.

>> One thing to consider is the pluggable storage patch, which is a lot more
>> important than this patch. I don't want this patch to get in the way of that
>> work, so it may have to wait a bit in order to see any new potential
>> requirements.
>
> Not sure where this came from, but I don't think it's particularly good
> to be suggesting that one feature is more important than another or that
> we need to have one wait for another as this seems to imply.

My point was that I know this patch needs work, so any feedback that gets it closer to a solution will help.

Pluggable storage may or may not add new requirements, but it is up to the people working on that, some of whom are Committers, to take time "off" to provide feedback for this patch in order to steer me in the right direction.

Work can happen in parallel, and I'm definitely not recommending that people hold off on any patches that they want to provide feedback for, or submit for a CommitFest.

Best regards,
Jesper
Greetings,

* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
> On 08/16/2018 02:36 PM, Stephen Frost wrote:
> > Not sure where this came from, but I don't think it's particularly good
> > to be suggesting that one feature is more important than another ...
>
> My point was that I know this patch needs work, so any feedback that gets it
> closer to a solution will help.
>
> Pluggable storage may or may not add new requirements, but it is up to the
> people working on that, some of whom are Committers, to take time "off" to
> provide feedback for this patch in order to steer me in the right direction.

I don't think it's really necessary for this work to suffer under some concern that pluggable storage will make it have to change. Sure, it might, but it also very well might not. For my 2c, anyway, this seems likely to get into the tree before pluggable storage does, and it's pretty unlikely to be the only thing that that work will need to be prepared to address when it happens.

> Work can happen in parallel, and I'm definitely not recommending that people
> hold off on any patches that they want to provide feedback for, or submit
> for a CommitFest.

Yes, work can happen in parallel, and I don't really think there needs to be a concern about some other patch set when it comes to getting this patch committed.

Thanks!

Stephen
Attachment
On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Yeah, there are a few tricks you can do with "index skip scans"
> (Oracle name, or as IBM calls them, "index jump scans"... I was
> slightly tempted to suggest we call ours "index hop scans"...).

Hopscotch scans?

> * groups and certain aggregates (MIN() and MAX() of suffix index
> columns within each group)
> * index scans where the scan key doesn't include the leading columns
> (but you expect there to be sufficiently few values)
> * merge joins (possibly the trickiest and maybe out of range)

FWIW, I suspect that we're going to have the biggest problems in the optimizer. It's not as if ndistinct is in any way reliable. That may matter more on average than it has with other path types.

--
Peter Geoghegan
On August 16, 2018 8:28:45 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>One thing to consider is the pluggable storage patch, which is a lot
>more important than this patch. I don't want this patch to get in the
>way of that work, so it may have to wait a bit in order to see any new
>potential requirements.
Relative to the size of the patch sets, I don't think there would be a meaningful amount of conflict between the two. So I don't think we have to consider relative importance (which I don't think is that easy to assess in this case).
Fwiw, I've a significantly further revised version of the tableam patch that I plan to send in a few days. Ported the current zheap patch as a separate AM which helped weed out a lot of issues.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Aug 17, 2018 at 7:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> * groups and certain aggregates (MIN() and MAX() of suffix index
>> columns within each group)
>> * index scans where the scan key doesn't include the leading columns
>> (but you expect there to be sufficiently few values)
>> * merge joins (possibly the trickiest and maybe out of range)
>
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.

Can you give an example of problematic ndistinct underestimation?

I suppose you might be able to defend against that in the executor: if you find that you've done an unexpectedly high number of skips, you could fall back to regular next-tuple mode. Unfortunately that'd require the parent plan node to tolerate non-unique results.

I noticed that the current patch doesn't care about restrictions on the range (SELECT DISTINCT a FROM t WHERE a BETWEEN 500 AND 600), but that causes it to overestimate the number of btree searches, which is a less serious problem (it might not choose a skip scan when it would have been better).

--
Thomas Munro
http://www.enterprisedb.com
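The fallback idea sketched above, count the skips and abandon skipping once the planner's estimate turns out to be badly wrong, could look something like this toy version (Python, with `bisect` standing in for a B-tree re-descent; the fudge factor and threshold are invented for illustration):

```python
import bisect

def distinct_with_fallback(sorted_keys, expected_ndistinct, fudge=10):
    """Toy model of the fallback: start in skip mode, and once the number
    of skips exceeds the planner's ndistinct estimate by a fudge factor,
    switch to plain next-tuple mode.  The output may then contain
    duplicates, which a parent Unique node would have to remove."""
    out, pos, skips = [], 0, 0
    skipping = True
    n = len(sorted_keys)
    while pos < n:
        value = sorted_keys[pos]
        out.append(value)
        if skipping:
            pos = bisect.bisect_right(sorted_keys, value, pos)
            skips += 1
            if skips > expected_ndistinct * fudge:
                skipping = False  # estimate was badly off; stop re-searching
        else:
            pos += 1  # plain scan; dedup is now the parent's job
    return out
```

This illustrates the tension Thomas notes: the fallback is only safe if the parent node tolerates non-unique output, so the plan shape has to anticipate it.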
Hi Peter,

On 08/16/2018 03:48 PM, Peter Geoghegan wrote:
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.

Thanks for sharing this; it is very useful to know.

Best regards,
Jesper
On Thu, Aug 16, 2018 at 4:10 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Can you give an example of problematic ndistinct underestimation?

Yes. See https://postgr.es/m/CAKuK5J12QokFh88tQz-oJMSiBg2QyjM7K7HLnbYi3Ze+Y5BtWQ@mail.gmail.com, for example. That's a complaint about an underestimation specifically. This seems to come up about once every 3 years, at least from my perspective. I'm always surprised that ndistinct doesn't get implicated in bad query plans more frequently.

> I suppose you might be able to defend against that in the executor: if
> you find that you've done an unexpectedly high number of skips, you
> could fall back to regular next-tuple mode. Unfortunately that'd
> require the parent plan node to tolerate non-unique results.

I like the idea of dynamic fallback in certain situations, but the details always seem complicated.

--
Peter Geoghegan
> On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
> test case, documentation hooks, minor code cleanups, and made the patch
> pass an --enable-cassert make check-world run. So, the overall design is
> the same.

I've looked through the patch more closely, and have a few questions:

* Is there any reason why only the copy function for the IndexOnlyScan node includes an implementation for distinctPrefix? Without the read/out functionality skip doesn't work for parallel scans, so it becomes like that:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Gather  (cost=1000.43..1001.60 rows=3 width=4) (actual time=11.054..17672.010 rows=10000000 loops=1)
   Output: b
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   Buffers: shared hit=91035 read=167369
   ->  Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4) (actual time=1.350..16065.165 rows=10000000 loops=1)
         Output: b
         Heap Fetches: 10000000
         Buffers: shared hit=91035 read=167369
         Worker 0: actual time=1.350..16065.165 rows=10000000 loops=1
           Buffers: shared hit=91035 read=167369
 Planning Time: 0.394 ms
 Execution Time: 6037.800 ms

and with this functionality it gets better:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Gather  (cost=1000.43..1001.60 rows=3 width=4) (actual time=3.564..4.427 rows=3 loops=1)
   Output: b
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   Buffers: shared hit=4 read=10
   ->  Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4) (actual time=0.065..0.133 rows=3 loops=1)
         Output: b
         Heap Fetches: 3
         Buffers: shared hit=4 read=10
         Worker 0: actual time=0.065..0.133 rows=3 loops=1
           Buffers: shared hit=4 read=10
 Planning Time: 1.724 ms
 Execution Time: 4.522 ms

* What is the purpose of HeapFetches? I don't see any usage of this variable except assigning 0 to it once.
Hi Dmitry,

On 9/10/18 5:47 PM, Dmitry Dolgov wrote:
> I've looked through the patch more closely, and have a few questions:

Thanks for your review!

> * Is there any reason why only the copy function for the IndexOnlyScan node
> includes an implementation for distinctPrefix?

Oversight -- thanks for catching that.

> * What is the purpose of HeapFetches? I don't see any usage of this variable
> except assigning 0 to it once.

That can be removed.

New version (WIP v2) attached.

Best regards,
Jesper
Attachment
Hi Jesper,
While testing this patch I noticed that the current implementation doesn't
perform well when we have lots of small groups of equal values. Here is the
execution time of index skip scan vs. unique over index scan, in ms,
depending on the group size. The benchmark script is attached.

group size      skip  unique
         1  2,293.85  132.55
         5    464.40  106.59
        10    239.61  102.02
        50     56.59   98.74
       100     32.56  103.04
       500      6.08   97.09
So, the current implementation can lead to performance regressions, and the
choice of plan depends on the notoriously unreliable ndistinct statistics.
The regression is probably because the skip scan always does _bt_search to
find the next unique tuple.

I think we can improve this, and make the skip scan strictly faster than an
index scan regardless of the data. As a first approximation, imagine that we
somehow skipped equal tuples inside _bt_next instead of sending them to the
parent Unique node. This would already be marginally faster than Unique +
Index Scan. A more practical implementation would be to remember our
position in the tree (that is, the BTStack returned by _bt_search) and use
it to skip pages in bulk. This looks straightforward to implement for a tree
that does not change, but I'm not sure how to make it work with concurrent
modifications. Still, this looks like a worthwhile direction to me, because
if we have a strictly faster skip scan, we can just use it always and not
worry about our unreliable statistics. What do you think?

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
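To make the trade-off in the table above concrete, here is a toy cost model
of the two strategies. This is an illustrative Python sketch, not anything
from the patch or PostgreSQL internals: the page size, fan-out, and the
"one descent costs roughly the tree height" accounting are all assumptions.

```python
# A toy cost model of the regression discussed above. A "unique over index
# scan" reads every leaf page once; a skip scan that re-descends from the
# root for every distinct key pays roughly the tree height per group, which
# loses badly when groups are small. All constants are made up.
import math
from itertools import islice

PAGE_SIZE = 200     # tuples per leaf page (assumed)
FANOUT = 100        # children per internal page (assumed)

def make_leaf_pages(sorted_values):
    """Chunk a sorted sequence of key values into leaf pages."""
    it = iter(sorted_values)
    pages = []
    while True:
        page = list(islice(it, PAGE_SIZE))
        if not page:
            return pages
        pages.append(page)

def full_scan_cost(pages):
    """Unique + index scan: one read per leaf page."""
    return len(pages)

def naive_skip_scan_cost(pages):
    """Skip scan that calls _bt_search for every distinct key: each
    descent reads about height(tree) internal pages plus one leaf."""
    height = max(1, math.ceil(math.log(max(len(pages), 2), FANOUT)))
    n_distinct = len({v for page in pages for v in page})
    return n_distinct * (height + 1)
```

With groups of 2 the naive skip scan touches orders of magnitude more pages
than the plain scan; with groups of 10000 it is far cheaper, matching the
shape of the measurements above.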
Hi Alexander,

On 9/13/18 9:01 AM, Alexander Kuzmenkov wrote:
> While testing this patch

Thanks for the review !

> I noticed that current implementation doesn't
> perform well when we have lots of small groups of equal values. Here is
> the execution time of index skip scan vs unique over index scan, in ms,
> depending on the size of group. The benchmark script is attached.
>
> group size      skip  unique
>          1  2,293.85  132.55
>          5    464.40  106.59
>         10    239.61  102.02
>         50     56.59   98.74
>        100     32.56  103.04
>        500      6.08   97.09

Yes, this doesn't look good. Using your test case I'm seeing that unique
is being chosen when the group size is below 34, and skip above. This is
with the standard initdb configuration; did you change something else ?
Or did you force the default plan ?

> So, the current implementation can lead to performance regression, and
> the choice of the plan depends on the notoriously unreliable ndistinct
> statistics.

Yes, Peter mentioned this, which I'm still looking at.

> The regression is probably because skip scan always does
> _bt_search to find the next unique tuple.

Very likely.

> I think we can improve this, and the skip scan can be strictly faster
> than index scan regardless of the data. As a first approximation,
> imagine that we somehow skipped equal tuples inside _bt_next instead of
> sending them to the parent Unique node. This would already be marginally
> faster than Unique + Index scan. A more practical implementation would
> be to remember our position in tree (that is, BTStack returned by
> _bt_search) and use it to skip pages in bulk. This looks straightforward
> to implement for a tree that does not change, but I'm not sure how to
> make it work with concurrent modifications. Still, this looks a
> worthwhile direction to me, because if we have a strictly faster skip
> scan, we can just use it always and not worry about our unreliable
> statistics. What do you think?

This is something to look at -- maybe there is a way to use
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
have the scan key in BTScanOpaqueData. I'll take a look after my upcoming
vacation; feel free to contribute those changes in the meantime of course.

Thanks again !

Best regards,
Jesper
On 13/09/18 at 18:39, Jesper Pedersen wrote:
>
> Yes, this doesn't look good. Using your test case I'm seeing that
> unique is being chosen when the group size is below 34, and skip
> above. This is with the standard initdb configuration; did you change
> something else ? Or did you force the default plan ?

Sorry I didn't mention this, the first column is indeed forced skip scan,
just to see how it compares to index scan.

> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we
> just have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the
> meantime of course.

I probably won't be able to contribute the changes, but I'll try to
review them.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
>
> On 13/09/18 at 18:39, Jesper Pedersen wrote:
>
>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

But having this logic inside _bt_next means that it will make a non-skip
index only scan a bit slower, am I right? Probably it would be easier and
more straightforward to go with the idea of a dynamic fallback then. The
first naive implementation that I came up with is to wrap the index scan
node in a Unique, and remember the estimated number of groups in
IndexOnlyScanState, so that we can check whether we performed many more
skips than expected. With this approach an index skip scan will work a bit
slower than in the original patch when ndistinct is correct (because the
Unique node will recheck the rows we returned), and fall back to unique +
index only scan when the planner has underestimated ndistinct.
Attachment
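The fallback idea above can be sketched in miniature. This is a hedged
Python illustration, not executor code: `planned_groups` stands in for the
planner's group estimate, the fudge factor is an assumption, and a real
executor would switch strategies rather than just set a flag.

```python
# A minimal sketch of the dynamic fallback: emit distinct values as a skip
# scan would, and if the number of groups seen exceeds the planner's
# estimate by some fudge factor, record that a fallback to Unique over a
# plain index only scan would fire. All names here are illustrative.
def distinct_with_fallback(sorted_values, planned_groups, fudge=10):
    out = []
    fell_back = False
    prev = object()                      # sentinel, unequal to any value
    for v in sorted_values:
        if v != prev:
            out.append(v)
            prev = v
            if not fell_back and len(out) > fudge * planned_groups:
                fell_back = True         # a real executor switches here
    return out, fell_back
```

When ndistinct is roughly right the flag never trips; when the planner has
badly underestimated it, the flag trips early and the cheap strategy takes
over for the rest of the scan.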
Hi Dmitry,

On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
>> On 13/09/18 at 18:39, Jesper Pedersen wrote:
>>> I think we can improve this,
>>> and the skip scan can be strictly faster than index scan regardless of
>>> the data. As a first approximation, imagine that we somehow skipped
>>> equal tuples inside _bt_next instead of sending them to the parent
>>> Unique node. This would already be marginally faster than Unique + Index
>>> scan. A more practical implementation would be to remember our position
>>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>>> pages in bulk. This looks straightforward to implement for a tree that
>>> does not change, but I'm not sure how to make it work with concurrent
>>> modifications. Still, this looks a worthwhile direction to me, because
>>> if we have a strictly faster skip scan, we can just use it always and
>>> not worry about our unreliable statistics. What do you think?
>>
>> This is something to look at -- maybe there is a way to use
>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
>> have the scan key in BTScanOpaqueData. I'll take a look after my
>> upcoming vacation; feel free to contribute those changes in the meantime
>> of course.
>
> But having this logic inside _bt_next means that it will make a non-skip
> index only scan a bit slower, am I right?

Correct.

> Probably it would be easier and more
> straightforward to go with the idea of dynamic fallback then. The first
> naive implementation that I came up with is to wrap an index scan node
> into a unique, and remember estimated number of groups into
> IndexOnlyScanState, so that we can check if we performed much more skips
> than expected. With this approach index skip scan will work a bit slower
> than in the original patch in case if ndistinct is correct (because a
> unique node will recheck rows we returned), and fallback to unique +
> index only scan in case if planner has underestimated ndistinct.

I think we need a comment on this in the patch, as 10 *
node->ioss_PlanRows looks a bit random.

Thanks for your contribution !

Best regards,
Jesper
> On Thu, 27 Sep 2018 at 15:59, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> I think we need a comment on this in the patch, as 10 *
> node->ioss_PlanRows looks a bit random.

Yeah, you're right, it's a somewhat arbitrary number - we just need to make
sure that this estimation is not too small (to avoid false positives), but
also not too big (so as not to miss the proper point for fallback). I left
it uncommented mostly because I wanted to get some feedback on it first,
and probably some suggestions about how to make this estimation better.
Hi

I tested the last patch and I have some notes:

1.

postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
|                                         QUERY PLAN                                         |
+-------------------------------------------------------------------------------------------+
| Unique (cost=0.43..4367.56 rows=9983 width=4)                                              |
|  -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4)     |
+-------------------------------------------------------------------------------------------+
(2 rows)

In this case the Unique node is useless and can be removed

2. COUNT(DISTINCT) support would be nice, similar to the MIN, MAX support

3. One time the patched postgres crashed, but I am not able to reproduce it.

Looks like a very interesting patch, and important for some BI platforms
Hi Pavel,

On 10/9/18 9:42 AM, Pavel Stehule wrote:
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> |                                         QUERY PLAN                                         |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4)                                              |
> |  -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4)     |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed
>
> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
>
> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Please, send that query through if you can replicate it. The patch
currently passes an assert'ed check-world, so your query clearly triggered
something that isn't covered yet.

> Looks like very interesting patch, and important for some BI platforms

Thanks for your review !

Best regards,
Jesper
> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> |                                         QUERY PLAN                                         |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4)                                              |
> |  -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4)     |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed

Just to clarify, which version exactly were you testing? If
index-skip-fallback.patch, then the Unique node was added there to address
the situation when ndistinct is underestimated, with the idea of falling
back to the original plan (and to tolerate that I suggested using Unique,
since we don't know during planning whether the fallback will happen or
not).

> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

Yep, as far as I understand MIN/MAX is going to be the next step after
this patch is accepted.

> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some ideas what could cause it, or a possible way
to reproduce it that doesn't work anymore?
On Tue, 9 Oct 2018 at 15:59, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>> Hi
>>
>> I tested last patch and I have some notes:
>>
>> 1.
>>
>> postgres=# explain select distinct a10000 from foo;
>> +-------------------------------------------------------------------------------------------+
>> | QUERY PLAN |
>> +-------------------------------------------------------------------------------------------+
>> | Unique (cost=0.43..4367.56 rows=9983 width=4) |
>> | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
>> +-------------------------------------------------------------------------------------------+
>> (2 rows)
>>
>> In this case Unique node is useless and can be removed
>
> Just to clarify which exactly version were you testing? If
> index-skip-fallback.patch,
> then the Unique node was added there to address the situation when
> ndistinct is underestimated, with an idea to fallback to original plan
> (and to tolerate that I suggested to use Unique, since we don't know
> if fallback will happen or not during the planning).

I tested index-skip-fallback.patch.

It looks like a good idea, but then the node should be named "index scan"
and the other info can be displayed in the detail parts. It can be similar
to "sort". The combination of unique and index skip scan looks strange :)

>> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
>
> Yep, as far as I understand MIN/MAX is going to be the next step after this
> patch will be accepted.

ok

Now, the development cycle is starting - maybe it can use the same
infrastructure as MIN/MAX and this part can be short. More so if you use a
dynamic index scan.

>> 3. Once time patched postgres crashed, but I am not able to reproduce it.
>
> Maybe you have at least some ideas what could cause that or what's the possible
> way to reproduce that doesn't work anymore?

I think it was a query like

select count(*) from (select distinct x from tab) s
> On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> It looks like good idea, but then the node should be named "index scan" and
>> other info can be displayed in detail parts. It can be similar like "sort".
>> The combination of unique and index skip scan looks strange :)
>
> maybe we don't need special index skip scan node - maybe possibility to
> return unique values from index scan node can be good enough - some like
> "distinct index scan" - and the implementation there can be dynamic - skip
> scan, classic index scan,
>
> "index skip scan" is not good name if the implementaion is dynamic.

Yeah, that's a valid point. The good part is that index skip scan is not
really a separate node, but just an enhanced index only scan node. So
indeed maybe it would be better to call it Index Only Scan, but show in
the details that we apply the skip scan strategy. Any other opinions
about this?

>> I think it was query like
>> select count(*) from (select distinct x from tab) s

Thanks, I'll take a look.
On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> > I noticed that current implementation doesn't
> > perform well when we have lots of small groups of equal values. Here is
> > the execution time of index skip scan vs unique over index scan, in ms,
> > depending on the size of group. The benchmark script is attached.
> >
> > group size      skip  unique
> >          1  2,293.85  132.55
> >          5    464.40  106.59
> >         10    239.61  102.02
> >         50     56.59   98.74
> >        100     32.56  103.04
> >        500      6.08   97.09
>
> Yes, this doesn't look good. Using your test case I'm seeing that unique
> is being chosen when the group size is below 34, and skip above.

I'm not sure exactly how the current patch is approaching the problem,
but it seems like it might be reasonable to do something like -- look
for a distinct item within the current page; if not found, then search
from the root of the tree for the next item > the current item.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
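The hybrid being suggested — try the current page first, only re-descend
when the next distinct key isn't there — can be modeled with a toy leaf
level. This is a Python illustration, not nbtree code; the page layout and
the "descent" modeled as a binary search over per-page high keys are
assumptions.

```python
# Toy model of the hybrid skip scan: look for the next distinct key on the
# current leaf page; only when it isn't there, "descend from the root",
# modeled here as a binary search over each page's high key.
import bisect

def distinct_keys(pages):
    """pages: sorted values chunked into sorted leaf pages.
    Returns (list of distinct values, number of root descents)."""
    distinct, descents = [], 0
    highs = [p[-1] for p in pages]       # high key of each page
    page_idx, off = 0, 0
    while page_idx < len(pages):
        page = pages[page_idx]
        cur = page[off]
        distinct.append(cur)
        nxt = bisect.bisect_right(page, cur, off)
        if nxt < len(page):              # next group starts on this page
            off = nxt
            continue
        descents += 1                    # not here: search from the root
        page_idx = bisect.bisect_right(highs, cur, page_idx)
        if page_idx == len(pages):
            break
        off = bisect.bisect_right(pages[page_idx], cur)
    return distinct, descents
```

With several distinct groups per page the number of descents stays well
below the number of groups, which is exactly the regime where the measured
regression occurred.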
> On Fri, 12 Oct 2018 at 19:44, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > > I noticed that current implementation doesn't
> > > perform well when we have lots of small groups of equal values. Here is
> > > the execution time of index skip scan vs unique over index scan, in ms,
> > > depending on the size of group. The benchmark script is attached.
> > >
> > > group size      skip  unique
> > >          1  2,293.85  132.55
> > >          5    464.40  106.59
> > >         10    239.61  102.02
> > >         50     56.59   98.74
> > >        100     32.56  103.04
> > >        500      6.08   97.09
> >
> > Yes, this doesn't look good. Using your test case I'm seeing that unique
> > is being chosen when the group size is below 34, and skip above.
>
> I'm not sure exactly how the current patch is approaching the problem,
> but it seems like it might be reasonable to do something like -- look
> for a distinct item within the current page; if not found, then search
> from the root of the tree for the next item > the current item.

I'm not sure that I understand it correctly, can you elaborate please?
From what I see it's quite similar to what's already implemented - we look
for a distinct item on the page, and then search the index tree for the
next distinct item.

> On Wed, 10 Oct 2018 at 17:34, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >
> >> It looks like good idea, but then the node should be named "index scan" and
> >> other info can be displayed in detail parts. It can be similar like "sort".
> >> The combination of unique and index skip scan looks strange :)
> >
> > maybe we don't need special index skip scan node - maybe possibility to
> > return unique values from index scan node can be good enough - some like
> > "distinct index scan" - and the implementation there can be dynamic - skip
> > scan, classic index scan,
> >
> > "index skip scan" is not good name if the implementaion is dynamic.
>
> Yeah, that's a valid point. The good part is that index skip scan is not
> really a separate node, but just enhanced index only scan node. So indeed
> maybe it would be better to call it Index Only Scan, but show in details
> that we apply the skip scan strategy. Any other opinions about this?

To make it more clear what I mean, see the attached version of the patch.

>> I think it was query like
>> select count(*) from (select distinct x from tab) s

Thanks, I'll take a look.

I couldn't reproduce it either yet.
Attachment
Hello

The last published patch, index-skip-fallback-v2, applies and builds
cleanly.

I found a reproducible crash due to an assert failure:
FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)

> create table tablename (i int primary key);
> select distinct i from tablename where i = 1;

The query is obviously strange, but this is a bug.

Also I noticed two TODOs in the documentation.

regards, Sergei
> On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
>
> I found reproducible crash due assert failure:
> FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > create table tablename (i int primary key);
> > select distinct i from tablename where i = 1;
> Query is obviously strange, but this is bug.

Wow, thanks a lot! I can reproduce it too, will fix it.
> On Mon, 12 Nov 2018 at 13:55, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
> >
> > I found reproducible crash due assert failure:
> > FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > > create table tablename (i int primary key);
> > > select distinct i from tablename where i = 1;
> > Query is obviously strange, but this is bug.
>
> Wow, thanks a lot! I can reproduce it too, will fix it.

Yep, we had to check the number of distinct columns too; here is the fixed
patch (with a bit more verbose commit message).
Attachment
On 9/27/18 16:59, Jesper Pedersen wrote:
> Hi Dmitry,
>
> On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov
>>> <a.kuzmenkov@postgrespro.ru> wrote:
>>> On 13/09/18 at 18:39, Jesper Pedersen wrote:
>>>> I think we can improve this,
>>>> and the skip scan can be strictly faster than index scan regardless of
>>>> the data.
>>>> <...>
>>>
>>> This is something to look at -- maybe there is a way to use
>>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we
>>> just have the scan key in BTScanOpaqueData. I'll take a look after my
>>> upcoming vacation; feel free to contribute those changes in the
>>> meantime of course.
>>
>> But having this logic inside _bt_next means that it will make a
>> non-skip index only scan a bit slower, am I right?
>
> Correct.

Well, it depends on how the skip scan is implemented. We don't have to
make normal scans slower, because skip scan is just a separate thing.

My main point was that the current implementation is good as a proof of
concept, but it is inefficient for some data and needs some unreliable
planner logic to work around this inefficiency. And now we also have an
execution-time fallback because the planning-time fallback isn't good
enough. This looks overly complicated. Let's try to design an algorithm
that works well regardless of the particular data and doesn't need these
heuristics. It should be possible to do so for btree.

Of course, I understand the reluctance to implement an entire new type of
btree traversal. Anastasia Lubennikova suggested a tweak for the current
method that may improve the performance for small groups of equal values.
When we search for the next unique key, first check if it is contained on
the current btree page using its 'high key'. If it is, find it on the
current page. If not, search from the root of the tree like we do now.
This should improve the performance for small equal groups, because there
are going to be several such groups on the page. And this is exactly where
we have the regression compared to unique + index scan.

By the way, what is the data for which we intend this feature to work?
Obviously a non-unique btree index, but how wide are the tuples, and how
big are the equal groups? It would be good to have some real-world
examples.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Is skip scan only possible for index-only scan?

I haven't seen much discussion of this question yet. Is there a particular
reason to lock ourselves into thinking about this only in an index only
scan?

>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

It seems to me also that the logic necessary for this kind of traversal
has other useful applications. For example, it should be possible to build
on that logic to allow an index like t(owner_fk, created_at) to be used to
execute the following query:

select * from t where owner_fk in (1,2,3) order by created_at limit 25

without needing to fetch all tuples satisfying "owner_fk in (1,2,3)" and
subsequently sorting them.
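The traversal described here can be illustrated with a toy model of a
(owner_fk, created_at) index: each IN-list element corresponds to one
contiguous, created_at-ordered run in the index, and the runs can be merged
lazily and cut off at the LIMIT. This is a hedged Python sketch; the
in-memory "index" and all names are assumptions, not anything in the patch.

```python
# Toy illustration: "owner_fk IN (...) ORDER BY created_at LIMIT n" served
# by lazily k-way merging one ordered run per owner, instead of fetching
# all matching tuples and sorting them.
import heapq
from bisect import bisect_left, bisect_right
from itertools import islice

def in_list_ordered_limit(index_entries, owners, limit):
    """index_entries: list of (owner_fk, created_at) tuples in index order,
    i.e. sorted by owner_fk then created_at. Returns the first `limit`
    entries among the given owners, ordered by created_at."""
    runs = []
    for o in owners:
        lo = bisect_left(index_entries, (o,))                # run start
        hi = bisect_right(index_entries, (o, float('inf')))  # run end
        runs.append(index_entries[lo:hi])
    merged = heapq.merge(*runs, key=lambda e: e[1])          # lazy merge
    return list(islice(merged, limit))
```

Because `heapq.merge` is lazy and `islice` stops at the limit, only about
`limit` entries per run ever need to be examined.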
Hi,

On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
>>> But having this logic inside _bt_next means that it will make a
>>> non-skip index only scan a bit slower, am I right?
>>
>> Correct.
>
> Well, it depends on how the skip scan is implemented. We don't have to
> make normal scans slower, because skip scan is just a separate thing.
>
> My main point was that current implementation is good as a proof of
> concept, but it is inefficient for some data and needs some unreliable
> planner logic to work around this inefficiency. And now we also have
> execution-time fallback because planning-time fallback isn't good
> enough. This looks overly complicated. Let's try to design an algorithm
> that works well regardless of the particular data and doesn't need these
> heuristics. It should be possible to do so for btree.
>
> Of course, I understand the reluctance to implement an entire new type
> of btree traversal. Anastasia Lubennikova suggested a tweak for the
> current method that may improve the performance for small groups of
> equal values. When we search for the next unique key, first check if it
> is contained on the current btree page using its 'high key'. If it is,
> find it on the current page. If not, search from the root of the tree
> like we do now. This should improve the performance for small equal
> groups, because there are going to be several such groups on the page.
> And this is exactly where we have the regression compared to unique +
> index scan.

Robert suggested something similar in [1]. I'll try and look at that when
I'm back from my holiday.

> By the way, what is the data for which we intend this feature to work?
> Obviously a non-unique btree index, but how wide are the tuples, and how
> big the equal groups? It would be good to have some real-world examples.

Although my primary use-case is int, I agree that we should test the
different data types, and tuple widths.

[1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com

Best regards,
Jesper
> On Fri, 16 Nov 2018 at 16:06, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
> > Well, it depends on how the skip scan is implemented. We don't have to
> > make normal scans slower, because skip scan is just a separate thing.
> >
> > My main point was that current implementation is good as a proof of
> > concept, but it is inefficient for some data and needs some unreliable
> > planner logic to work around this inefficiency. And now we also have
> > execution-time fallback because planning-time fallback isn't good
> > enough. This looks overly complicated. Let's try to design an algorithm
> > that works well regardless of the particular data and doesn't need these
> > heuristics. It should be possible to do so for btree.
> >
> > Of course, I understand the reluctance to implement an entire new type
> > of btree traversal. Anastasia Lubennikova suggested a tweak for the
> > current method that may improve the performance for small groups of
> > equal values. When we search for the next unique key, first check if it
> > is contained on the current btree page using its 'high key'. If it is,
> > find it on the current page. If not, search from the root of the tree
> > like we do now. This should improve the performance for small equal
> > groups, because there are going to be several such groups on the page.
> > And this is exactly where we have the regression compared to unique +
> > index scan.
>
> Robert suggested something similar in [1]. I'll try and look at that
> when I'm back from my holiday.

Yeah, probably you're right. Unfortunately, I had misunderstood Robert's
previous message in this thread with the similar approach. Jesper, I hope
you don't mind if I post an updated patch? _bt_skip is changed there in
the suggested way, so that it checks the current page before searching
from the root of the tree, and I've removed the fallback logic. After some
initial tests I see that with this version a skip scan over a table with
10^7 rows and 10^6 distinct values is slightly slower than a regular scan,
but not that much.

> > By the way, what is the data for which we intend this feature to work?
> > Obviously a non-unique btree index, but how wide are the tuples, and how
> > big the equal groups? It would be good to have some real-world examples.
>
> Although my primary use-case is int I agree that we should test the
> different data types, and tuple widths.

My personal motivation here is exactly that we face use-cases for skip
scan from time to time. Usually it's quite few distinct values (up to a
dozen or so, which means that the equal groups are quite big), but with a
variety of types and widths.

> On Thu, 15 Nov 2018 at 15:28, James Coleman <jtc331@gmail.com> wrote:
>
> > Is skip scan only possible for index-only scan?
>
> I haven't see much discussion of this question yet. Is there a
> particular reason to lock ourselves into thinking about this only in
> an index only scan?

I guess the only reason is to limit the scope of the patch.
Attachment
On 11/18/18 02:27, Dmitry Dolgov wrote:
>
> [0001-Index-skip-scan-v4.patch]

I ran a couple of tests on this, please see the cases below. As before,
I'm setting total_cost = 1 for index skip scan so that it is chosen.

Case 1 breaks because we determine the high key incorrectly; it is the
second tuple on the page or something like that, not the last tuple.

Case 2 is a backwards scan, and I don't understand how it is supposed to
work. We call _bt_search(nextKey = ScanDirectionIsForward), so it seems
that it just fetches the previous tuple like the regular scan does.

case 1:

# create table t as select generate_series(1, 1000000) a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t) d;
                               QUERY PLAN
-------------------------------------------------------------------------
 Aggregate  (cost=3.50..3.51 rows=1 width=8)
   ->  Index Only Scan using ta on t  (cost=0.42..1.00 rows=200 width=4)
         Scan mode: Skip scan
(3 rows)

postgres=# select count(*) from (select distinct a from t) d;
 count
--------
 500000   -- should be 1kk
(1 row)

case 2:

# create table t as select generate_series(1, 1000000) / 2 a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t order by a desc) d;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=5980.81..5980.82 rows=1 width=8)
   ->  Index Only Scan Backward using ta on t  (cost=0.42..1.00 rows=478385 width=4)
         Scan mode: Skip scan
(3 rows)

# select count(*) from (select distinct a from t order by a desc) d;
 count
--------
 502733   -- should be 500k
(1 row)

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote: > > On 11/18/18 02:27, Dmitry Dolgov wrote: > > > > [0001-Index-skip-scan-v4.patch] > > I ran a couple of tests on this, please see the cases below. As before, > I'm setting total_cost = 1 for index skip scan so that it is chosen. > Case 1 breaks because we determine the high key incorrectly, it is the > second tuple on page or something like that, not the last tuple. From what I see it wasn't about the high key, just a regular off-by-one error. But anyway, thanks for noticing - for some reason it wasn't always reproducible for me, so I missed this issue. Please find the fixed patch attached. Also I think it invalidates my previous performance tests, so I would appreciate it if you could check it out too. > Case 2 > is backwards scan, I don't understand how it is supposed to work. We > call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it > just fetches the previous tuple like the regular scan does. Well, no, it's called with ScanDirectionIsForward(dir). But as far as I remember from the previous discussions the entire topic of backward scan is questionable for this patch, so I'll try to invest some time in it.
On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Well, no, it's called with ScanDirectionIsForward(dir). But as far as I > remember from the previous discussions the entire topic of backward scan is > questionable for this patch, so I'll try to invest some time in it. Another thing that I think is related to skip scans that you should be aware of is dynamic prefix truncation, which I started a thread on just now [1]. While I see one big problem with the POC patch I came up with, I think that that optimization is going to be something that ends up happening at some point. Repeatedly descending a B-Tree when the leading column is very low cardinality can be made quite a lot less expensive by dynamic prefix truncation. Actually, it's almost a perfect case for it. I'm not asking anybody to do anything with that information. "Big picture" thinking seems particularly valuable when working on the B-Tree code; I don't want anybody to miss a possible future opportunity. [1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com -- Peter Geoghegan
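To make the dynamic prefix truncation idea concrete, here is a hypothetical toy model (not Peter's POC patch): once the caller knows that every key in the search range agrees with the search key on some leading attributes, comparisons during the descent can start past that shared prefix:

```python
def cmp_suffix(a, b, skip):
    """Compare two composite keys attribute by attribute, starting at
    position `skip`: the first `skip` attributes are known equal and
    are not re-compared."""
    for i in range(skip, len(a)):
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
    return 0

def search_with_prefix_skip(page, key, known_prefix):
    """Find the first position >= key in a sorted page of composite
    keys, given that every tuple on the page is known to agree with
    `key` on the first `known_prefix` attributes (e.g. a very
    low-cardinality leading column)."""
    lo, hi = 0, len(page)
    while lo < hi:
        mid = (lo + hi) // 2
        if cmp_suffix(page[mid], key, known_prefix) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

For a skip scan that repeatedly re-descends with the same low-cardinality leading value, most comparisons then resolve on the later attributes only, which is why Peter calls it almost a perfect case.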
> On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote: > > > > On 11/18/18 02:27, Dmitry Dolgov wrote: > > > > > > [0001-Index-skip-scan-v4.patch] > > > > I ran a couple of tests on this, please see the cases below. As before, > > I'm setting total_cost = 1 for index skip scan so that it is chosen. > > Case 1 breaks because we determine the high key incorrectly, it is the > > second tuple on page or something like that, not the last tuple. > > From what I see it wasn't about the high key, just a regular off-by-one error. > But anyway, thanks for noticing - for some reason it wasn't always > reproducible for me, so I missed this issue. Please find the fixed patch attached. > Also I think it invalidates my previous performance tests, so I would > appreciate it if you could check it out too. I've performed some testing, and on my environment with a dataset of 10^7 records: * everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip scan. * above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique records it was about 20% slower than the regular index scan. For me these numbers sound good, since even in the quite extreme case of approximately 10 records per group the performance of index skip scan is close to that of the regular index only scan. > On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Well, no, it's called with ScanDirectionIsForward(dir). But as far as I > > remember from the previous discussions the entire topic of backward scan is > > questionable for this patch, so I'll try to invest some time in it. > > Another thing that I think is related to skip scans that you should be > aware of is dynamic prefix truncation, which I started a thread on > just now [1]. 
While I see one big problem with the POC patch I came up > with, I think that that optimization is going to be something that > ends up happening at some point. Repeatedly descending a B-Tree when > the leading column is very low cardinality can be made quite a lot > less expensive by dynamic prefix truncation. Actually, it's almost a > perfect case for it. Thanks, sounds cool. I'll try it out as soon as I have some spare time.
> On Thu, Dec 20, 2018 at 2:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > I've performed some testing, and on my environment with a dataset of 10^7 > records: > > * everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip > scan. > > * above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique > records it was about 20% slower than the regular index scan. > > For me these numbers sound good, since even in quite extreme case of > approximately 10 records per group the performance of index skip scan is close > to the same for the regular index only scan. Rebased version after rd_amroutine was renamed.
> On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Rebased version after rd_amroutine was renamed. And one more to fix the documentation. Also, I've noticed a few TODOs in the patch about missing docs, and replaced them with the required explanation of the feature.
> On Sun, Jan 27, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > Rebased version after rd_amroutine was renamed. > > And one more to fix the documentation. Also, I've noticed a few TODOs in the patch > about missing docs, and replaced them with the required explanation of the > feature. A bit of adjustment after nodes/relation -> nodes/pathnodes.
Hello. At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com> > A bit of adjustment after nodes/relation -> nodes/pathnodes. I had a look on this. The name "index skip scan" is a different feature from the feature with that name in other products, which means "index scan with a postfix key (mainly of a multi-column key) that scans ignoring the prefix part". As Thomas suggested, I'd suggest that we call it "index hop scan". (I can accept Hopscotch, either :p) Also, as mentioned upthread by Peter Geoghegan, this could easily give a worse plan by underestimation. So I also suggest that this have a dynamic fallback function. From that perspective it is not suitable as an AM API level feature. If all leaf pages are in the buffer and the average hopping distance is less than (expectedly) 4 pages (the average height of the tree), the skip scan will lose. If almost all leaf pages are staying on disk, we could win even with a 2-page step (skipping over one page). ===== As I was writing the above, I came to think that it's better to implement this as a pure executor optimization. Specifically, let _bt_steppage count the ratio of skipped pages so far; then, if the ratio exceeds some threshold (maybe around 3/4), it gets into hopscotching mode, where it uses an index search to find the next page (rather than traversing). As mentioned above, I think using skip scan to go beyond the next page is a good bet. If the success ratio of skip scans goes below some threshold (I'm not sure which for now), we should fall back to traversing. Any opinions? ==== Some comments on the patch below. + skip scan approach, which is based on the idea of + <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan"> + Loose index scan</ulink>. 
Rather than scanning all equal values of a key, + as soon as a new value is found, it will search for a larger value on the I'm not sure it is a good thing to put a pointer to a rather unstable source in the documentation. This adds a new AM method, but it seems available only for ordered indexes, specifically btree. And it seems that the feature can be implemented in btgettuple, since btskip apparently does the same thing. (I agree with Robert on this point in [1]). [1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com Related to the above, it seems better that the path generation of skip scan be a part of index scan. Whether to skip scan or not is a matter of the index scan itself. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
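The dynamic fallback being proposed can be sketched in miniature. The toy Python below works over a sorted list and uses a made-up group-length threshold as a stand-in for the skipped-pages ratio in _bt_steppage; it only illustrates the shape of switching between "hopscotch" and traversal modes, not the actual nbtree mechanics:

```python
import bisect

def adaptive_distinct(index, threshold=4):
    """Emit each distinct value of a sorted list, stepping entry by
    entry while equal groups are short, but switching to
    binary-search skipping once the previous group was longer than
    `threshold` (hypothetical heuristic, for illustration only)."""
    result, pos = [], 0
    last_group = threshold + 1          # start optimistic: try skipping
    while pos < len(index):
        val = index[pos]
        result.append(val)
        if last_group > threshold:      # "hopscotch" mode: re-search
            nxt = bisect.bisect_right(index, val, pos)
        else:                           # fallback: plain traversal
            nxt = pos
            while nxt < len(index) and index[nxt] == val:
                nxt += 1
        last_group = nxt - pos
        pos = nxt
    return result
```

Either mode alone gives the same answer; the heuristic only decides which is cheaper for the group sizes actually encountered, which is the point of Horiguchi's suggestion.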
Hi, On 1/31/19 1:31 AM, Kyotaro HORIGUCHI wrote: > At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com> >> A bit of adjustment after nodes/relation -> nodes/pathnodes. > > I had a look on this. > > The name "index skip scan" is a different feature from the > feature with that name in other products, which means "index scan > with a postfix key (mainly of a multi-column key) that scans > ignoring the prefix part". As Thomas suggested, I'd suggest that > we call it "index hop scan". (I can accept Hopscotch, either :p) > > Also, as mentioned upthread by Peter Geoghegan, this could easily > give a worse plan by underestimation. So I also suggest that this > have a dynamic fallback function. From that perspective it is not > suitable as an AM API level feature. > > If all leaf pages are in the buffer and the average hopping > distance is less than (expectedly) 4 pages (the average height of > the tree), the skip scan will lose. If almost all leaf pages are > staying on disk, we could win even with a 2-page step (skipping over > one page). > > ===== > As I was writing the above, I came to think that it's better to > implement this as a pure executor optimization. > > Specifically, let _bt_steppage count the ratio of skipped pages > so far; then, if the ratio exceeds some threshold (maybe around > 3/4), it gets into hopscotching mode, where it uses an index search to > find the next page (rather than traversing). As mentioned above, > I think using skip scan to go beyond the next page is a good > bet. If the success ratio of skip scans goes below some > threshold (I'm not sure which for now), we should fall back to > traversing. > > Any opinions? > > ==== > > Some comments on the patch below. > > + skip scan approach, which is based on the idea of > + <ulink url="https://wiki.postgresql.org/wiki/Loose_indexscan"> > + Loose index scan</ulink>. Rather than scanning all equal values of a key, > + as soon as a new value is found, it will search for a larger value on the > > I'm not sure it is a good thing to put a pointer to a rather > unstable source in the documentation. > > > This adds a new AM method, but it seems available only for ordered > indexes, specifically btree. And it seems that the feature can be > implemented in btgettuple, since btskip apparently does the same > thing. (I agree with Robert on this point in [1]). > > [1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com > > > Related to the above, it seems better that the path generation of > skip scan be a part of index scan. Whether to skip scan or not is a > matter of the index scan itself. > Thanks for your valuable feedback! And calling it "Loose index scan" or something else may be better. Dmitry and I will look at this and take it into account for the next version. For now, I have switched the CF entry to WoA. Thanks again! Best regards, Jesper
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Also, as mentioned upthread by Peter Geoghegan, this could easily > give a worse plan by underestimation. So I also suggest that this > have a dynamic fallback function. From that perspective it is not > suitable as an AM API level feature. > > If all leaf pages are in the buffer and the average hopping > distance is less than (expectedly) 4 pages (the average height of > the tree), the skip scan will lose. If almost all leaf pages are > staying on disk, we could win even with a 2-page step (skipping over > one page). > > ===== > As I was writing the above, I came to think that it's better to > implement this as a pure executor optimization. > > Specifically, let _bt_steppage count the ratio of skipped pages > so far; then, if the ratio exceeds some threshold (maybe around > 3/4), it gets into hopscotching mode, where it uses an index search to > find the next page (rather than traversing). As mentioned above, > I think using skip scan to go beyond the next page is a good > bet. If the success ratio of skip scans goes below some > threshold (I'm not sure which for now), we should fall back to > traversing. > > Any opinions? Hi! I'd like to offer a counterpoint: in cases where this is a huge win we definitely do want this to affect cost estimation, because if it's purely an executor optimization the index scan path may not be chosen even when skip scanning would be a dramatic improvement. I suppose that both requirements could be met by incorporating it into the existing index scanning code and also modifying the costing to (only when we have high confidence?) account for the optimization. I'm not sure if that makes things better than the current state of the patch or not. James Coleman
On 2019-02-01 16:04:58 -0500, James Coleman wrote: > On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Also, as mentioned upthread by Peter Geoghegan, this could easily > > give a worse plan by underestimation. So I also suggest that this > > have a dynamic fallback function. From that perspective it is not > > suitable as an AM API level feature. > > > > If all leaf pages are in the buffer and the average hopping > > distance is less than (expectedly) 4 pages (the average height of > > the tree), the skip scan will lose. If almost all leaf pages are > > staying on disk, we could win even with a 2-page step (skipping over > > one page). > > > > ===== > > As I was writing the above, I came to think that it's better to > > implement this as a pure executor optimization. > > > > Specifically, let _bt_steppage count the ratio of skipped pages > > so far; then, if the ratio exceeds some threshold (maybe around > > 3/4), it gets into hopscotching mode, where it uses an index search to > > find the next page (rather than traversing). As mentioned above, > > I think using skip scan to go beyond the next page is a good > > bet. If the success ratio of skip scans goes below some > > threshold (I'm not sure which for now), we should fall back to > > traversing. > > > > Any opinions? > > I'd like to offer a counterpoint: in cases where this is a huge win we > definitely do want this to affect cost estimation, because if it's > purely an executor optimization the index scan path may not be chosen > even when skip scanning would be a dramatic improvement. +many. Greetings, Andres Freund
> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > > Dmitry and I will look at this and take it into account for the next > version. In the meantime, just to not forget, I'm going to post another version with a fix for cursor fetch backwards, which was crashing before. And talking about this topic I wanted to ask to clarify a few points, since it looks like I'm missing something: One of the not-yet-addressed points in this patch is amcanbackward. From the historical thread, mentioned in the first email: > On 2016-11-25 at 01:33 Robert Haas <robertmhaas@gmail.com> wrote: > + if (ScanDirectionIsForward(dir)) > + { > + so->currPos.moreLeft = false; > + so->currPos.moreRight = true; > + } > + else > + { > + so->currPos.moreLeft = true; > + so->currPos.moreRight = false; > + } > > > > The lack of comments makes it hard for me to understand what the > motivation for this is, but I bet it's wrong. Suppose there's a > cursor involved here and the user tries to back up. Instead of having > a separate amskip operation, maybe there should be a flag attached to > a scan indicating whether it should return only distinct results. > Otherwise, you're allowing for the possibility that the same scan > might sometimes skip and other times not skip, but then it seems hard > for the scan to survive cursor operations. Anyway, why would that be > useful? I assume that "sometimes skip and other times not skip" refers to the situation when we did a fetch forward and jumped something over, and then right away do a fetch backwards, where we don't actually need to skip anything and can get the result right away, right? If so, I can't find any comments about why it should be a problem for cursor operations.
On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com> >> > A bit of adjustment after nodes/relation -> nodes/pathnodes. >> >> I had a look on this. >> >> The name "index skip scan" is a different feature from the >> feature with that name in other products, which means "index scan >> with a postfix key (mainly of a multi-column key) that scans >> ignoring the prefix part". As Thomas suggested, I'd suggest that >> we call it "index hop scan". (I can accept Hopscotch, either :p) > > > I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this: Hi Jeff, "Hop scan" was just a stupid joke that occurred to me when I saw that DB2 had gone for "jump scan". I think "skip scan" is a perfectly good name and it's pretty widely used by now (for example, by our friends over at SQLite to blow us away at these kinds of queries). Yes, simple distinct value scans are indeed just the easiest kind of thing to do with this kind of scan-with-fast-forward. As discussed already in this thread and the earlier one there is a whole family of tricks you can do, and the thing that most people call an "index skip scan" is indeed the try-every-prefix case where you can scan an index on (a, b) given a WHERE clause b = x. Perhaps the simple distinct scan could appear as "Distinct Index Skip Scan"? 
And perhaps the try-every-prefix-scan could appear as just "Index Skip Scan"? Whether these are the same executor node is a good question; at one point I proposed a separate nest-loop-like node for the try-every-prefix-scan, but Robert shot that down pretty fast. I now suspect (as he said) that all of this belongs in the index scan node, as different modes. The behaviour is overlapping; for "Distinct Index Skip Scan" you skip to each distinct prefix and emit one tuple, whereas for "Index Skip Scan" you skip to each distinct prefix and then perform a regular scan for the prefix + the suffix, emitting matches. > (which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan) Nice. > I don't think we should give it a different name, just because our initial implementation is incomplete. +1 > Or do you think our implementation of one feature does not really get us closer to implementing the other? My question when lobbing the earlier sketch patches into the mailing list a few years back was: is this simple index AM interface and implementation (once debugged) powerful enough for various kinds of interesting skip-based plans? So far I have the impression that it does indeed work for Distinct Index Skip Scan (demonstrated), Index Skip Scan (no one has yet tried that), and special cases of extrema aggregate queries (foo, MIN(bar) can be performed by skip scan of index on (foo, bar)), but may not work for the semi-related merge join trickery mentioned in a paper posted some time back (though I don't recall exactly why). Another question is whether it should all be done by the index scan node, and I think the answer is yes. -- Thomas Munro https://enterprisedb.com
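The difference between the two modes Thomas describes can be sketched over a toy sorted list of (a, b) pairs. This is illustrative Python only (using float("inf") as a group sentinel, so it assumes numeric b values), not executor code: the try-every-prefix scan skips to each distinct value of a, then runs a regular search for (a, b_value) inside that group:

```python
import bisect

def prefix_skip_lookup(index, b_value):
    """Toy model of the try-every-prefix skip scan: `index` is a
    sorted list of (a, b) pairs and the predicate is on b only.
    Skip to each distinct prefix a, binary-search (a, b_value)
    within its group, then jump past the rest of the group."""
    matches, pos, n = [], 0, len(index)
    while pos < n:
        a = index[pos][0]
        # regular search for the suffix within this prefix group
        hit = bisect.bisect_left(index, (a, b_value), pos)
        while hit < n and index[hit] == (a, b_value):
            matches.append(index[hit])
            hit += 1
        # skip to the first entry with a larger leading value
        pos = bisect.bisect_right(index, (a, float("inf")), pos)
    return matches
```

Dropping the inner search and emitting one tuple per prefix group gives the "Distinct Index Skip Scan" behaviour, which is why the two modes overlap so much.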
On 3/1/19 12:03 AM, Thomas Munro wrote: > On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote: >> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com> >>>> A bit of adjustment after nodes/relation -> nodes/pathnodes. >>> >>> I had a look on this. >>> >>> The name "index skip scan" is a different feature from the >>> feature with that name in other products, which means "index scan >>> with a postfix key (mainly of a multi-column key) that scans >>> ignoring the prefix part". As Thomas suggested, I'd suggest that >>> we call it "index hop scan". (I can accept Hopscotch, either :p) >> >> >> I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this: > > Hi Jeff, > > "Hop scan" was just a stupid joke that occurred to me when I saw that > DB2 had gone for "jump scan". I think "skip scan" is a perfectly good > name and it's pretty widely used by now (for example, by our friends > over at SQLite to blow us away at these kinds of queries). > +1 to "skip scan" regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes <jeff.janes@gmail.com> wrote: > > This version of the patch can return the wrong answer. Yes, indeed. In fact it answers my previous question related to the backward cursor scan, where while going back we didn't skip enough. Within the current approach it can be fixed by proper skipping for backward scan, something like in the attached patch. Although there are still some rough edges, e.g. going forth, back and forth again leads to a situation where `_bt_first` is not applied anymore and the first element is wrongly skipped. I'll try to fix it with the next version of the patch. > If we accept this patch, I hope it would be expanded in the future to give > similar performance as the above query does even when the query is written in > its more natural way of: Yeah, I hope the current approach with a new index am routine can be extended for that. > Without the patch, instead of getting a wrong answer, I get an error: Right, as far as I can see without a skip scan and SCROLL, a unique + index scan is used, where amcanbackward is false by default. So it looks like it's not really patch-related.
> On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Although there are still some rough edges, e.g. going forth, back and forth > again leads to a situation where `_bt_first` is not applied anymore and the > first element is wrongly skipped. I'll try to fix it with the next version of > the patch. It turns out that `_bt_skip` was unnecessarily applied every time the scan was restarted from the beginning. Here is the fixed version of the patch.
At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com> > > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > Although there are still some rough edges, e.g. going forth, back and forth > > again leads to a situation where `_bt_first` is not applied anymore and the > > first element is wrongly skipped. I'll try to fix it with the next version of > > the patch. > > It turns out that `_bt_skip` was unnecessarily applied every time the scan was > restarted from the beginning. Here is the fixed version of the patch. > nbtsearch.c: In function ‘_bt_skip’: > nbtsearch.c:1292:11: error: ‘struct IndexScanDescData’ has no member named ‘xs_ctup’; did you mean ‘xs_itup’? > scan->xs_ctup.t_self = currItem->heapTid; Unfortunately a recent commit c2fe139c20 hit this. Date: Mon Mar 11 12:46:41 2019 -0700 > Index scans now store the result of a search in > IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the > target is not generally a HeapTuple anymore that seems cleaner. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hello. At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com> > > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > Although there are still some rough edges, e.g. going forth, back and forth > > again leads to a situation where `_bt_first` is not applied anymore and the > > first element is wrongly skipped. I'll try to fix it with the next version of > > the patch. > > It turns out that `_bt_skip` was unnecessarily applied every time the scan was > restarted from the beginning. Here is the fixed version of the patch. I have some comments on the latest v11 patch. L619: > + indexstate->ioss_NumDistinctKeys = node->distinctPrefix; The number of distinct prefix keys has various names in this patch. They should be unified as far as possible. L:728 > + root->distinct_pathkeys > 0) It is not an integer, but a list. L730: > + Path *subpath = (Path *) > + create_skipscan_unique_path(root, The name "subpath" here is not a subpath, but it can be removed by directly calling create_skipscan_unique_path in add_path. L:758 > +create_skipscan_unique_path(PlannerInfo *root, > + RelOptInfo *rel, > + Path *subpath, > + int numCols, The "subpath" is not a subpath. How about basepath, or orgpath? The "numCols" doesn't convey a clear meaning. unique_prefix_keys? L764: > + IndexPath *pathnode = makeNode(IndexPath); > + > + Assert(IsA(subpath, IndexPath)); > + > + /* We don't want to modify subpath, so make a copy. */ > + memcpy(pathnode, subpath, sizeof(IndexPath)); Why don't you just use copyObject()? L773: > + Assert(numCols > 0); Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ? L586: > + * Check if we need to skip to the next key prefix, because we've been > + * asked to implement DISTINCT. 
> + */ > + if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted) > + { > + if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys)) > + { > + /* Reached end of index. At this point currPos is invalidated, I thought a while on this bit. It seems that the lower layer must know whether it has emitted the first tuple. So I think that this code can be reduced as follows. > if (node->ioss_NumDistinctKeys && > !index_skip(scandesc, direction, node->ioss_NumDistinctKeys)) > return ExecClearTuple(slot); Then, index_skip returns true, doing nothing, if the scandesc is in the initial state. (Of course other index AMs can do something in the first call.) ioss_FirstTupleEmitted and the comment can be removed. By the way, this patch still seems to be forgetting about the explicit rescan case, but just doing this makes such consideration unnecessary. L1032: > + Index Only Scan using tenk1_four on public.tenk1 > + Output: four > + Scan mode: Skip scan The "Scan mode" has only one value and it is shown only for the "Index Only Scan" case. It seems to me that "Index Skip Scan" implies Index Only Scan. How about just "Index Skip Scan"? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
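The suggested simplification — letting the skip operation be a no-op on the very first call, so the executor needs no separate "first tuple emitted" flag — can be modeled with a toy scanner. This is illustrative Python over a sorted list, with hypothetical names, not the patch's C code:

```python
import bisect

class SkipScanner:
    """Toy model of a scan where the lower layer tracks the initial
    state itself: skip() succeeds without moving on the first call,
    so the caller's loop needs no 'first tuple emitted' flag."""
    def __init__(self, index):
        self.index = index
        self.pos = None          # None means the initial state

    def skip(self):
        if self.pos is None:     # first call: position, don't skip
            self.pos = 0
            return self.pos < len(self.index)
        # jump past the current group of equal values
        self.pos = bisect.bisect_right(self.index, self.index[self.pos],
                                       self.pos)
        return self.pos < len(self.index)

    def current(self):
        return self.index[self.pos]
```

The caller's loop then reduces to `while s.skip(): emit(s.current())`, mirroring the shortened executor code proposed above.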
> On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I have some comments on the latest v11 patch. Thank you! > L619: > > + indexstate->ioss_NumDistinctKeys = node->distinctPrefix; > > The number of distinct prefix keys has various names in this > patch. They should be unified as far as possible. Good point, I've renamed everything to skipPrefixSize; it seems to me that this name should be self-explanatory enough. > L:728 > > + root->distinct_pathkeys > 0) > > It is not an integer, but a list. Thanks for noticing, fixed (via a comparison with NIL, since we just need to know if this list is empty or not). > L730: > > + Path *subpath = (Path *) > > + create_skipscan_unique_path(root, > > The name "subpath" here is not a subpath, but it can be removed > by directly calling create_skipscan_unique_path in add_path. > > > L:758 > > +create_skipscan_unique_path(PlannerInfo *root, > > + RelOptInfo *rel, > > + Path *subpath, > > + int numCols, > > The "subpath" is not a subpath. How about basepath, or orgpath? > The "numCols" doesn't convey a clear meaning. unique_prefix_keys? I agree, the suggested names sound good. > L764: > > + IndexPath *pathnode = makeNode(IndexPath); > > + > > + Assert(IsA(subpath, IndexPath)); > > + > > + /* We don't want to modify subpath, so make a copy. */ > > + memcpy(pathnode, subpath, sizeof(IndexPath)); > > Why don't you just use copyObject()? Maybe I'm missing something, but I don't see that copyObject works with path nodes, does it? I've tried it with subpath directly and got `unrecognized node type`. > L773: > > + Assert(numCols > 0); > > Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ? Yeah, makes sense. > L586: > > + * Check if we need to skip to the next key prefix, because we've been > > + * asked to implement DISTINCT. 
> > + */ > > + if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted) > > + { > > + if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys)) > > + { > > + /* Reached end of index. At this point currPos is invalidated, > > I thought a while on this bit. It seems that the lower layer must > know whether it has emitted the first tuple. So I think that this > code can be reduced as follows. > > > if (node->ioss_NumDistinctKeys && > > !index_skip(scandesc, direction, node->ioss_NumDistinctKeys)) > > return ExecClearTuple(slot); > > Then, index_skip returns true, doing nothing, if the > scandesc is in the initial state. (Of course other index AMs can > do something in the first call.) ioss_FirstTupleEmitted and the > comment can be removed. I'm not sure then how to figure out when scandesc is in the initial state from inside index_skip without passing the node as an argument? E.g. in the case described in the comment, when we do fetch forward / fetch backward / fetch forward again. > L1032: > > + Index Only Scan using tenk1_four on public.tenk1 > > + Output: four > > + Scan mode: Skip scan > > The "Scan mode" has only one value and it is shown only for the > "Index Only Scan" case. It seems to me that "Index Skip Scan" > implies Index Only Scan. How about just "Index Skip Scan"? Do you mean to show "Index Only Scan", and then "Index Skip Scan" in the details, instead of "Scan mode"?
Attachment
> On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I have some comments on the latest v11 patch. > > Thank you! In the meantime here is a new version, rebased after tableam changes.
Attachment
> On Tue, Mar 19, 2019 at 2:07 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > I have some comments on the latest v11 patch. > > > > Thank you! > > In the meantime here is a new version, rebased after tableam changes. Rebase after refactoring of nbtree insertion scankeys. But so far it's purely mechanical, just to make it work - I guess I'll need to try to rewrite some parts of the patch that don't look natural now. And maybe leverage dynamic prefix truncation, per Peter's suggestion.
Attachment
> On Thu, Mar 28, 2019 at 11:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Rebase after refactoring of nbtree insertion scankeys. But so far it's purely > mechanical, just to make it work - I guess I'll need to try to rewrite some > parts of the patch, that don't look natural now, accordingly. Here is the updated version with the changes I was talking about (mostly about readability and code cleanup). I've also added a few tests for cursor behaviour.
Attachment
> On Sat, May 11, 2019 at 6:35 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Here is the updated version with the changes I was talking about (mostly about > readability and code cleanup). I've also added few tests for a cursor behaviour. And one more cosmetic rebase after pg_indent.
Attachment
After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential in speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth at the code itself, but I focused on some testing with real data that we have.
Let me start by sketching our primary use case for this, as it is similar to, but slightly different from, what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of an instrument update with a certain frequency. Each time a price updates, we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:
create table prices (feedcode text, updated_at timestamptz, value float8); -- there'll be some other columns as well, this is just an example
create unique index on prices (feedcode, updated_at desc);
This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.
Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update. If we know the feedcode, this is easy:
select * from prices where feedcode='A' and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1
Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:
select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc
Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower.
Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. E.g.
-- certain point in time
-- first query *all* distinct feedcodes (disregarding time), then do an index scan for every feedcode found to see if it has an update in the time window we're interested in
-- this essentially means we're doing 2*N index scans
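A sketch of that recursive approach, adapted to the example prices table from above (hedged: our production queries differ in the details; the CTE shape below follows the wiki's Loose Index Scan pattern rather than our actual code):

```sql
-- Sketch of the recursive loose index scan workaround (wiki pattern [1]),
-- adapted to the example prices table; not our production query.
WITH RECURSIVE feedcodes AS (
    (SELECT feedcode FROM prices ORDER BY feedcode LIMIT 1)
    UNION ALL
    -- one index probe per iteration: the next distinct feedcode
    SELECT (SELECT p.feedcode
            FROM prices p
            WHERE p.feedcode > f.feedcode
            ORDER BY p.feedcode
            LIMIT 1)
    FROM feedcodes f
    WHERE f.feedcode IS NOT NULL
)
-- second round: one index scan per feedcode for the point-in-time lookup
SELECT p.*
FROM feedcodes f
CROSS JOIN LATERAL (
    SELECT *
    FROM prices
    WHERE feedcode = f.feedcode
      AND updated_at <= '2019-06-01 12:00'
    ORDER BY updated_at DESC
    LIMIT 1
) p
WHERE f.feedcode IS NOT NULL;
```

Each scalar subquery in the recursive term is one index probe for the next distinct feedcode, and each LATERAL subquery is one more probe for the point-in-time row, hence the 2*N index scans mentioned above.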
I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.
So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query.
I have a few remarks though. I tested some of our queries with the patch and found that the following query would (with patch) work best now for arbitrary timestamps:
-- first query all distinct values using the index skip scan, then do an index scan for each of these
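A two-step form along these lines (a sketch against the example prices table; the inner DISTINCT is what the skip scan accelerates) would be:

```sql
-- Sketch: distinct feedcodes via index skip scan, then one index scan each
SELECT p.*
FROM (SELECT DISTINCT feedcode FROM prices) f   -- index skip scan
CROSS JOIN LATERAL (
    SELECT *
    FROM prices
    WHERE feedcode = f.feedcode
      AND updated_at <= '2019-06-01 12:00'
    ORDER BY updated_at DESC
    LIMIT 1
) p;
```

Feedcodes without any update before the cutoff simply contribute no row, since the LATERAL subquery returns nothing for them.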
While certainly a big improvement over the recursive CTE, it would be nice if the even simpler form with the 'distinct on' would work out of the box using an index skip scan.
select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc
As far as I can see, there are two main problems with that at the moment.
1) Only support for Index-Only scan at the moment, not for regular index scans. This was already mentioned upthread and I can understand that it was left out until now to constrain the scope of this. However, if we were to support 'distinct on' + selecting columns that are not part of the index we need a regular index scan instead of the index only scan.
2) The complicating part that we're interested in the value 'at a specific point in time'. This comparison with updated_at messes up all efficiency, as the index scan first looks at the *latest* updated_at for a certain feedcode and then walks the tree until it finds a tuple that matches the updated_at criteria (which possibly never happens, in which case it will happily walk over the full index). I'm actually unsure if there is anything we can do about this (without adding a lot of complexity), aside from just rewriting the query itself in the way I did where we're doing the skip scan for all distinct items followed by a second round of index scans for the specific point in time. I'm struggling a bit to explain this part clearly - I hope it's clear though. Please let me know if I should elaborate. Perhaps it's easiest to see by the difference in speed between the following two queries:
select distinct feedcode from prices -- approx 10ms
select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms
Both use the index skip scan, but the first one is very fast, because it can skip large parts of the index. The second one scans the full index, because it never finds any row that matches the where condition so it can never skip anything.
Thanks for this great patch. It is already very useful and fills a gap that has existed for a long time. It is going to make our queries so much more readable and performant if we won't have to resort to recursive CTEs anymore!
-Floris
Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?
Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.
The following query uses an index skip scan and returns just 1 row (incorrect!)
It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.
-Floris
On Sat, 1 Jun 2019 at 06:10, Floris Van Nee <florisvannee@optiver.com> wrote: > > Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check? > I am willing to give it a try. > Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data. > > > The following query uses an index skip scan and returns just 1 row (incorrect!) > > select distinct on (market, feedcode) market, feedcode > from streams.base_price > where market='TEST' > > The following query still uses the regular index scan and returns many more rows (correct) > select distinct on (market, feedcode) * > from streams.base_price > where market='TEST' > Aren't those two queries different? select distinct on (market, feedcode) market, feedcode vs select distinct on (market, feedcode) * Anyhow, it's just the difference in projection, so it doesn't matter much. I verified this scenario at my end and you are right, there is a bug. 
Here is my repeatable test case, create table t (market text, feedcode text, updated_at timestamptz, value float8) ; create index on t (market, feedcode, updated_at desc); insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88); insert into t values('TEST', 'jsgfhdfjd', (select timestamp '2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88); Now, without the patch, select distinct on (market, feedcode) market, feedcode from t where market='TEST'; market | feedcode --------+----------- TEST | abcdef TEST | jsgfhdfjd (2 rows) explain select distinct on (market, feedcode) market, feedcode from t where market='TEST'; QUERY PLAN ---------------------------------------------------------------- Unique (cost=12.20..13.21 rows=2 width=13) -> Sort (cost=12.20..12.70 rows=201 width=13) Sort Key: feedcode -> Seq Scan on t (cost=0.00..4.51 rows=201 width=13) Filter: (market = 'TEST'::text) (5 rows) And with the patch, select distinct on (market, feedcode) market, feedcode from t where market='TEST'; market | feedcode --------+---------- TEST | abcdef (1 row) explain select distinct on (market, feedcode) market, feedcode from t where market='TEST'; QUERY PLAN ------------------------------------------------------------------------------------------------ Index Only Scan using t_market_feedcode_updated_at_idx on t (cost=0.14..0.29 rows=2 width=13) Scan mode: Skip scan Index Cond: (market = 'TEST'::text) (3 rows) Notice that in the explain statement it shows correct number of rows to be skipped. -- Regards, Rafia Sabih
> On Sat, Jun 1, 2019 at 6:10 AM Floris Van Nee <florisvannee@optiver.com> wrote: > > After some talks with Jesper at PGCon about the Index Skip Scan, I started > testing this patch, because it seems to have great potential in speeding up > many of our queries (great conference by the way, really enjoyed my first > time being there!). I haven't looked in depth to the code itself, but I > focused on some testing with real data that we have. Thanks! > Actually I'd like to add something to this. I think I've found a bug in the > current implementation. Would someone be able to check? > > The following query uses an index skip scan and returns just 1 row (incorrect!) > > select distinct on (market, feedcode) market, feedcode > from streams.base_price > where market='TEST' > > The following query still uses the regular index scan and returns many more > rows (correct) > select distinct on (market, feedcode) * > from streams.base_price > where market='TEST' Yes, good catch, I'll investigate. Looks like in the current implementation something is not quite right, when we have this order of columns in an index and where clause (e.g. in the examples above everything seems fine if we create index over (feedcode, market) and not the other way around). > As far as I can see, there are two main problems with that at the moment. > > 1) Only support for Index-Only scan at the moment, not for regular index > scans. This was already mentioned upthread and I can understand that it > was left out until now to constrain the scope of this. However, if we were > to support 'distinct on' + selecting columns that are not part of the > index we need a regular index scan instead of the index only scan. Sure, it's something I hope we can tackle as the next step. 
> select distinct feedcode from prices -- approx 10ms > > select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms > > Both use the index skip scan, but the first one is very fast, because it can > skip large parts of the index. The second one scans the full index, because > it never finds any row that matches the where condition so it can never skip > anything. Interesting, I'll take a closer look.
Hi,
Thanks for the helpful replies.
> something is not quite right, when we have this order of columns in an index
> and where clause (e.g. in the examples above everything seems fine if we create
> index over (feedcode, market) and not the other way around).
> On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > I did a little bit of investigation and it seems to occur because in > pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if > there is an equality condition with a constant in the corresponding WHERE > clause. > ... > However, the index skip scan interprets this as that it has to skip over just > the first column. Right, passing the correct number of columns fixes this particular problem. But while debugging I've also discovered another related issue, where the current implementation makes a few assumptions that are not correct if we have an index condition and a distinct column is not the first in the index. I'll try to address these in the next version of the patch in the near future.
Hi Floris, On 6/1/19 12:10 AM, Floris Van Nee wrote: > Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data. > > > The following query uses an index skip scan and returns just 1 row (incorrect!) > > select distinct on (market, feedcode) market, feedcode > from streams.base_price > where market='TEST' > > The following query still uses the regular index scan and returns many more rows (correct) > select distinct on (market, feedcode) * > from streams.base_price > where market='TEST' > > > It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped. > > Thanks for taking a look at the patch, and your feedback on it. I'll def look into this once I'm back from my travels. Best regards, Jesper
Hi Rafia, On 6/1/19 6:03 AM, Rafia Sabih wrote: > Here is my repeatable test case, > > create table t (market text, feedcode text, updated_at timestamptz, > value float8) ; > create index on t (market, feedcode, updated_at desc); > insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10 > 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp > '2019-01-20 20:00:00') ), generate_series(1,100)*9.88); > insert into t values('TEST', 'jsgfhdfjd', (select timestamp > '2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - > timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88); > > Now, without the patch, > select distinct on (market, feedcode) market, feedcode from t where > market='TEST'; > market | feedcode > --------+----------- > TEST | abcdef > TEST | jsgfhdfjd > (2 rows) > explain select distinct on (market, feedcode) market, feedcode from t > where market='TEST'; > QUERY PLAN > ---------------------------------------------------------------- > Unique (cost=12.20..13.21 rows=2 width=13) > -> Sort (cost=12.20..12.70 rows=201 width=13) > Sort Key: feedcode > -> Seq Scan on t (cost=0.00..4.51 rows=201 width=13) > Filter: (market = 'TEST'::text) > (5 rows) > > And with the patch, > select distinct on (market, feedcode) market, feedcode from t where > market='TEST'; > market | feedcode > --------+---------- > TEST | abcdef > (1 row) > > explain select distinct on (market, feedcode) market, feedcode from t > where market='TEST'; > QUERY PLAN > ------------------------------------------------------------------------------------------------ > Index Only Scan using t_market_feedcode_updated_at_idx on t > (cost=0.14..0.29 rows=2 width=13) > Scan mode: Skip scan > Index Cond: (market = 'TEST'::text) > (3 rows) > > Notice that in the explain statement it shows correct number of rows > to be skipped. > Thanks for your test case; this is very helpful. 
For now, I would like to highlight that SET enable_indexskipscan = OFF can be used for testing with the patch applied. Dmitry and I will look at the feedback provided. Best regards, Jesper
> On Sat, Jun 1, 2019 at 6:57 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > > > I did a little bit of investigation and it seems to occur because in > > pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if > > there is an equality condition with a constant in the corresponding WHERE > > clause. > > ... > > However, the index skip scan interprets this as that it has to skip over just > > the first column. > > Right, passing correct number of columns fixes this particular problem. But > while debugging I've also discovered another related issue, when the current > implementation seems to have a few assumptions, that are not correct if we have > an index condition and a distinct column is not the first in the index. I'll > try to address these in a next version of the patch in the nearest future. So, as mentioned above, there were a few problems, namely the number of distinct_pathkeys with and without redundancy, and using _bt_search when the order of distinct columns doesn't match the index. As far as I can see, the problem in the latter case (when we have an index condition) is that it's still possible to find a value, but the lastItem value after the search is always zero (due to _bt_checkkeys filtering) and _bt_next stops right away. To address this, we can probably do something like in the attached patch. Together with distinct_pathkeys, uniq_distinct_pathkeys is stored, which is the same but without the constant elimination. It's then used for getting the real number of distinct keys, and to check the order of the columns so as not to consider an index skip scan if it's different. Hope it doesn't look too hacky. Also I've noticed that the current implementation wouldn't work e.g. for: select distinct a, a from table; because in this case the IndexPath is hidden behind a ProjectionPath. 
For now I guess it's fine, but probably it's possible to apply the skip scan here too.
Attachment
> Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is
> the same, but without the constants elimination. It's being used then for
> getting the real number of distinct keys, and to check the order of the columns
> to not consider index skip scan if it's different. Hope it doesn't
> look too hacky.
>
This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though the WHERE clause already specifies that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.
Hi, On 6/5/19 3:39 PM, Floris Van Nee wrote: > Thanks! I've verified that it works now. Here is a rebased version. > I was wondering if we're not too strict in some cases now though. Consider the following queries: [snip] > This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency. > I think it would be good to get more feedback on the patch in general before looking at further optimizations. We should of course fix any bugs that show up. Thanks for your testing and feedback! Best regards, Jesper
Attachment
I've previously noted upthread (along with several others) that I don't see a good reason to limit this new capability to index only scans. In addition to the reasons upthread, this also prevents using the new feature on physical replicas since index only scans require visibility map (IIRC) information that isn't safe to make assumptions about on a replica. That being said, it strikes me that this likely indicates an existing architecture issue. I was discussing the problem at PGCon with Andres and Heikki with respect to an index scan variation I've been working on myself. In short, it's not clear to me why we want index only scans and index scans to be entirely separate nodes, rather than optional variations within a broader index scan node. The problem becomes even more clear as we continue to add additional variants that lie on different axes, since we end up with an ever-multiplying number of combinations. In that discussion no one could remember why it'd been done that way, but I'm planning to try to find the relevant threads in the archives to see if there's anything in particular blocking combining them. I generally dislike gating improvements like this on seemingly tangentially related refactors, but I will make the observation that adding the skip scan on top of such a refactored index scan node would make this a much more obvious and complete win. As I noted to Jesper at PGCon I'm happy to review the code in detail also, but likely won't get to it until later this week or next week at the earliest. Jesper: Is there anything still on your list of things to change about the patch? Or would now be a good time to look hard at the code? James Coleman
On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Here is a rebased version. Hi Jesper, I read over this thread a few weeks ago while travelling back from PGCon. (I wish I'd read it on the outward trip instead since it would have been good to talk about it in person.) First off. I think this is a pretty great feature. It certainly seems worthwhile working on it. I've looked over the patch just to get a feel for how the planner part works and I have a few ideas to share. The code in create_distinct_paths() I think should work a different way. I think it would be much better to add a new field to Path and allow a path to know what keys it is distinct for. This sort of goes back to an idea I thought about when developing unique joins (9c7f5229ad) about an easier way to detect fields that a relation is unique for. I've been calling these "UniqueKeys" in a few emails [1]. The idea was to tag these onto RelOptInfo to mention which columns or exprs a relation is unique by so that we didn't continuously need to look at unique indexes in all the places that call relation_has_unique_index_for(). The idea there was that unique joins would know when a join was unable to duplicate rows. If the outer side of a join didn't duplicate the inner side, then the join RelOptInfo could keep the UniqueKeys from the inner rel, and vice-versa. If both didn't duplicate then the join rel would obtain the UniqueKeys from both sides of the join. The idea here is that this would be a better way to detect unique joins, and also when it came to the grouping planner we'd also know if the distinct or group by should be a no-op. DISTINCT could be skipped, and GROUP BY could do a group aggregate without any sort. 
I think these UniqueKeys tie into this work, perhaps not adding UniqueKeys to RelOptInfo, but just to Path so that we create paths that have UniqueKeys during create_index_paths() based on some uniquekeys that are stored in PlannerInfo, similar to how we create index paths in build_index_paths() by checking if the index has_useful_pathkeys(). Doing it this way would open up more opportunities to use skip scans. For example, semi-joins and anti-joins could make use of them if the uniquekeys covered the entire join condition. With this idea, the code you've added in create_distinct_paths() can just search for the cheapest path that has the correct uniquekeys, or if none exist then just do the normal sort->unique or hash agg. I'm not entirely certain how we'd instruct a semi/anti joined relation to build such paths, but that seems like a problem that could be dealt with when someone does the work to allow skip scans to be used for those. Also, I'm not entirely sure that these UniqueKeys should make use of PathKey since there's no real need to know about pk_opfamily, pk_strategy, pk_nulls_first as those all just describe how the keys are ordered. We just need to know if they're distinct or not. All that's left after removing those fields is pk_eclass, so could UniqueKeys just be a list of EquivalenceClass? or perhaps even a Bitmapset with indexes into PlannerInfo->ec_classes (that might be premature for now since we've not yet got https://commitfest.postgresql.org/23/1984/ or https://commitfest.postgresql.org/23/2019/ ) However, if we did use PathKey, that does allow us to quickly check if the UniqueKeys are contained within the PathKeys, since pathkeys are canonical which allows us just to compare their memory address to know if two are equal, however, if you're storing eclasses we could probably get the same just by comparing the address of the eclass to the pathkey's pk_eclass. 
Otherwise, I think how you're making use of paths in create_distinct_paths() and create_skipscan_unique_path() kind of contradicts how they're meant to be used. I also agree with James that this should not be limited to Index Only Scans. From testing the patch, the following seems pretty strange to me: # create table abc (a int, b int, c int); CREATE TABLE # insert into abc select a,b,1 from generate_Series(1,1000) a, generate_Series(1,1000) b; INSERT 0 1000000 # create index on abc(a,b); CREATE INDEX # explain analyze select distinct on (a) a,b from abc order by a,b; -- this is fast. QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200 width=8) (actual time=0.260..20.518 rows=1000 loops=1) Scan mode: Skip scan Heap Fetches: 1000 Planning Time: 5.616 ms Execution Time: 21.791 ms (5 rows) # explain analyze select distinct on (a) a,b,c from abc order by a,b; -- Add one more column and it's slow. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Unique (cost=0.42..50104.43 rows=200 width=12) (actual time=1.201..555.280 rows=1000 loops=1) -> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43 rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000 loops=1) Planning Time: 0.102 ms Execution Time: 555.407 ms (4 rows) [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi James, On 6/13/19 11:40 PM, James Coleman wrote: > I've previously noted upthread (along with several others), that I > don't see a good reason to limit this new capability to index only > scans. In addition to the reasons upthread, this also prevents using > the new feature on physical replicas since index only scans require > visibility map (IIRC) information that isn't safe to make assumptions > about on a replica. > > That being said, it strikes me that this likely indicates an existing > architecture issue. I was discussing the problem at PGCon with Andres > and Heiki with respect to an index scan variation I've been working on > myself. In short, it's not clear to me why we want index only scans > and index scans to be entirely separate nodes, rather than optional > variations within a broader index scan node. The problem becomes even > more clear as we continue to add additional variants that lie on > different axis, since we end up with an ever multiplying number of > combinations. > > In that discussion no one could remember why it'd been done that way, > but I'm planning to try to find the relevant threads in the archives > to see if there's anything in particular blocking combining them. > > I generally dislike gating improvements like this on seemingly > tangentially related refactors, but I will make the observation that > adding the skip scan on top of such a refactored index scan node would > make this a much more obvious and complete win. > Thanks for your feedback ! > As I noted to Jesper at PGCon I'm happy to review the code in detail > also, but likely won't get to it until later this week or next week at > the earliest. > > Jesper: Is there anything still on your list of things to change about > the patch? Or would now be a good time to look hard at the code? > It would be valuable to have test cases for your use-cases which works now, or should work. 
I revived Thomas' patch because it covered our use-cases and I saw it as a much-needed feature. Thanks again! Best regards, Jesper
Hi David, On 6/14/19 3:19 AM, David Rowley wrote: > I read over this thread a few weeks ago while travelling back from > PGCon. (I wish I'd read it on the outward trip instead since it would > have been good to talk about it in person.) > > First off. I think this is a pretty great feature. It certainly seems > worthwhile working on it. > > I've looked over the patch just to get a feel for how the planner part > works and I have a few ideas to share. > > The code in create_distinct_paths() I think should work a different > way. I think it would be much better to add a new field to Path and > allow a path to know what keys it is distinct for. This sort of goes > back to an idea I thought about when developing unique joins > (9c7f5229ad) about an easier way to detect fields that a relation is > unique for. I've been calling these "UniqueKeys" in a few emails [1]. > The idea was to tag these onto RelOptInfo to mention which columns or > exprs a relation is unique by so that we didn't continuously need to > look at unique indexes in all the places that call > relation_has_unique_index_for(). The idea there was that unique joins > would know when a join was unable to duplicate rows. If the outer side > of a join didn't duplicate the inner side, then the join RelOptInfo > could keep the UniqueKeys from the inner rel, and vice-versa. If both > didn't duplicate then the join rel would obtain the UniqueKeys from > both sides of the join. The idea here is that this would be a better > way to detect unique joins, and also when it came to the grouping > planner we'd also know if the distinct or group by should be a no-op. > DISTINCT could be skipped, and GROUP BY could do a group aggregate > without any sort. 
> > I think these UniqueKeys tie into this work, perhaps not adding > UniqueKeys to RelOptInfo, but just to Path so that we create paths > that have UniqueKeys during create_index_paths() based on some > uniquekeys that are stored in PlannerInfo, similar to how we create > index paths in build_index_paths() by checking if the index > has_useful_pathkeys(). Doing it this way would open up more > opportunities to use skip scans. For example, semi-joins and > anti-joins could make use of them if the uniquekeys covered the entire > join condition. With this idea, the code you've added in > create_distinct_paths() can just search for the cheapest path that has > the correct uniquekeys, or if none exist then just do the normal > sort->unique or hash agg. I'm not entirely certain how we'd instruct > a semi/anti joined relation to build such paths, but that seems like a > problem that could be dealt with when someone does the work to allow > skip scans to be used for those. > > Also, I'm not entirely sure that these UniqueKeys should make use of > PathKey since there's no real need to know about pk_opfamily, > pk_strategy, pk_nulls_first as those all just describe how the keys > are ordered. We just need to know if they're distinct or not. All > that's left after removing those fields is pk_eclass, so could > UniqueKeys just be a list of EquivalenceClass? or perhaps even a > Bitmapset with indexes into PlannerInfo->ec_classes (that might be > premature for now since we've not yet got > https://commitfest.postgresql.org/23/1984/ or > https://commitfest.postgresql.org/23/2019/ ) However, if we did use > PathKey, that does allow us to quickly check if the UniqueKeys are > contained within the PathKeys, since pathkeys are canonical which > allows us just to compare their memory address to know if two are > equal, however, if you're storing eclasses we could probably get the > same just by comparing the address of the eclass to the pathkey's > pk_eclass.
> > Otherwise, I think how you're making use of paths in > create_distinct_paths() and create_skipscan_unique_path() kind of > contradicts how they're meant to be used. > Thank you very much for this feedback ! Will need to revise the patch based on this. > I also agree with James that this should not be limited to Index Only > Scans. From testing the patch, the following seems pretty strange to > me: > > # create table abc (a int, b int, c int); > CREATE TABLE > # insert into abc select a,b,1 from generate_Series(1,1000) a, > generate_Series(1,1000) b; > INSERT 0 1000000 > # create index on abc(a,b); > CREATE INDEX > # explain analyze select distinct on (a) a,b from abc order by a,b; -- > this is fast. > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------- > Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200 > width=8) (actual time=0.260..20.518 rows=1000 loops=1) > Scan mode: Skip scan > Heap Fetches: 1000 > Planning Time: 5.616 ms > Execution Time: 21.791 ms > (5 rows) > > > # explain analyze select distinct on (a) a,b,c from abc order by a,b; > -- Add one more column and it's slow. > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------ > Unique (cost=0.42..50104.43 rows=200 width=12) (actual > time=1.201..555.280 rows=1000 loops=1) > -> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43 > rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000 > loops=1) > Planning Time: 0.102 ms > Execution Time: 555.407 ms > (4 rows) > > [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r > Ok, understood. I have put the CF entry into "Waiting on Author". Best regards, Jesper
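As an aside for readers following along, the planner mechanics David sketches above can be pictured in miniature like this (plain Python, every name invented for illustration; this is not PostgreSQL planner code):

```python
# Toy model of the suggestion: tag each path with the set of columns it
# is provably distinct on, then let the DISTINCT planner simply pick the
# cheapest path whose uniquekeys are covered by the DISTINCT clause.
# Every name below is invented for illustration.

class Path:
    def __init__(self, name, cost, uniquekeys=None):
        self.name = name
        self.cost = cost
        # Columns this path yields at most one row per value of,
        # or None if the path may emit duplicates.
        self.uniquekeys = uniquekeys

def cheapest_unique_path(paths, distinct_cols):
    """Cheapest path already distinct on distinct_cols, or None."""
    candidates = [p for p in paths
                  if p.uniquekeys is not None
                  and p.uniquekeys <= set(distinct_cols)]
    return min(candidates, key=lambda p: p.cost, default=None)

paths = [
    Path("seqscan", 100.0),                      # may emit duplicates
    Path("skipscan(b)", 1.3, uniquekeys={"b"}),  # one row per b
]

best = cheapest_unique_path(paths, ["b"])
print(best.name if best else "fall back to sort->unique / hash agg")
```

With paths tagged this way, create_distinct_paths() would not need to inspect indexes itself; the fallback to sort->unique or hash agg happens only when no suitably-unique path exists.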
> On Fri, Jun 14, 2019 at 9:20 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > The code in create_distinct_paths() I think should work a different > way. I think it would be much better to add a new field to Path and > allow a path to know what keys it is distinct for. This sort of goes > back to an idea I thought about when developing unique joins > (9c7f5229ad) about an easier way to detect fields that a relation is > unique for. I've been calling these "UniqueKeys" in a few emails [1]. > The idea was to tag these onto RelOptInfo to mention which columns or > exprs a relation is unique by so that we didn't continuously need to > look at unique indexes in all the places that call > relation_has_unique_index_for(). The idea there was that unique joins > would know when a join was unable to duplicate rows. If the outer side > of a join didn't duplicate the inner side, then the join RelOptInfo > could keep the UniqueKeys from the inner rel, and vice-versa. If both > didn't duplicate then the join rel would obtain the UniqueKeys from > both sides of the join. The idea here is that this would be a better > way to detect unique joins, and also when it came to the grouping > planner we'd also know if the distinct or group by should be a no-op. > DISTINCT could be skipped, and GROUP BY could do a group aggregate > without any sort. > > I think these UniqueKeys ties into this work, perhaps not adding > UniqueKeys to RelOptInfo, but just to Path so that we create paths > that have UniqueKeys during create_index_paths() based on some > uniquekeys that are stored in PlannerInfo, similar to how we create > index paths in build_index_paths() by checking if the index > has_useful_pathkeys(). Doing it this way would open up more > opportunities to use skip scans. For example, semi-joins and > anti-joins could make use of them if the uniquekeys covered the entire > join condition. 
With this idea, the code you've added in > create_distinct_paths() can just search for the cheapest path that has > the correct uniquekeys, or if none exist then just do the normal > sort->unique or hash agg. I'm not entirely certain how we'd instruct > a semi/anti joined relation to build such paths, but that seems like a > problem that could be dealt with when someone does the work to allow > skip scans to be used for those. > > Also, I'm not entirely sure that these UniqueKeys should make use of > PathKey since there's no real need to know about pk_opfamily, > pk_strategy, pk_nulls_first as those all just describe how the keys > are ordered. We just need to know if they're distinct or not. All > that's left after removing those fields is pk_eclass, so could > UniqueKeys just be a list of EquivalenceClass? or perhaps even a > Bitmapset with indexes into PlannerInfo->ec_classes (that might be > premature for not since we've not yet got > https://commitfest.postgresql.org/23/1984/ or > https://commitfest.postgresql.org/23/2019/ ) However, if we did use > PathKey, that does allow us to quickly check if the UniqueKeys are > contained within the PathKeys, since pathkeys are canonical which > allows us just to compare their memory address to know if two are > equal, however, if you're storing eclasses we could probably get the > same just by comparing the address of the eclass to the pathkey's > pk_eclass. Interesting, thanks for sharing this. > I also agree with James that this should not be limited to Index Only > Scans. From testing the patch, the following seems pretty strange to > me: > ... > explain analyze select distinct on (a) a,b from abc order by a,b; > explain analyze select distinct on (a) a,b,c from abc order by a,b; > ... Yes, but I believe this limitation is not intrinsic to the idea of the patch, and the very same approach can be used for IndexScan in the second example. 
I've already prepared a new version to enable it for IndexScan with minimal modifications; I just need to rebase it on top of the latest changes and then can post it. There would still be some limitations, I guess (e.g. the first thing I've stumbled upon is that an index scan with a filter wouldn't work well, because qual checking with a filter happens after ExecScanFetch).
> On Sun, Jun 16, 2019 at 5:03 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > I also agree with James that this should not be limited to Index Only > > Scans. From testing the patch, the following seems pretty strange to > > me: > > ... > > explain analyze select distinct on (a) a,b from abc order by a,b; > > explain analyze select distinct on (a) a,b,c from abc order by a,b; > > ... > > Yes, but I believe this limitation is not intrinsic to the idea of the patch, > and the very same approach can be used for IndexScan in the second example. > I've already prepared a new version to enable it for IndexScan with minimal > modifications, just need to rebase it on top of the latest changes and then > can post it. Although still there would be some limitations I guess (e.g. the > first thing I've stumbled upon is that an index scan with a filter wouldn't > work well, because checking qual causes with a filter happens after > ExecScanFetch) Here is what I was talking about, POC for an integration with index scan. About using of create_skipscan_unique_path and suggested planner improvements, I hope together with Jesper we can come up with something soon.
Attachment
Hi, On 6/19/19 9:57 AM, Dmitry Dolgov wrote: > Here is what I was talking about, POC for an integration with index scan. About > using of create_skipscan_unique_path and suggested planner improvements, I hope > together with Jesper we can come up with something soon. > I made some minor changes, but I did move all the code in create_distinct_paths() under enable_indexskipscan to limit the overhead if skip scan isn't enabled. Attached is v20, since the last patch should have been v19. Best regards, Jesper
Attachment
The following SQL statement seems to have incorrect results - some logic in the backwards scan is currently not entirely right. -Floris

drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 50000
CREATE INDEX
ANALYZE
 a |   b
---+-------
 5 | 10000
 5 |     1
 4 |     1
 3 |     1
 2 |     1
 1 |     1
(6 rows)

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)
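For reference, a naive re-implementation of the query's semantics (plain Python, outside PostgreSQL) makes the expected answer explicit - five rows, one per value of a, each paired with its highest b; the report above shows six rows, including a spurious (5, 1):

```python
# Naive reference implementation (plain Python, outside PostgreSQL) of
#   SELECT DISTINCT ON (a) a, b FROM a ORDER BY a DESC, b DESC;
# for the data inserted above: for each a keep the row with the highest
# b, and return the rows ordered by a descending.

rows = [(v, k) for v in range(1, 6) for k in range(1, 10001)]

def distinct_on_desc(rows):
    best = {}
    for a, b in rows:
        if a not in best or b > best[a]:
            best[a] = b
    return sorted(best.items(), key=lambda t: -t[0])

print(distinct_on_desc(rows))
# -> [(5, 10000), (4, 10000), (3, 10000), (2, 10000), (1, 10000)]
```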
> The following sql statement seems to have incorrect results - some logic in
> the backwards scan is currently not entirely right.
Thanks for testing! You're right - it looks like in the current implementation,
in the case of a backwards scan, there is one unnecessary extra step forward. It
seems this mistake was made because I was concentrating only on backward scans
with a cursor, and used a not-quite-correct approach to wrap up after a scan
was finished. Give me a moment, I'll tighten it up.
On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee <florisvannee@optiver.com> wrote: > Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents of an IndexTuple? I saw there's some tools out there that can pretty print plans, but viewing tuples is more complicated I guess. Try the attached patch -- it has DEBUG1 traces with the contents of index tuples from key points during index scans, allowing you to see what's going on both as a B-Tree is descended, and as a range scan is performed. It also shows details of the insertion scankey that is set up within _bt_first(). This hasn't been adapted to the patch at all, so you'll probably need to do that. The patch should be considered a very rough hack, for now. It leaks memory like crazy. But I think that you'll find it helpful. -- Peter Geoghegan
Attachment
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2).
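The stepping described here can be mimicked outside the backend (a conceptual Python sketch, not nbtree code): after returning a tuple, the scan re-descends to the first tuple whose distinct-prefix is strictly greater, rather than reading every tuple on the way.

```python
# Conceptual sketch (plain Python, not nbtree code) of the stepping
# described above: return the current tuple, then "skip" by searching
# for the first tuple whose 1-column prefix is strictly greater,
# analogous to a fresh _bt_search descent instead of a page-by-page walk.

from bisect import bisect_right

def skip_scan(index, prefix_len=1):
    """Yield one tuple per distinct prefix from a sorted tuple list."""
    pos = 0
    while pos < len(index):
        tup = index[pos]
        yield tup
        # Binary-search past every tuple sharing this prefix.
        pos = bisect_right(index, tup[:prefix_len] + (float("inf"),))

# Index on (b, a) with 10 rows for each of the b values 1, 2, 3.
index = sorted((b, a) for a in range(1, 11) for b in (1, 2, 3))
print(list(skip_scan(index)))  # -> [(1, 1), (2, 1), (3, 1)]
```

The interesting failure modes in the patch are exactly the places where this picture is too simple: scan keys, filters, and direction changes all interact with where the "descent" lands.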
Hi,

I've done some initial review on v20 - just reading through the code, no tests at this point. Here are my comments:

1) config.sgml

I'm not sure why the enable_indexskipscan section says

    This parameter requires that <varname>enable_indexonlyscan</varname>
    is <literal>on</literal>.

I suppose it's the same thing as for enable_indexscan, and we don't say anything like that for that GUC.

2) indices.sgml

The new section is somewhat unclear and difficult to understand, I think it'd deserve a rewording. Also, I wonder if we really should link to the wiki page about FSM problems. We have a couple of wiki links in the sgml docs, but those seem more generic while this seems like a development page that might disappear. But more importantly, that wiki page does not say anything about "Loose Index scans" so is it even the right wiki page?

3) nbtsearch.c

_bt_skip - comments are formatted incorrectly
_bt_update_skip_scankeys - missing comment
_bt_scankey_within_page - missing comment

4) explain.c

There are duplicate blocks of code for IndexScan and IndexOnlyScan:

    if (indexscan->skipPrefixSize > 0)
    {
        if (es->format != EXPLAIN_FORMAT_TEXT)
            ExplainPropertyInteger("Distinct Prefix", NULL,
                                   indexscan->skipPrefixSize, es);
    }

I suggest we wrap this into a function ExplainIndexSkipScanKeys() or something like that. Also, there's this:

    if (((IndexScan *) plan)->skipPrefixSize > 0)
    {
        ExplainPropertyText("Scan mode", "Skip scan", es);
    }

That does not make much sense - there's just a single 'scan mode' value. So I suggest we do the same thing as for unique joins, i.e.

    ExplainPropertyBool("Skip Scan",
                        (((IndexScan *) plan)->skipPrefixSize > 0), es);

5) nodeIndexOnlyScan.c

In ExecInitIndexOnlyScan, we should initialize the ioss_ fields a bit later, with the existing ones. This is just a cosmetic issue, though.

6) nodeIndexScan.c

I wonder why we even add and initialize the ioss_ fields for IndexScan nodes, when the skip scans require index-only scans?
7) pathnode.c

I wonder how much the costing was discussed. It seems to me the logic is fairly similar to ideas discussed in the incremental sort patch, and we've been discussing some weak points there. I'm not sure how much we need to consider those issues here.

8) execnodes.h

The comment before IndexScanState mentions a new field NumDistinctKeys, but there's no such field added to the struct.

9) pathnodes.h

I don't understand what the uniq_distinct_pathkeys comment says :-(

10) plannodes.h

The naming of the new field (skipPrefixSize) added to IndexScan and IndexOnlyScan is clearly inconsistent with the naming convention of the existing fields.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > However, _bt_readpage messes things up, because it only reads tuples that > match all the provided keys (so where b=2) Right, the problem you've reported first had a similar origin. I'm starting to think that using _bt_readpage just like that is probably not exactly the right thing to do, since the correct element is already found and there is no need to check if tuples are matching after one step back. I'll try to avoid it in the next version of the patch. > I was wondering about something else: don't we also have another problem with > updating this current index tuple by skipping before calling > btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples > when scan->kill_prior_tuple is true. I'm not too familiar with the concept of > killing dead tuples while doing index scans, but by looking at the code it > seems to be possible that btgettuple returns a tuple, caller processes it and > sets kill_prior_tuple to true in order to have it killed. However, then the > skip scan kicks in, which sets the current tuple to a completely different > tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my > reasoning correct here or am I missing something? Need to check, but probably we can avoid that by setting kill_prior_tuple to false in case of skip scan, as in index_rescan. > On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I've done some initial review on v20 - just reading through the code, no > tests at this point. Here are my comments: Thank you! > 2) indices.sgml > > The new section is somewhat unclear and difficult to understand, I think > it'd deserve a rewording. Also, I wonder if we really should link to the > wiki page about FSM problems. We have a couple of wiki links in the sgml > docs, but those seem more generic while this seems as a development page > that might disapper.
But more importantly, that wiki page does not say > anything about "Loose Index scans" so is it even the right wiki page? Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already mentioned it too, so probably I'm going to remove it (and instead describe the idea in a few words in the documentation itself). > 6) nodeIndexScan.c > > I wonder why we even add and initialize the ioss_ fields for IndexScan > nodes, when the skip scans require index-only scans? Skip scans required index-only scans until recently, when the patch was updated to incorporate the same approach for index scans too. My apologies, looks like the documentation and some comments are still inconsistent about this topic. > 7) pathnode.c > > I wonder how much was the costing discussed. It seems to me the logic is > fairly similar to ideas discussed in the incremental sort patch, and > we've been discussing some weak points there. I'm not sure how much we > need to consider those issues here. Can you elaborate in a few words on which issues you mean? Is it about the non-uniform distribution of distinct values? If so, I believe it's partially addressed when we have to skip too often, by searching the next index page. Although yeah, there is still an assumption about a uniform distribution of distinct groups at planning time. > 9) pathnodes.h > > I don't understand what the uniq_distinct_pathkeys comment says :-( Yeah, sorry, I'll try to improve the comments in the next version, where I'm going to address all the feedback.
On Mon, Jun 24, 2019 at 01:44:14PM +0200, Dmitry Dolgov wrote: >> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote: >> >> However, _bt_readpage messes things up, because it only reads tuples that >> match all the provided keys (so where b=2) > >Right, the problem you've reported first had a similar origins. I'm starting to >think that probably using _bt_readpage just like that is not exactly right >thing to do, since the correct element is already found and there is no need to >check if tuples are matching after one step back. I'll try to avoid it in the >next version of patch. > >> I was wondering about something else: don't we also have another problem with >> updating this current index tuple by skipping before calling >> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples >> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of >> killing dead tuples while doing index scans, but by looking at the code it >> seems to be possible that btgettuple returns a tuple, caller processes it and >> sets kill_prior_tuple to true in order to have it killed. However, then the >> skip scan kicks in, which sets the current tuple to a completely different >> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my >> reasoning correct here or am I missing something? > >Need to check, but probably we can avoid that by setting kill_prior_tuple to >false in case of skip scan as in index_rescan. > >> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> I've done some initial review on v20 - just reading through the code, no >> tests at this point. Here are my comments: > >Thank you! > >> 2) indices.sgml >> >> The new section is somewhat unclear and difficult to understand, I think >> it'd deserve a rewording. Also, I wonder if we really should link to the >> wiki page about FSM problems. 
We have a couple of wiki links in the sgml >> docs, but those seem more generic while this seems as a development page >> that might disapper. But more importantly, that wiki page does not say >> anything about "Loose Index scans" so is it even the right wiki page? > >Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already >mentioned it too, so probably I'm going to remove it (and instead describe the >idea in a few words in the documentation itself). > >> 6) nodeIndexScan.c >> >> I wonder why we even add and initialize the ioss_ fields for IndexScan >> nodes, when the skip scans require index-only scans? > >Skip scans required index-only scans until recently, when the patch was updated >to incorporate the same approach for index scans too. My apologies, looks like >documentation and some commentaries are still inconsistent about this topic. > Yes, if that's the case then various bits of docs and comments are rather misleading, and fields in IndexScanState should be named 'iss_'. >> 7) pathnode.c >> >> I wonder how much was the costing discussed. It seems to me the logic is >> fairly similar to ideas discussed in the incremental sort patch, and >> we've been discussing some weak points there. I'm not sure how much we >> need to consider those issues here. > >Can you please elaborate in a few words, which issues do you mean? Is it about >non uniform distribution of distinct values? If so, I believe it's partially >addressed when we have to skip too often, by searching a next index page. >Although yeah, there is still an assumption about uniform distribution of >distinct groups at the planning time. > Right, it's mostly about what happens when the group sizes are not close to the average size. The question is what happens in such cases - how much slower will the plan be, compared to the "current" plan without a skip scan? I don't have a very good idea of the additional overhead associated with skip scans - presumably they're a bit more expensive, right?
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
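To make the planning-time assumption under discussion concrete, here is a deliberately crude cost sketch (all constants and formulas invented for illustration; this is not the patch's costing code): a skip scan is charged roughly one B-tree descent per distinct group, a plain scan one unit per tuple.

```python
# Deliberately crude cost sketch (all constants invented; this is NOT
# the patch's costing code): charge a skip scan roughly one B-tree
# descent per distinct group, and a plain scan one unit of work per
# tuple.

import math

DESCENT_UNITS = 1.0   # invented price of one root-to-leaf descent

def skip_scan_cost(ntuples, ndistinct, fanout=100):
    height = max(1, math.ceil(math.log(ntuples, fanout)))
    return ndistinct * height * DESCENT_UNITS

def full_scan_cost(ntuples, per_tuple=0.005):
    return ntuples * per_tuple

n = 10_000_000
print(skip_scan_cost(n, 3), full_scan_cost(n))          # tiny vs. ~50000
print(skip_scan_cost(n, 5_000_000), full_scan_cost(n))  # skipping now loses
```

Because the model only sees ndistinct, a skewed distribution (one huge group plus many tiny ones) gets the same estimate as the uniform case, which is exactly the weak point Tomas is asking about.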
On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Attached is v20, since the last patch should have been v19. I took this for a quick spin today. The DISTINCT ON support is nice and I think it will be very useful. I've signed up to review it and will have more to say later. But today I had a couple of thoughts after looking into how src/backend/optimizer/plan/planagg.c works and wondering how to do some more skipping tricks with the existing machinery. 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or AVG(DISTINCT ...) or any other aggregate). Right now you get a seq scan, with the sort/unique logic inside the Aggregate node. If you write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get a skip scan that is much faster in good cases. I suppose you could have a process_distinct_aggregates() in planagg.c that recognises queries of the right form and generates extra paths a bit like build_minmax_path() does. I think it's probably better to consider that in the grouping planner proper instead. I'm not sure. 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP BY i if you're allowed to go backwards. Those queries are equivalent to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC] (though as Floris noted, the backwards version gives the wrong answers with v20). That does seem like a much more specific thing applicable only to MIN and MAX, and I think preprocess_minmax_aggregates() could be taught to handle that sort of query, building an index only scan path with skip scan in build_minmax_path(). -- Thomas Munro https://enterprisedb.com
On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: > I took this for a quick spin today. The DISTINCT ON support is nice > and I think it will be very useful. I've signed up to review it and > will have more to say later. But today I had a couple of thoughts > after looking into how src/backend/optimizer/plan/planagg.c works and > wondering how to do some more skipping tricks with the existing > machinery. > > 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or > AVG(DISTINCT ...) or any other aggregate). Right now you get a seq > scan, with the sort/unique logic inside the Aggregate node. If you > write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get > a skip scan that is much faster in good cases. I suppose you could > have a process_distinct_aggregates() in planagg.c that recognises > queries of the right form and generates extra paths a bit like > build_minmax_path() does. I think it's probably better to consider > that in the grouping planner proper instead. I'm not sure. I think to make that happen we'd need to do a bit of an overhaul in nodeAgg.c to allow it to make use of presorted results instead of having the code blindly sort rows for each aggregate that has a DISTINCT or ORDER BY. The planner would also then need to start requesting paths with pathkeys that suit the aggregate, and probably dictate the order in which the AggRefs should be evaluated, so that all AggRefs that can be evaluated for a given sort order are. Once that part is done, the aggregates could also request paths with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd need to be pretty careful with that one since there may be other parts of the query that require that all rows are fed in, not just 1 row per value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't just feed through 1 row for each "i" value, since we need only the ones that have "z > 0".
Getting the first part of this solved is much more important than making skip scans work here, I'd say. I think we need to be able to walk before we can run with DISTINCT / ORDER BY aggs. > 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if > you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP > BY i if you're allowed to go backwards. Those queries are equivalent > to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC] > (though as Floris noted, the backwards version gives the wrong answers > with v20). That does seem like a much more specific thing applicable > only to MIN and MAX, and I think preprocess_minmax_aggregates() could > be taught to handle that sort of query, building an index only scan > path with skip scan in build_minmax_path(). For the MIN query you just need a path with Pathkeys: { i ASC, j ASC }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC. The more I think about these UniqueKeys, the more I think they need to be a separate concept to PathKeys. For example, UniqueKeys: { x, y } should be equivalent to { y, x }, but with PathKeys, that's not the case, since the order of each key matters. UniqueKeys equivalent version of pathkeys_contained_in() would not care about the order of individual keys, it would say things like, { a, b, c } is contained in { b, a }, since if the path is unique on columns { b, a } then it must also be unique on { a, b, c }. [1] https://www.postgresql.org/message-id/CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
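The order-insensitive containment David describes reduces to a plain set comparison (illustrative Python; the function name is invented, this is not planner code):

```python
# The order-insensitive containment described above, as plain sets
# (illustrative only, not planner code): uniqueness on a column set
# implies uniqueness on any superset, regardless of column order.

def uniquekeys_contained_in(required, path_uniquekeys):
    """True if uniqueness on path_uniquekeys implies uniqueness on required."""
    return set(path_uniquekeys) <= set(required)

print(uniquekeys_contained_in({"a", "b", "c"}, {"b", "a"}))  # True
print(uniquekeys_contained_in({"a"}, {"b", "a"}))            # False
```

This is precisely where UniqueKeys part ways with pathkeys_contained_in(): for pathkeys the key order matters, so no simple set subset test would do.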
Hi Thomas and David, Thanks for the feedback ! On 7/2/19 8:27 AM, David Rowley wrote: > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: >> I took this for a quick spin today. The DISTINCT ON support is nice >> and I think it will be very useful. I've signed up to review it and >> will have more to say later. But today I had a couple of thoughts >> after looking into how src/backend/optimizer/plan/planagg.c works and >> wondering how to do some more skipping tricks with the existing >> machinery. >> >> 1. SELECT COUNT(DISTINCT i) FROM t could benefit from this. (Or >> AVG(DISTINCT ...) or any other aggregate). Right now you get a seq >> scan, with the sort/unique logic inside the Aggregate node. If you >> write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get >> a skip scan that is much faster in good cases. I suppose you could >> have a process_distinct_aggregates() in planagg.c that recognises >> queries of the right form and generates extra paths a bit like >> build_minmax_path() does. I think it's probably better to consider >> that in the grouping planner proper instead. I'm not sure. > > I think to make that happen we'd need to do a bit of an overhaul in > nodeAgg.c to allow it to make use of presorted results instead of > having the code blindly sort rows for each aggregate that has a > DISTINCT or ORDER BY. The planner would also then need to start > requesting paths with pathkeys that suit the aggregate and also > probably dictate the order the AggRefs should be evaluated to allow > all AggRefs to be evaluated that can be for each sort order. 
Once > that part is done then the aggregates could then also request paths > with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd > need to be pretty careful with that one since there may be other parts > of the query that require that all rows are fed in, not just 1 row per > value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't > just feed through 1 row for each "i" value, since we need only the > ones that have "z > 0". Getting the first part of this solved is much > more important than making skip scans work here, I'd say. I think we > need to be able to walk before we can run with DISTINCT / ORDER BY > aggs. > I agree that the above is outside of scope for the first patch -- I think the goal should be the simple use-cases for IndexScan and IndexOnlyScan. Maybe we should expand [1] with possible cases, so we don't lose track of the ideas. >> 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if >> you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP >> BY i if you're allowed to go backwards. Those queries are equivalent >> to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC] >> (though as Floris noted, the backwards version gives the wrong answers >> with v20). That does seem like a much more specific thing applicable >> only to MIN and MAX, and I think preprocess_minmax_aggregates() could >> be taught to handle that sort of query, building an index only scan >> path with skip scan in build_minmax_path(). > > For the MIN query you just need a path with Pathkeys: { i ASC, j ASC > }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC. > Ok. > The more I think about these UniqueKeys, the more I think they need to > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > should be equivalent to { y, x }, but with PathKeys, that's not the > case, since the order of each key matters. 
UniqueKeys equivalent > version of pathkeys_contained_in() would not care about the order of > individual keys, it would say things like, { a, b, c } is contained in > { b, a }, since if the path is unique on columns { b, a } then it must > also be unique on { a, b, c }. > I'm looking at this, and will keep this in mind. Thanks ! [1] https://wiki.postgresql.org/wiki/Loose_indexscan Best regards, Jesper
On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote: > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: > > The more I think about these UniqueKeys, the more I think they need to > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > should be equivalent to { y, x }, but with PathKeys, that's not the > case, since the order of each key matters. UniqueKeys equivalent > version of pathkeys_contained_in() would not care about the order of > individual keys, it would say things like, { a, b, c } is contained in > { b, a }, since if the path is unique on columns { b, a } then it must > also be unique on { a, b, c }. Is that actually true, though? I can see unique {a, b, c} => unique {a, b}, but for example:

a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5

is unique on {a, b, c} but not on {a, b}, at least as I understand the way "unique" is used here, which is 3 distinct {a, b, c}, but only one {a, b}. Or I could be missing something obvious, and in that case, please ignore. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote: > > On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote: > > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: > > > > The more I think about these UniqueKeys, the more I think they need to > > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > > should be equivalent to { y, x }, but with PathKeys, that's not the > > case, since the order of each key matters. UniqueKeys equivalent > > version of pathkeys_contained_in() would not care about the order of > > individual keys, it would say things like, { a, b, c } is contained in > > { b, a }, since if the path is unique on columns { b, a } then it must > > also be unique on { a, b, c }. > > Is that actually true, though? I can see unique {a, b, c} => unique > {a, b}, but for example: > > a | b | c > --|---|-- > 1 | 2 | 3 > 1 | 2 | 4 > 1 | 2 | 5 > > is unique on {a, b, c} but not on {a, b}, at least as I understand the > way "unique" is used here, which is 3 distinct {a, b, c}, but only one > {a, b}. > > Or I could be missing something obvious, and in that case, please > ignore. I think that example is the opposite direction of what David (Rowley) is saying. Unique on {a, b} implies unique on {a, b, c} while you're correct that the inverse doesn't hold. Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c} and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this different from pathkeys. James Coleman
On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote: > > On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote: > > > > On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote: > > > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: > > > > > > The more I think about these UniqueKeys, the more I think they need to > > > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > > > should be equivalent to { y, x }, but with PathKeys, that's not the > > > case, since the order of each key matters. UniqueKeys equivalent > > > version of pathkeys_contained_in() would not care about the order of > > > individual keys, it would say things like, { a, b, c } is contained in > > > { b, a }, since if the path is unique on columns { b, a } then it must > > > also be unique on { a, b, c }. > > > > Is that actually true, though? I can see unique {a, b, c} => unique > > {a, b}, but for example: > > > > a | b | c > > --|---|-- > > 1 | 2 | 3 > > 1 | 2 | 4 > > 1 | 2 | 5 > > > > is unique on {a, b, c} but not on {a, b}, at least as I understand the > > way "unique" is used here, which is 3 distinct {a, b, c}, but only one > > {a, b}. > > > > Or I could be missing something obvious, and in that case, please > > ignore. > > I think that example is the opposite direction of what David (Rowley) > is saying. Unique on {a, b} implies unique on {a, b, c} while you're > correct that the inverse doesn't hold. > > Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c} > and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this > different from pathkeys. Yeah, exactly. A superset of the unique columns is still unique. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 04, 2019 at 10:06:11AM +1200, David Rowley wrote: > On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote: > > I think that example is the opposite direction of what David (Rowley) > > is saying. Unique on {a, b} implies unique on {a, b, c} while you're > > correct that the inverse doesn't hold. > > > > Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c} > > and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this > > different from pathkeys. > > Yeah, exactly. A superset of the unique columns is still unique. Thanks for clarifying! Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
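For the record, the implication discussed above only runs one way, and a small standalone check makes it concrete. This is Python purely for illustration (is_unique_on is an invented helper), using David Fetter's sample rows:

```python
def is_unique_on(rows, cols):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

# David Fetter's example: unique on {a, b, c} but NOT on {a, b}.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 4},
    {"a": 1, "b": 2, "c": 5},
]
assert is_unique_on(rows, ["a", "b", "c"])
assert not is_unique_on(rows, ["a", "b"])

# The other direction always holds: rows unique on {a, b} remain unique
# on every superset of those columns, in any key order.
rows2 = [{"a": 1, "b": 1, "c": 9}, {"a": 1, "b": 2, "c": 9}]
assert is_unique_on(rows2, ["a", "b"])
assert is_unique_on(rows2, ["b", "a"])
assert is_unique_on(rows2, ["c", "b", "a"])
```

This is exactly why a UniqueKey set check ignores ordering while a PathKey check cannot.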
On Wed, Jul 3, 2019 at 12:27 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote: > > 2. SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if > > you're allowed to go forwards. Same for SELECT i, MAX(j) FROM t GROUP > > BY i if you're allowed to go backwards. Those queries are equivalent > > to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC] > > (though as Floris noted, the backwards version gives the wrong answers > > with v20). That does seem like a much more specific thing applicable > > only to MIN and MAX, and I think preprocess_minmax_aggregates() could > > be taught to handle that sort of query, building an index only scan > > path with skip scan in build_minmax_path(). > > For the MIN query you just need a path with Pathkeys: { i ASC, j ASC > }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC. While updating the Loose Index Scan wiki page with links to other products' terminology on this subject, I noticed that MySQL can skip-scan MIN() and MAX() in the same query. Hmm. That seems quite desirable. I think it requires a new kind of skipping: I think you have to be able to skip to the first AND last key that has each distinct prefix, and then stick a regular agg on top to collapse them into one row. Such a path would not be so neatly describable by UniqueKeys, or indeed by the amskip() interface in the current patch. I mention all this stuff not because I want us to run before we can walk, but because to be ready to commit the basic distinct skip scan feature, I think we should know approximately how it'll handle the future stuff we'll need. -- Thomas Munro https://enterprisedb.com
> > On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > The following sql statement seems to have incorrect results - some logic in > > the backwards scan is currently not entirely right. > > Thanks for testing! You're right, looks like in the current implementation in > case of backwards scan there is one unnecessary extra step forward. It seems > this mistake was made, since I was concentrating only on the backward scans > with a cursor, and used a not exactly correct approach to wrap up after a scan > was finished. Give me a moment, I'll tighten it up. Here is finally a new version of the patch, where all the mentioned issues seem to be fixed and the corresponding new tests should keep it like that (I've skipped all the pubs at PostgresLondon for that). Also I've addressed most of the feedback from Tomas, except the points about planning improvements (which are still on our todo list). By no means is it a final result (e.g. I guess `_bt_read_closest` must be improved), but I hope making progress step-by-step will help anyway. Also I've fixed some, how it is popular to say here, brain fade, where I mixed up scan directions. > On Tue, Jul 2, 2019 at 11:00 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: > > Attached is v20, since the last patch should have been v19. > > I took this for a quick spin today. The DISTINCT ON support is nice > and I think it will be very useful. I've signed up to review it and > will have more to say later. But today I had a couple of thoughts > after looking into how src/backend/optimizer/plan/planagg.c works and > wondering how to do some more skipping tricks with the existing > machinery.
> > On Thu, Jul 4, 2019 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I mention all this stuff not because I want us to run before we can > walk, but because to be ready to commit the basic distinct skip scan > feature, I think we should know approximately how it'll handle the > future stuff we'll need. Great, thank you! I agree with Jesper that probably some parts of this are outside of scope for the first patch, but we definitely can take a look at what needs to be done to make the current implementation more flexible, so the follow up would be just natural.
Hi, On 7/4/19 6:59 AM, Thomas Munro wrote: >> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC >> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC. > David, are you thinking about something like the attached? Some questions.

* Do you see UniqueKey as a "complete" planner node?
  - I didn't update the nodes/*.c files for this yet
* Is a UniqueKey with a list of EquivalenceClass best, or a list of UniqueKey with a single EquivalenceClass?

Likely more questions around this coming -- should this be a separate thread? Based on this I'll start to update the v21 patch to use UniqueKey, and post a new version. > While updating the Loose Index Scan wiki page with links to other > products' terminology on this subject, I noticed that MySQL can > skip-scan MIN() and MAX() in the same query. Hmm. That seems quite > desirable. I think it requires a new kind of skipping: I think you > have to be able to skip to the first AND last key that has each > distinct prefix, and then stick a regular agg on top to collapse them > into one row. Such a path would not be so neatly describable by > UniqueKeys, or indeed by the amskip() interface in the current patch. > I mention all this stuff not because I want us to run before we can > walk, but because to be ready to commit the basic distinct skip scan > feature, I think we should know approximately how it'll handle the > future stuff we'll need. > Thomas, do you have any ideas for this? I can see that MySQL did the functionality in two change sets (base and function support), but like you said we shouldn't paint ourselves into a corner. Feedback greatly appreciated. Best regards, Jesper
On Wed, Jul 10, 2019 at 1:32 AM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > > While updating the Loose Index Scan wiki page with links to other > > products' terminology on this subject, I noticed that MySQL can > > skip-scan MIN() and MAX() in the same query. Hmm. That seems quite > > desirable. I think it requires a new kind of skipping: I think you > > have to be able to skip to the first AND last key that has each > > distinct prefix, and then stick a regular agg on top to collapse them > > into one row. Such a path would not be so neatly describable by > > UniqueKeys, or indeed by the amskip() interface in the current patch. > > I mention all this stuff not because I want us to run before we can > > walk, but because to be ready to commit the basic distinct skip scan > > feature, I think we should know approximately how it'll handle the > > future stuff we'll need. > > Thomas, do you have any ideas for this? I can see that MySQL did the > functionality in two change sets (base and function support), but like > you said we shouldn't paint ourselves into a corner. I think amskip() could be augmented by later patches to take a parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in the current patch is SKIP_FIRST behaviour. Being able to choose SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i) ORDER BY i (ie forward scan for the order, but wanting the highest key for each distinct prefix), and being able to choose both allows for i, MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that allows you to ask for both might be incompatible with David's suggestion of tracking UniqueKeys in paths, but now I see that it isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys and therefore require a regular agg on top.
I suspect that's OK: this min/max transformation stuff is highly specialised and can have whatever magic path-building logic is required in preprocess_minmax_aggregates(). -- Thomas Munro https://enterprisedb.com
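The proposed flags can be sketched with a toy skip scan over a sorted list of tuples. This is Python for illustration only; SKIP_FIRST and SKIP_LAST are the names from the message above, while skip_scan and prefix_len are invented for the sketch:

```python
SKIP_FIRST = 0x1
SKIP_LAST = 0x2

def skip_scan(index_entries, prefix_len, mode):
    """index_entries is a sorted list of tuples; return one or two
    entries per distinct prefix of length prefix_len, per mode."""
    out = []
    i, n = 0, len(index_entries)
    while i < n:
        prefix = index_entries[i][:prefix_len]
        # Find the end of this prefix group (the real AM would skip
        # via the btree rather than walk entry by entry).
        j = i
        while j < n and index_entries[j][:prefix_len] == prefix:
            j += 1
        if mode & SKIP_FIRST:
            out.append(index_entries[i])
        # Avoid returning a single-row group twice when both are asked for.
        if (mode & SKIP_LAST) and not ((mode & SKIP_FIRST) and j - 1 == i):
            out.append(index_entries[j - 1])
        i = j
    return out

entries = [(1, 10), (1, 20), (2, 5), (2, 7), (2, 9)]
# MIN(j) GROUP BY i: first entry per distinct prefix
assert skip_scan(entries, 1, SKIP_FIRST) == [(1, 10), (2, 5)]
# MAX(j) GROUP BY i: last entry per distinct prefix
assert skip_scan(entries, 1, SKIP_LAST) == [(1, 20), (2, 9)]
# MIN and MAX together: both rows, a regular agg collapses them on top
assert skip_scan(entries, 1, SKIP_FIRST | SKIP_LAST) == \
    [(1, 10), (1, 20), (2, 5), (2, 9)]
```

Note how the SKIP_FIRST | SKIP_LAST output is no longer unique per prefix, matching the point that such a path would carry no UniqueKeys.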
Hi, On 7/9/19 10:14 PM, Thomas Munro wrote: >> Thomas, do you have any ideas for this? I can see that MySQL did the >> functionality in two change sets (base and function support), but like >> you said we shouldn't paint ourselves into a corner. > > I think amskip() could be augmented by later patches to take a > parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and > SKIP_FIRST | SKIP_LAST (meaning please give me both). What we have in > the current patch is SKIP_FIRST behaviour. Being able to choose > SKIP_FIRST or SKIP_LAST allows you to handle i, MAX(j) GROUP BY (i) > ORDER BY i (ie forward scan for the order, but wanting the highest key > for each distinct prefix), and being able to choose both allows for i, > MIN(j), MAX(j) GROUP BY i. Earlier I thought that this scheme that > allows you to ask for both might be incompatible with David's > suggestion of tracking UniqueKeys in paths, but now I see that it > isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys > and therefore require a regular agg on top. I suspect that's OK: this > min/max transformation stuff is highly specialised and can have > whatever magic path-building logic is required in > preprocess_minmax_aggregates(). > Ok, great. Thanks for your feedback! Best regards, Jesper
> Here is finally a new version of the patch, where all the mentioned issues
> (I've skipped all the pubs at PostgresLondon for that).
Thanks for the new patch! Really appreciate the work you're putting into it.
I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:
- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix
- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix
- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix
These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.
Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, because distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.
I'll do some more tests with the patch.
-Floris
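The four cursor/scan cases Floris lists above boil down to a small decision table. Here is a sketch in Python, purely for illustration (the real logic would sit in the btree code, and skip_strategy and its value names are invented):

```python
FORWARD, BACKWARD = "forward", "backward"

def skip_strategy(index_scan_dir, cursor_dir):
    """Return (which extreme of each prefix we want, skip direction)."""
    # A forward index scan wants the minimum element within each prefix,
    # a backward scan the maximum, regardless of cursor movement.
    extreme = "min" if index_scan_dir == FORWARD else "max"
    # Skip to the next prefix when the cursor moves with the scan,
    # to the previous prefix when it moves against it.
    skip_to = "next" if index_scan_dir == cursor_dir else "previous"
    return extreme, skip_to

# The four cases from the message above:
assert skip_strategy(FORWARD, FORWARD) == ("min", "next")
assert skip_strategy(FORWARD, BACKWARD) == ("min", "previous")
assert skip_strategy(BACKWARD, FORWARD) == ("max", "previous")
assert skip_strategy(BACKWARD, BACKWARD) == ("max", "next")
```

This lines up with David Rowley's later suggestion of passing both the plan's indexorderdir and the executor's es_direction down to the skip routine and letting it decide.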
> On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > The more I think about these UniqueKeys, the more I think they need to > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > should be equivalent to { y, x }, but with PathKeys, that's not the > case, since the order of each key matters. UniqueKeys equivalent > version of pathkeys_contained_in() would not care about the order of > individual keys, it would say things like, { a, b, c } is contained in > { b, a }, since if the path is unique on columns { b, a } then it must > also be unique on { a, b, c }. > On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > > David, are you thinking about something like the attached ? > > Some questions. > > * Do you see UniqueKey as a "complete" planner node ? > - I didn't update the nodes/*.c files for this yet > > * Is a UniqueKey with a list of EquivalenceClass best, or a list of > UniqueKey with a single EquivalenceClass Just for me to clarify, the idea is to replace PathKeys with a new concept of "UniqueKey" for skip scans, right? If I see it correctly, of course UniqueKeys { x, y } == UniqueKeys { y, x } from the result point of view, but the execution costs could be different due to different values distribution. In fact there are efforts to utilize this to produce a more optimal order [1], but with the UniqueKeys concept this information is lost. Obviously it's not something that could be immediately (or even ever) a problem for skip scan functionality, but I guess it's still worth pointing out. > On Wed, Jul 10, 2019 at 4:40 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > I verified that the backwards index scan is indeed functioning now. However, > I'm afraid it's not that simple, as I think the cursor case is broken now. Thanks for testing! Could you provide a test case to show what exactly is the problem? 
[1]: https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru
> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;

set enable_indexskipscan=1; -- setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;

FETCH FROM c;
FETCH BACKWARD FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

END;
> On Wed, Jul 10, 2019 at 4:52 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > > Thanks for testing! Could you provide a test case to show what exactly is the > > problem? > > create table a (a int, b int, c int); > insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks); > create index on a (a,b); > analyze a; > > set enable_indexskipscan=1; // setting this to 0 yields different results > set random_page_cost=1; > explain SELECT DISTINCT ON (a) a,b FROM a; > > BEGIN; > DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a; > > FETCH FROM c; > FETCH BACKWARD FROM c; > > FETCH 6 FROM c; > FETCH BACKWARD 6 FROM c; > > FETCH 6 FROM c; > FETCH BACKWARD 6 FROM c; > > END; Ok, give me a moment, I'll check.
> Thanks for testing! Could you provide a test case to show what exactly is the > problem? Note that in the case of a regular non-skip scan, this cursor backwards works because the Unique node on top does not support backwards scanning at all. Therefore, when creating the cursor, the actual plan contains a Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards' never reaches the index scan therefore, as it just fetches stuff from the materialize node. -Floris
> On Wed, Jul 10, 2019 at 5:00 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > > > Thanks for testing! Could you provide a test case to show what exactly is the > > problem? > > Note that in the case of a regular non-skip scan, this cursor backwards works > because the Unique node on top does not support backwards scanning at all. > Therefore, when creating the cursor, the actual plan actually contains a > Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards' > never reaches the the index scan therefore, as it just fetches stuff from the > materialize node. Yeah, I'm aware. The last time when I was busy with cursors I've managed to make it work as I wanted, so at that time I decided to keep it like that, even though without skip scan it wasn't doing backwards.
On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote: > I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases: > > - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix > > - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix > > - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix > > - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix > > These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues. Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST argument to amskip() that I theorised about, to support DISTINCT ON. Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to drop DISTINCT ON support, or use Materialize for some cases. My vote is: let's move forwards and add that parameter and make DISTINCT ON work. -- Thomas Munro https://enterprisedb.com
On Thu, 11 Jul 2019 at 14:50, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote: > > I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases: > > > > - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix > > > > - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix > > > > - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix > > > > - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix > > > > These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues. > > Oh, right! So actually we already need the extra SKIP_FIRST/SKIP_LAST > argument to amskip() that I theorised about, to support DISTINCT ON. > Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or > SKIP_MIN/SKIP_MAX. If we don't add support for that, we'll have to > drop DISTINCT ON support, or use Materialize for some cases. My vote > is: let's move forwards and add that parameter and make DISTINCT ON > work. Does it not just need to know the current direction of the cursor's scroll, then also the intended scan direction? 
For the general forward direction but for a backwards cursor scroll, we'd return the lowest value for each distinct prefix, but for the general backwards direction (DESC case) we'd return the highest value for each distinct prefix. Looking at IndexNext() the cursor direction seems to be estate->es_direction and the general scan direction is indicated by the plan's indexorderdir. Can't we just pass both of those to index_skip() to have it decide what to do? If we also pass in indexorderdir then index_skip() should know if it's to return the highest or lowest value, right? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> For the general forward direction but for a backwards cursor scroll,
> general backwards direction (DESC case) we'd return the highest value
> for each distinct prefix. Looking at IndexNext() the cursor direction
> seems to be estate->es_direction and the general scan direction is
> indicated by the plan's indexorderdir. Can't we just pass both of
> those to index_skip() to have it decide what to do? If we also pass in
> indexorderdir then index_skip() should know if it's to return the
> highest or lowest value, right?
SELECT DISTINCT ON (a) * FROM a WHERE b =2
On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote: > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c) > Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping): > 1,1,1 > 1,1,2 > 1,2,2 > 1,2,3 > 2,2,1 > 2,2,3 > 3,1,1 > 3,1,2 > 3,2,2 > 3,2,3 > > Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end. > > If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2). > Perhaps anyone can think of an easier way to implement it? One option is just don't implement it and just change ExecSupportsBackwardScan() so that it returns false for skip index scans, or perhaps at least implement an index am method to allow the planner to be able to determine if the index implementation supports it... amcanskipbackward -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 11 Jul 2019 at 02:48, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > > > The more I think about these UniqueKeys, the more I think they need to > > be a separate concept to PathKeys. For example, UniqueKeys: { x, y } > > should be equivalent to { y, x }, but with PathKeys, that's not the > > case, since the order of each key matters. UniqueKeys equivalent > > version of pathkeys_contained_in() would not care about the order of > > individual keys, it would say things like, { a, b, c } is contained in > > { b, a }, since if the path is unique on columns { b, a } then it must > > also be unique on { a, b, c }. > > > On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > > > > David, are you thinking about something like the attached ? > > > > Some questions. > > > > * Do you see UniqueKey as a "complete" planner node ? > > - I didn't update the nodes/*.c files for this yet > > > > * Is a UniqueKey with a list of EquivalenceClass best, or a list of > > UniqueKey with a single EquivalenceClass > > Just for me to clarify, the idea is to replace PathKeys with a new concept of > "UniqueKey" for skip scans, right? If I see it correctly, of course > > UniqueKeys { x, y } == UniqueKeys { y, x } > > from the result point of view, but the execution costs could be different due > to different values distribution. In fact there are efforts to utilize this to > produce more optimal order [1], but with UniqueKeys concept this information is > lost. Obviously it's not something that could be immediately (or even never) a > problem for skip scan functionality, but I guess it's still worth to point it > out. The UniqueKeys idea is quite separate from pathkeys. Currently, a Path can have a List of PathKeys which define the order that the tuples will be read from the Plan node that's created from that Path. 
The idea with UniqueKeys is that a Path can also have a non-empty List of UniqueKeys to define that there will be no more than 1 row with the same value for the Paths UniqueKey column/exprs. As of now, if you look at standard_qp_callback() sets root->query_pathkeys, the idea here would be that the callback would also set a new List field named "query_uniquekeys" based on the group_pathkeys when non-empty and !root->query->hasAggs, or by using the distinct clause if it's non-empty. Then in build_index_paths() around the call to match_pathkeys_to_index() we'll probably also want to check if the index can support UniqueKeys that would suit the query_uniquekeys that were set during standard_qp_callback(). As for setting query_uniquekeys in standard_qp_callback(), this should be simply a matter of looping over either group_pathkeys or distinct_pathkeys and grabbing the pk_eclass from each key and making a canonical UniqueKey from that. To have these canonical you'll need to have a new field in PlannerInfo named canon_uniquekeys which will do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll need an equivalent of make_canonical_pathkey() in uniquekey.c With this implementation, the code that the patch adds in create_distinct_paths() can mostly disappear. You'd only need to look for a path that uniquekeys_contained_in() matches the root->query_uniquekeys and then just leave it to set_cheapest(distinct_rel); to pick the cheapest path. It would be wasted effort to create paths with UniqueKeys if there's multiple non-dead base rels in the query as the final rel in create_distinct_paths would be a join rel, so it might be worth checking that before creating such index paths in build_index_paths(). However, down the line, we could carry the uniquekeys forward into the join paths if the join does not duplicate rows from the other side of the join... That's future stuff though, not for this patch, I don't think. 
I think a UniqueKey can just be a struct similar to PathKey, e.g. be located in pathnodes.h around where PathKey is defined. Likely we'll need a uniquekeys.c file that has the equivalent to pathkeys_contained_in() ... uniquekeys_contained_in(), which returns true if arg1 is a superset of arg2 rather than checking for one set being a prefix of another. As you mention above: UniqueKeys { x, y } == UniqueKeys { y, x }. That superset check could perhaps be optimized by sorting UniqueKey lists in memory address order, that'll save having a nested loop, but likely that's not going to be required for a first cut version. This would work since you'd want UniqueKeys to be canonical the same as PathKeys are (Notice that compare_pathkeys() only checks memory addresses of pathkeys and not equals()). I think the UniqueKey struct would only need to contain an EquivalenceClass field. I think all the other stuff that's in PathKey is irrelevant to UniqueKey. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
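The proposed uniquekeys_contained_in() semantics can be sketched as a plain set-containment check. This is Python for illustration only, with string values standing in for canonical EquivalenceClass pointers (the real version would live in the new uniquekeys.c and could compare pointers, as compare_pathkeys() does):

```python
def uniquekeys_contained_in(required_keys, path_keys):
    """True if a path unique on path_keys satisfies required_keys.

    If the path is unique on { b, a }, it is also unique on any superset
    such as { a, b, c }, so unlike pathkeys_contained_in() we check only
    set containment, never key order or prefixes.
    """
    return set(path_keys) <= set(required_keys)

# { a, b, c } is "contained in" { b, a } in the sense used above:
assert uniquekeys_contained_in({"a", "b", "c"}, {"b", "a"})
# Order of individual keys never matters:
assert uniquekeys_contained_in({"x", "y"}, {"y", "x"})
# But a path unique only on { a, d } cannot satisfy { a, b, c }:
assert not uniquekeys_contained_in({"a", "b", "c"}, {"a", "d"})
```

With canonical UniqueKeys, the nested-loop membership test hidden inside the set operations here is what the suggested memory-address sort would optimize away.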
Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys. Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> [... rest of the review, quoted in full above; snipped ...]

Thanks for the feedback! I'll work on these changes for the next uniquekey patch.

Best regards,
Jesper
Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys. Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> [... rest of the review, quoted in full above; snipped ...]

Here is a patch more in that direction. Some questions:

1) Do we really need the UniqueKey struct? If it only contains the EquivalenceClass field then we could just have a list of those instead. That would make the patch simpler.

2) Do we need both canon_uniquekeys and query_uniquekeys? Currently the patch only uses canon_uniquekeys, because we attach the list directly to the Path node.

I'll clean the patch up based on your feedback, and then start to rebase v21 on it. Thanks in advance!
Best regards, Jesper
Attachment
> On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now. I
> think having just the 'scan direction' in the btree code is not enough to get
> this working properly, because we need to know whether we want the minimum or
> maximum element of a certain prefix. There are basically four cases:
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> These cases make it rather complicated unfortunately. They do somewhat tie in
> with the previous discussion on this thread about being able to skip to the
> min or max value. If we ever want to support a sort of minmax scan, we'll
> encounter the same issues.

Yes, these four cases are indeed a very good point. I've prepared a new version of the patch where these cases, plus an index condition and the handling of situations where it eliminates one or more unique elements, are addressed. It seems to fix the issues and also works for the hypothetical examples you've mentioned above, but of course it looks pretty complicated and I need to polish it a bit before posting.
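The four combinations above decompose into two independent decisions: the index scan direction picks min vs. max within a prefix, and whether the cursor moves with or against the index order picks the skip direction. A small illustrative C sketch (invented names, not the patch's code) encoding that table:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for ScanDirection-style values */
typedef enum { FORWARD = 1, BACKWARD = -1 } Direction;

typedef struct SkipDecision
{
    bool      want_minimum; /* min of prefix (forward scan) or max (backward) */
    Direction skip_to;      /* jump to the next (FORWARD) or previous prefix */
} SkipDecision;

/*
 * decide_skip -- condense Floris's four cases:
 *   - the index scan direction alone decides min vs. max within a prefix;
 *   - the skip direction is "forward" when cursor and index directions
 *     agree, "backward" when the cursor moves against the index order.
 */
static SkipDecision
decide_skip(Direction index_dir, Direction cursor_dir)
{
    SkipDecision d;

    d.want_minimum = (index_dir == FORWARD);
    d.skip_to = (index_dir == cursor_dir) ? FORWARD : BACKWARD;
    return d;
}
```

This is only a restatement of the table, but it shows why two flags (rather than one "direction") are needed once scrollable cursors enter the picture.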
> On Thu, Jul 11, 2019 at 12:13 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote: > > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c) > > Data (imagine every tuple here actually occurs 10.000 times in the index to > > see the benefit of skipping): > > 1,1,1 > > 1,1,2 > > 1,2,2 > > 1,2,3 > > 2,2,1 > > 2,2,3 > > 3,1,1 > > 3,1,2 > > 3,2,2 > > 3,2,3 > > > > Creating a cursor on this query and then moving forward, you should get > > (1,1,2), (3,1,2). In the current implementation of the patch, after > > bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves > > forward one-by-one until it finds a match. This match only comes at (3,1,2) > > however. Then it skips to the end. > > > > If you move the cursor backwards from the end of the cursor, you should > > still get (3,1,2) (1,1,2). A possible implementation would start at the end > > and do a skip to the beginning of the prefix: (3,1,1). Then it needs to > > move forward one-by-one in order to find the first matching (minimum) item > > (3,1,2). When it finds it, it needs to skip backwards to the beginning of > > prefix 2 (2,2,1). It needs to move forwards to find the minimum element, > > but should stop as soon as it detects that the prefix doesn't match anymore > > (because there is no match for prefix 2, it will move all the way from > > (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of > > prefix 1: (1,1,1) and scan forward to find (1,1,2). > > Perhaps anyone can think of an easier way to implement it? > > One option is just don't implement it and just change > ExecSupportsBackwardScan() so that it returns false for skip index > scans, or perhaps at least implement an index am method to allow the > planner to be able to determine if the index implementation supports > it... 
amcanskipbackward

Yep, it was discussed a few times in this thread, and after we've discovered (thanks to Floris) so many issues I was also one step away from implementing this idea. But at the same time, as Thomas correctly noticed, our implementation needs to be extensible to handle future use cases, and this particular cursor juggling already seems like a pretty good example of such a "future use case". So I hope that by dealing with it we can also figure out what needs to be extensible.

> On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> Here is a patch more in that direction.
>
> Some questions:
>
> 1) Do we really need the UniqueKey struct ? If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.
>
> 2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.
>
> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Thanks! I'll also take a look as soon as I'm finished with the last updates.
On Wed, 17 Jul 2019 at 04:53, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Here is a patch more in that direction.

Thanks. I've just had a look over this and it is roughly what I have in mind. Here are the comments I noted down during the review:

cost_index:

I know you've not finished here, but I think it'll need to adjust tuples_fetched somehow to account for estimate_num_groups() on the Path's unique keys. Any Eclass with ec_has_const = true does not need to be part of the estimate there, as there can only be at most one value for these.

For example, in a query such as:

SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;

you only need to perform estimate_num_groups() on "y".

I'm really not quite sure on what exactly will be required from amcostestimate() here. The cost of the skip scan is not the same as the normal scan. So either that API needs to be adjusted to allow the caller to mention that we want skip scans estimated, or there needs to be another callback.

build_index_paths:

I don't quite see where you're checking if the query's unique_keys match what unique keys can be produced by the index. This is done for pathkeys with:

pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
                            !found_lower_saop_clause &&
                            has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
{
    index_pathkeys = build_index_pathkeys(root, index,
                                          ForwardScanDirection);
    useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                index_pathkeys);
    orderbyclauses = NIL;
    orderbyclausecols = NIL;
}

Here has_useful_pathkeys() checks if the query requires some ordering. For unique keys you'll want to do the same. You'll have set the query's unique key requirements in standard_qp_callback().

I think basically build_index_paths() should be building index paths with unique keys, for all indexes that can support the query's unique keys.
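The cost_index point above - a column bound to a constant contributes exactly one group, so only the remaining unique-key columns should be fed to estimate_num_groups() - can be sketched as a filtering step. This is a toy sketch with invented structs, not the planner's real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in: only the field relevant to this point */
typedef struct EquivalenceClass { bool ec_has_const; } EquivalenceClass;

/*
 * filter_group_exprs -- copy into "out" only those unique-key eclasses
 * that actually need a group-count estimate.  For the example
 *   SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
 * x's eclass has ec_has_const = true (at most one value), so only y
 * would survive and be passed to estimate_num_groups().
 * Returns the number of surviving eclasses.
 */
static int
filter_group_exprs(EquivalenceClass **keys, int nkeys,
                   EquivalenceClass **out)
{
    int n = 0;

    for (int i = 0; i < nkeys; i++)
    {
        if (!keys[i]->ec_has_const)
            out[n++] = keys[i];
    }
    return n;
}
```

The surviving list is what tuples_fetched would then be scaled by, however the amcostestimate() API question is resolved.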
I'm just a bit uncertain if we need to create both a normal index path and another path for the same index with unique keys. Perhaps we can figure that out down the line somewhere. Maybe it's best to build both path types for now, when possible, and we can consider later if we can skip the non-uniquekey paths. Likely that would require a big XXX comment to explain we need to review that before the code makes it into core.

get_uniquekeys_for_index:

I think this needs to follow more the lead from build_index_pathkeys. Basically, ask the index what its pathkeys are.

standard_qp_callback:

build_group_uniquekeys & build_distinct_uniquekeys could likely be one function that takes a list of SortGroupClause. You just either pass the groupClause or distinctClause in. Pretty much the UniqueKey version of make_pathkeys_for_sortclauses().

> Some questions:
>
> 1) Do we really need the UniqueKey struct ? If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.

I dunno about that. I understand it looks a bit pointless due to just having one field, but perhaps we can worry about that later. If we choose to ditch it and replace it with just an EquivalenceClass then we can do that later.

> 2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.

canon_uniquekeys should store at most one UniqueKey per EquivalenceClass. The reason for this is fast comparison: we can compare memory addresses rather than checking individual fields are equal. Now... yeah, it's true that there is only one field so far and we could just check the pointers are equal on the EquivalenceClasses, but I think maybe this is in the same boat as #1.
Let's do it for now so we're sticking as close to the guidelines laid out by PathKeys and once it's all working and plugged into skip scans then we can decide if it needs a simplification pass over the code. > I'll clean the patch up based on your feedback, and then start to rebase > v21 on it. Cool. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 7/22/19 1:44 AM, David Rowley wrote: > Here are the comments I noted down during the review: > > cost_index: > > I know you've not finished here, but I think it'll need to adjust > tuples_fetched somehow to account for estimate_num_groups() on the > Path's unique keys. Any Eclass with an ec_has_const = true does not > need to be part of the estimate there as there can only be at most one > value for these. > > For example, in a query such as: > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y; > > you only need to perform estimate_num_groups() on "y". > > I'm really not quite sure on what exactly will be required from > amcostestimate() here. The cost of the skip scan is not the same as > the normal scan. So other that API needs adjusted to allow the caller > to mention that we want skip scans estimated, or there needs to be > another callback. > I think this part will become more clear once the index skip scan patch is rebased, as we got the uniquekeys field on the Path, and the indexskipprefixy info on the IndexPath node. > build_index_paths: > > I don't quite see where you're checking if the query's unique_keys > match what unique keys can be produced by the index. This is done for > pathkeys with: > > pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN && > !found_lower_saop_clause && > has_useful_pathkeys(root, rel)); > index_is_ordered = (index->sortopfamily != NULL); > if (index_is_ordered && pathkeys_possibly_useful) > { > index_pathkeys = build_index_pathkeys(root, index, > ForwardScanDirection); > useful_pathkeys = truncate_useless_pathkeys(root, rel, > index_pathkeys); > orderbyclauses = NIL; > orderbyclausecols = NIL; > } > > Here has_useful_pathkeys() checks if the query requires some ordering. > For unique keys you'll want to do the same. You'll have set the > query's unique key requirements in standard_qp_callback(). 
> > I think basically build_index_paths() should be building index paths > with unique keys, for all indexes that can support the query's unique > keys. I'm just a bit uncertain if we need to create both a normal > index path and another path for the same index with unique keys. > Perhaps we can figure that out down the line somewhere. Maybe it's > best to build path types for now, when possible, and we can consider > later if we can skip the non-uniquekey paths. Likely that would > require a big XXX comment to explain we need to review that before the > code makes it into core. > > get_uniquekeys_for_index: > > I think this needs to follow more the lead from build_index_pathkeys. > Basically, ask the index what it's pathkeys are. > > standard_qp_callback: > > build_group_uniquekeys & build_distinct_uniquekeys could likely be one > function that takes a list of SortGroupClause. You just either pass > the groupClause or distinctClause in. Pretty much the UniqueKey > version of make_pathkeys_for_sortclauses(). > Yeah, I'll move this part of the index skip scan patch to the unique key patch. Thanks for your feedback ! Best regards, Jesper
> On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> > On 7/22/19 1:44 AM, David Rowley wrote:
> > > Here are the comments I noted down during the review:
> > > [... review comments quoted in full above; snipped ...]
>
> I think this part will become more clear once the index skip scan patch
> is rebased, as we got the uniquekeys field on the Path, and the
> indexskipprefixy info on the IndexPath node.

Here is what I came up with to address the problems mentioned above in this thread. It passes tests, but I haven't tested it more thoroughly yet (e.g. it occurred to me that `_bt_read_closest` probably wouldn't work if the next key that passes the index condition is a few pages away - I'll try to tackle it soon). Just another small step forward, but I hope it's enough to rebase the planner changes on top of it.

Also I've added a few tags, mostly to mention the reviewers' contributions.
Attachment
Hello.

At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com>
> > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> > [... earlier review discussion, quoted in full above; snipped ...]
>
> Here is what I came up with to address the problems, mentioned above in this
> thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> that passes an index condition is few pages away - I'll try to tackle it soon).
> Just another small step forward, but I hope it's enough to rebase on top of it
> planner changes.
>
> Also I've added few tags, mostly to mention reviewers contribution.

I have some comments.
+ * The order of columns in the index should be the same, as for
+ * unique distincs pathkeys, otherwise we cannot use _bt_search
+ * in the skip implementation - this can lead to a missing
+ * records.

It seems that it is enough that the distinct pathkeys are contained in the index pathkeys. If that's right, it is almost checked in existing code:

> if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))

It is perfect when needed_pathkeys is distinct_pathkeys. So an additional check is required if and only if that is not the case.

> if (enable_indexskipscan &&
>     IsA(path, IndexPath) &&
>     ((IndexPath *) path)->indexinfo->amcanskip &&
>     (path->pathtype == T_IndexOnlyScan ||
>      path->pathtype == T_IndexScan) &&
>     (needed_pathkeys == root->distinct_pathkeys ||
>      pathkeys_contained_in(root->distinct_pathkeys,
>                            path->pathkeys)))

path->pathtype is always one of T_IndexOnlyScan or T_IndexScan, so the check against them isn't needed. If you have concerns about that, we can add it as an Assert().

I feel uncomfortable looking into indexinfo there. Couldn't we use indexskipprefix == -1 to signal !amcanskip from create_index_path?

+ /*
+  * XXX: In case of index scan quals evaluation happens after
+  * ExecScanFetch, which means skip results could be fitered out
+  */

Why can't we use a skip scan path when having a filter condition? If something bad happens, the reason must be written here instead of what we do.

+ if (path->pathtype == T_IndexScan &&
+     parse->jointree != NULL &&
+     parse->jointree->quals != NULL &&
+     ((List *)parse->jointree->quals)->length != 0)

It's better to use list_length() instead of peeping inside. It handles the NULL case as well. (The structure has recently changed, but .length has not, though.)

+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason for that?
+ * If advancing direction is different from index direction, we must
+ * skip right away, but _bt_skip requires a starting point.
+ */
+ if (direction * indexonlyscan->indexorderdir < 0 &&
+     !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan direction (fwd/bwd) of the index scan, which is already compensated by indexorderdir. Thus the condition means we do that when the logical ordering (ASC/DESC) is DESC. (Though I'm not sure what "index direction" exactly means...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry, there's a too-hard-to-read part.

At Thu, 25 Jul 2019 20:17:37 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190725.201737.192223037.horikyota.ntt@gmail.com>
> [... previous message, quoted in full above; snipped ...]
>
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:
>
> > if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
>
> It is perfect when needed_pathkeys is distinct_pathkeys. So
> additional check is required if and only if it is not the case.

So I think the following change will work.

- + /* Consider index skip scan as well */
- + if (enable_indexskipscan &&
- +     IsA(path, IndexPath) &&
- +     ((IndexPath *) path)->indexinfo->amcanskip &&
- +     (path->pathtype == T_IndexOnlyScan ||
- +      path->pathtype == T_IndexScan) &&
- +     root->distinct_pathkeys != NIL)
+ + if (enable_indexskipscan &&
+ +     IsA(path, IndexPath) &&
+ +     ((IndexPath *) path)->indexskipprefix >= 0 &&
+ +     (needed_pathkeys == root->distinct_pathkeys ||
+ +      pathkeys_contained_in(root->distinct_pathkeys,
+ +                            path->pathkeys)))

Additional comments on the condition above are:

> [... the remaining comments from the previous message, repeated unchanged; snipped ...]

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I have some comments.

Thank you for the review!

> + * The order of columns in the index should be the same, as for
> + * unique distincs pathkeys, otherwise we cannot use _bt_search
> + * in the skip implementation - this can lead to a missing
> + * records.
>
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:

Looks like you're right. After looking more closely, there seems to be an issue in the original implementation, where we use a wrong prefix_size in such cases. Without this problem, this condition is indeed enough.

> > if (enable_indexskipscan &&
> >     IsA(path, IndexPath) &&
> >     ((IndexPath *) path)->indexinfo->amcanskip &&
> >     (path->pathtype == T_IndexOnlyScan ||
> >      path->pathtype == T_IndexScan) &&
> >     (needed_pathkeys == root->distinct_pathkeys ||
> >      pathkeys_contained_in(root->distinct_pathkeys,
> >                            path->pathkeys)))
>
> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
>
> + if (path->pathtype == T_IndexScan &&
> +     parse->jointree != NULL &&
> +     parse->jointree->quals != NULL &&
> +     ((List *)parse->jointree->quals)->length != 0)
>
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The structure has recently
> changed but .length is not, though.)

Yeah, will change both (hopefully soon).

> + /*
> +  * XXX: In case of index scan quals evaluation happens after
> +  * ExecScanFetch, which means skip results could be fitered out
> +  */
>
> Why can't we use skipscan path if having filter condition? If
> something bad happens, the reason must be written here instead of
> what we do.

Sorry, looks like I've failed to express this more clearly in the commentary.
The point is that when an index scan (not an index only scan) has some conditions, their evaluation happens after skipping, and I don't see any way to apply the skip correctly that wouldn't be too invasive. > > + * If advancing direction is different from index direction, we must > + * skip right away, but _bt_skip requires a starting point. > > It doesn't seem needed to me. Could you elaborate on the reason > for that? This is needed for e.g. a backward scan with a cursor without an index condition. E.g. if we have: 1 1 2 2 3 3 1 2 3 4 5 6 and do DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b; FETCH ALL FROM c; we should get 1 2 3 1 3 5 When afterwards we do FETCH BACKWARD ALL FROM c; we should get 3 2 1 5 3 1 If we use _bt_next the first time without _bt_skip, the first pair would be 3 6 (the first from the end of the tuples, not from the end of the cursor). > + * If advancing direction is different from index direction, we must > + * skip right away, but _bt_skip requires a starting point. > + */ > + if (direction * indexonlyscan->indexorderdir < 0 && > + !node->ioss_FirstTupleEmitted) > > I'm confused by this. "direction" there is the physical scan > direction (fwd/bwd) of index scan, which is already compensated > by indexorderdir. Thus the condition means we do that when > logical ordering (ASC/DESC) is DESC. (Though I'm not sure what > "index direction" exactly means...) I'm not sure I follow, what do you mean by compensated? In general you're right, as David Rowley mentioned above, indexorderdir is the general scan direction, and direction is the flipped estate->es_direction, which is the cursor direction. The goal of this condition is to catch when those two are different, and we need to advance and read in different directions.
Hello. On 2019/07/29 4:17, Dmitry Dolgov wrote: >> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Yeah, will change both (hopefully soon) Thanks. >> + /* >> + * XXX: In case of index scan quals evaluation happens after >> + * ExecScanFetch, which means skip results could be filtered out >> + */ >> >> Why can't we use a skipscan path if there is a filter condition? If >> something bad happens, the reason must be written here instead of >> what we do. > > Sorry, looks like I've failed to express this more clearly in the > commentary. The point is that when an index scan (not an index > only scan) has some conditions, their evaluation happens after > skipping, and I don't see any way to apply the skip correctly > that wouldn't be too invasive. Yeah, your explanation was perfect for me. What I failed to understand was what is expected to be done in that case. I reconsidered and understood that: For example, for the following query: select distinct on (a, b) a, b, c from t where c < 100; skip scan returns one tuple for each distinct set of (a, b) with an arbitrary one of c. If the chosen c doesn't match the qual and there is any c that matches the qual, we miss that tuple. If this is correct, an explanation like the above might help. >> + * If advancing direction is different from index direction, we must >> + * skip right away, but _bt_skip requires a starting point. >> >> It doesn't seem needed to me. Could you elaborate on the reason >> for that? > > This is needed for e.g. a backward scan with a cursor without an index condition. > E.g.
if we have: > > 1 1 2 2 3 3 > 1 2 3 4 5 6 > > and do > > DECLARE c SCROLL CURSOR FOR > SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b; > > FETCH ALL FROM c; > > we should get > > 1 2 3 > 1 3 5 > > When afterwards we do > > FETCH BACKWARD ALL FROM c; > > we should get > > 3 2 1 > 5 3 1 > > > If we use _bt_next the first time without _bt_skip, the first pair would be > 3 6 (the first from the end of the tuples, not from the end of the cursor). Thanks for the explanation. Sorry, I somehow thought that was right. You're right. >> + * If advancing direction is different from index direction, we must >> + * skip right away, but _bt_skip requires a starting point. >> + */ >> + if (direction * indexonlyscan->indexorderdir < 0 && >> + !node->ioss_FirstTupleEmitted) >> >> I'm confused by this. "direction" there is the physical scan >> direction (fwd/bwd) of index scan, which is already compensated >> by indexorderdir. Thus the condition means we do that when >> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what >> "index direction" exactly means...) > > I'm not sure I follow, what do you mean by compensated? In general you're I meant that the "direction" is already changed to physical order at that point. > right, as David Rowley mentioned above, indexorderdir is a general scan > direction, and direction is flipped estate->es_direction, which is a cursor > direction. The goal of this condition is to catch when those two are different, > and we need to advance and read in different directions. Mmm. Sorry and thank you for the explanation. I was stupid. You're right. I perhaps mistook indexorderdir's meaning. Maybe something like the following will work *for me*:p | When we are fetching a cursor in backward direction, return the | tuples that forward fetching should have returned. In other | words, we return the last scanned tuple in a DISTINCT set. Skip | to that tuple before returning the first tuple. # Of course, I need someone to correct this! regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > I feel uncomfortable looking into indexinfo there. Couldn't we > use indexskipprefix == -1 to signal !amcanskip from > create_index_path? Looks like it's not that straightforward to do this only in create_index_path, since to make this decision we need to have both parts, indexinfo and distinct keys. > Yeah, your explanation was perfect for me. What I failed to > understand was what is expected to be done in that case. I > reconsidered and understood that: > > For example, for the following query: > > select distinct on (a, b) a, b, c from t where c < 100; > > skip scan returns one tuple for each distinct set of (a, b) with > an arbitrary one of c. If the chosen c doesn't match the qual and > there is any c that matches the qual, we miss that tuple. > > If this is correct, an explanation like the above might help. Yes, that's correct, I've added this into the comments. > Maybe something like the following will work *for me*:p > > | When we are fetching a cursor in backward direction, return the > | tuples that forward fetching should have returned. In other > | words, we return the last scanned tuple in a DISTINCT set. Skip > | to that tuple before returning the first tuple. And this too (slightly rewritten:). We will soon post the new version of the patch with updates about UniqueKey from Jesper.
Hi, On 8/2/19 8:14 AM, Dmitry Dolgov wrote: > And this too (slightly rewritten:). We will soon post the new version of patch > with updates about UniqueKey from Jesper. > Yes. We decided to send this now, although there is still feedback from David that needs to be considered/acted on. The patches can be reviewed independently, but we will send them as a set from now on. Development of UniqueKey will be kept separate, though [1]. Note that while UniqueKey can form the foundation of optimizations for GROUP BY queries, it isn't the focus of this patch series. Contributions are very welcome, of course. [1] https://github.com/jesperpedersen/postgres/tree/uniquekey Best regards, Jesper
Attachment
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.
In my previous review I mentioned that queries of the form:
select distinct on(a) a,b from a where a=1;
do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c
The 'prevOffnum' is the offset number of the heap tid, not the offset number of the index tuple, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently, the patch just reads the closest item and then doesn't consider this page at all anymore if the first tuple skipped to turns out not to be visible. Consider the following sql to illustrate:
-Floris
> On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee <florisvannee@optiver.com> > wrote: > > The root->distinct_pathkeys is already filtered for redundant keys, so column > 'a' is not in there anymore. Still, it'd be much faster to use the skip scan > here, because a regular scan will scan all entries with a=1, even though > we're really only interested in the first one. In previous versions, this > would be fixed by changing the check in planner.c to use > root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things > change a bit now that the patch is rebased on the unique-keys patch. Would it > be valid to change this check to root->query_uniquekeys != NIL to consider > skip scans also for this query? [including a commentary from Jesper] On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: Yes, the check should be for that. However, the query in question doesn't have any query_pathkeys, and hence no query_uniquekeys in standard_qp_callback(), so it isn't supported. Your scenario is covered by one of the test cases in case the functionality is supported. But I think that is outside the scope of the current patch. > However, currently, the patch just reads the closest and then doesn't > consider this page at all anymore, if the first tuple skipped to turns out to > be not visible. Consider the following sql to illustrate: For the record, the purpose of `_bt_read_closest` is not so much to reduce the amount of data we read from a page, but more to correctly handle those situations we were discussing before with reading forward/backward in cursors, since for that in some cases we need to remember previous values for stepping to the next. I've limited the number of items fetched in this function just because I was misled by having a check for dead tuples in `_bt_skip`.
Of course we can modify it to read a whole page and leave the visibility check for the code after `index_getnext_tid` (although in case we know that all tuples on this page are visible I guess it's not strictly necessary, but we still get an improvement from the skipping itself).
> doesn't have any query_pathkeys, and hence query_uniquekeys in
> standard_qp_callback(), so it isn't supported
> Your scenario is covered by one of the test cases in case the
> functionality is supported. But I think that is outside the scope of
> the current patch.
Ah alright, thanks. That makes it clear why it doesn't work.
From a user point of view I think it's rather strange that
SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2
would give a fast skip scan, even though the more likely query that someone would write
SELECT DISTINCT ON (a) a,b FROM a WHERE a=2
would not.
It is something we could leave up to the next patch, though.
Something else I noticed, which I'm just writing here for awareness; I don't think it's that pressing at the moment and it can be left to another patch. When there are multiple indices on a table, the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, e.g.
With an index on (a,b) and (a,c,b):
postgres=# explain select distinct on (a) a,c,b FROM a;
QUERY PLAN
--------------------------------------------------------------------
Index Scan using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=12)
Skip scan mode: true
(2 rows)
-> This could be an index only scan with the (a,c,b) index.
> For the record, the purpose of `_bt_read_closest` is not so much to reduce
> the amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited the number of items fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`. Of course
> we can modify it to read a whole page and leave the visibility check for the code
> after `index_getnext_tid` (although in case we know that all tuples on this
> page are visible I guess it's not strictly necessary, but we still get
> an improvement from the skipping itself).
I understand and I agree - the primary purpose of choosing this function was to make it work correctly. I don't think using the optimization of partially reading a page is something for this patch. My point was, however, that if this optimization were allowed in a future patch, it would have great performance benefits.
To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next is also not trivial. I think the current implementation also has a problem when the item that is skipped to is the first item on the page. Eg. (this depends on page size)
> On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Of course we can modify it to read a whole page and leave the visibility check > for the code after `index_getnext_tid` (although in case we know that all > tuples on this page are visible I guess it's not strictly necessary, but we > still get an improvement from the skipping itself). Sorry for the long delay. Here is more or less what I had in mind. After changing read_closest to read the whole page I couldn't resist just merging it into readpage itself, since it's basically the same. It could raise questions about performance and how intrusive this patch is, but I hope it's not that much of a problem (in the worst case we can split it back). I've also added a few tests for the issue you've mentioned. Thanks again, I appreciate how much effort you put into reviewing!
Attachment
> read_closest to read the whole page I couldn't resist just merging it into
> readpage itself, since it's basically the same. It could raise questions about
> performance and how intrusive this patch is, but I hope it's not that much of a
> problem (in the worst case we can split it back). I've also added a few tests for
> the issue you've mentioned. Thanks again, I appreciate how much effort you
> put into reviewing!
> On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > I'm afraid I did manage to find another incorrect query result though Yes, it's an example of what I was mentioning before, that the current modified implementation of `_bt_readpage` wouldn't work well in case of going between pages. So far it seems that the only problem we can have is when the previous and next items are located on different pages. I've checked how this issue can be avoided, and I hope I will post a new version relatively soon.
Surely it isn't right to add members prefixed with "ioss_" to struct IndexScanState. I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make more sense to implement index_skip() to return the first tuple if the scan is just starting? (I know little about executor, apologies if this is a stupid question.) It would be good to get more knowledgeable people to review this patch. It's clearly something we want, yet it's been there for a very long time. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Mon, Sep 2, 2019 at 3:28 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee@optiver.com> wrote: > > > > I'm afraid I did manage to find another incorrect query result though > > Yes, it's an example of what I was mentioning before, that the current modified > implementation of `_bt_readpage` wouldn't work well in case of going between > pages. So far it seems that the only problem we can have is when the previous and > next items are located on different pages. I've checked how this issue can be > avoided, I hope I will post a new version relatively soon. Here is the version in which stepping between the pages works better. It seems sufficient to fix the case you've mentioned before, but for that we need to propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I can't say I like this solution. I have an idea that maybe it would be simpler to teach the code after index_skip to not do `_bt_next` right after one skip happened before. It should immediately eliminate several hacks from index skip itself, so I'll try to pursue this idea. > On Wed, Sep 4, 2019 at 10:45 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: Thank you for checking it out! > Surely it isn't right to add members prefixed with "ioss_" to > struct IndexScanState. Yeah, sorry. I've incorporated IndexScan support originally only to show that it's possible (with some limitations), but after that forgot to clean up. Now those fields are renamed. > I'm surprised about this "FirstTupleEmitted" business. Wouldn't it make > more sense to implement index_skip() to return the first tuple if the > scan is just starting? (I know little about executor, apologies if this > is a stupid question.) I'm not entirely sure which part exactly you mean. Now the first tuple is returned by `_bt_first`; how would it help if index_skip returned it? > It would be good to get more knowledgeable people to review this patch.
> It's clearly something we want, yet it's been there for a very long > time. Sure, that would be nice.
Attachment
On 2019-Sep-05, Dmitry Dolgov wrote: > Here is the version in which stepping between the pages works better. It seems > sufficient to fix the case you've mentioned before, but for that we need to > propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I > can't say I like this solution. I have an idea that maybe it would be simpler > to teach the code after index_skip to not do `_bt_next` right after one skip > happened before. It should immediately eliminate several hacks from index skip > itself, so I'll try to pursue this idea. Cool. I think multiplying two ScanDirections to watch for a negative result is pretty ugly: /* * If advancing direction is different from index direction, we must * skip right away, but _bt_skip requires a starting point. */ if (direction * indexscan->indexorderdir < 0 && !node->iss_FirstTupleEmitted) Surely there's a better way to code that? I think "scanstart" needs more documentation, both in the SGML docs as well as the code comments surrounding it. Please disregard my earlier comment about FirstTupleEmitted. I was thinking that index_skip would itself emit a tuple (ie. call some "getnext" internally) rather than just repositioning. There might still be some more convenient way to represent this, but I have no immediate advice. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Thu, Sep 5, 2019 at 9:41 PM Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> wrote: > > On 2019-Sep-05, Dmitry Dolgov wrote: > > > Here is the version in which stepping between the pages works better. It seems > > sufficient to fix the case you've mentioned before, but for that we need to > > propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I > > can't say I like this solution. I have an idea that maybe it would be simpler > > to teach the code after index_skip to not do `_bt_next` right after one skip > > happened before. It should immediately eliminate several hacks from index skip > > itself, so I'll try to pursue this idea. > > Cool. Here it is. Since now the code after index_skip knows whether to do index_getnext or not, it's possible to use unmodified `_bt_readpage` / `_bt_steppage`. To achieve that there is a flag that indicates whether or not we were skipping to the current item (I guess it's possible to implement it without such a flag, but the end result looked uglier to me). On the way I've simplified a few things, and all the tests we accumulated before are still passing. I'm almost sure it's possible to implement some parts of the code more elegantly, but I don't see yet how. > I think multiplying two ScanDirections to watch for a negative result is > pretty ugly: Probably, but the only alternative I see to check if directions are opposite is to check that directions come in pairs (back, forth), (forth, back). Is there an easier way? > I think "scanstart" needs more documentation, both in the SGML docs as > well as the code comments surrounding it. I was able to remove it after another round of simplification.
Attachment
On 2019-Sep-22, Dmitry Dolgov wrote: > > I think multiplying two ScanDirections to watch for a negative result is > > pretty ugly: > > Probably, but the only alternative I see to check if directions are opposite is > to check that directions come in pairs (back, forth), (forth, back). Is there > an easier way? Maybe use the ^ operator? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190923020204.GA2781@alvherre.pgsql> > On 2019-Sep-22, Dmitry Dolgov wrote: > > > > I think multiplying two ScanDirections to watch for a negative result is > > > pretty ugly: > > > > Probably, but the only alternative I see to check if directions are opposite is > > to check that directions come in pairs (back, forth), (forth, back). Is there > > an easier way? > > Maybe use the ^ operator? It's not a logical operator but a bitwise arithmetic operator, which cannot be used unless the operands are guaranteed to be 0 or 1 (as integers). In a kind-of-standard but hacky way, "(!a != !b)" works as desired, since ! is a logical operator. Wouldn't we use (a && !b) || (!a && b)? The compiler will optimize it well. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 24 Sep 2019 17:35:47 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190924.173547.226622711.horikyota.ntt@gmail.com> > At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190923020204.GA2781@alvherre.pgsql> > > On 2019-Sep-22, Dmitry Dolgov wrote: > > > > > > I think multiplying two ScanDirections to watch for a negative result is > > > > pretty ugly: > > > > > > Probably, but the only alternative I see to check if directions are opposite is > > > to check that directions come in pairs (back, forth), (forth, back). Is there > > > an easier way? > > > > Maybe use the ^ operator? > > It's not a logical operator but a bitwise arithmetic operator, > which cannot be used unless the operands are guaranteed to be 0 or 1 > (as integers). In a kind-of-standard but hacky way, "(!a != !b)" > works as desired, since ! is a logical operator. > > Wouldn't we use (a && !b) || (!a && b)? The compiler will optimize it > well. Sorry, it's not a boolean but a tristate value. From the definition (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) would suffice if NoMovement did not exist. If that is not guaranteed, (dir1 != 0 && dir1 == -dir2) ? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2019-Sep-24, Kyotaro Horiguchi wrote: > Sorry, it's not a boolean but a tristate value. From the definition > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) would suffice if > NoMovement did not exist. If that is not guaranteed, > > (dir1 != 0 && dir1 == -dir2) ? Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that definition? :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190924120627.GA12454@alvherre.pgsql> > On 2019-Sep-24, Kyotaro Horiguchi wrote: > > > Sorry, it's not a boolean but a tristate value. From the definition > > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) would suffice if > > NoMovement did not exist. If that is not guaranteed, > > > > (dir1 != 0 && dir1 == -dir2) ? > > Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that > definition? :-) Yeah, it sounds good to establish it as a part of ScanDirection's definition. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> On Wed, Sep 25, 2019 at 3:03 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190924120627.GA12454@alvherre.pgsql> > > On 2019-Sep-24, Kyotaro Horiguchi wrote: > > > > > Sorry, it's not a boolean but a tristate value. From the definition > > > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) would suffice if > > > NoMovement did not exist. If that is not guaranteed, > > > > > > (dir1 != 0 && dir1 == -dir2) ? > > > > Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that > > definition? :-) > > Yeah, it sounds good to establish it as a part of ScanDirection's > definition. Yep, this way looks better.
Attachment
On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > v27-0001-Index-skip-scan.patch Some random thoughts on this: * Is _bt_scankey_within_page() really doing the right thing within empty pages? It looks like you're accidentally using the high key when the leaf page is empty with forward scans (assuming that the leaf page isn't rightmost). You'll need to think about empty pages for both forward and backward direction scans there. Actually, using the high key in some cases may be desirable, once the details are worked out -- the high key is actually very helpful with low cardinality indexes. If you populate an index with retail insertions (i.e. don't just do a CREATE INDEX after the table is populated), and use low cardinality data in the indexed columns then you'll see this effect. You can have a few hundred index entries for each distinct value, and the page split logic added to Postgres 12 (by commit fab25024) will still manage to "trap" each set of duplicates on their own distinct leaf page. Leaf pages will have a high key that looks like the values that appear on the page to the right. The goal is for forward index scans to access the minimum number of leaf pages, especially with low cardinality data and with multi-column indexes. (See also: commit 29b64d1d) A good way to see this for yourself is to get the Wisconsin Benchmark tables (the tenk1 table and related tables from the regression tests) populated using retail insertions. "CREATE TABLE __tenk1(like tenk1 including indexes); INSERT INTO __tenk1 SELECT * FROM tenk1;" is how I like to set this up. Then you can see that we only access one leaf page easily by forcing bitmap scans (i.e. "set enable* ..."), and using "EXPLAIN (analyze, buffers) SELECT ... FROM __tenk1 WHERE ...", where the SELECT query is a simple point lookup query (bitmap scans happen to instrument the index buffer accesses in a way that makes it absolutely clear how many index page buffers were touched). 
IIRC the existing tenk1 indexes have no more than a few hundred duplicates for each distinct value in all cases, so only one leaf page needs to be accessed by simple "key = val" queries in all cases. (I imagine that the "four" index you add in the regression test would generally need to visit more than one leaf page for simple point lookup queries, but in any case the high key is a useful way of detecting a "break" in the values when indexing low cardinality data -- these breaks are generally "aligned" to leaf page boundaries.) I also like to visualize the keyspace of indexes when poking around at that stuff, generally by using some of the queries that you can find on the Wiki [1]. * The extra scankeys that you are using in most of the new nbtsearch.c code is an insertion scankey -- not a search style scankey. I think that you should try to be a bit clearer on that distinction in comments. It is already confusing now, but at least there is only zero or one insertion scankeys per scan (for the initial positioning). * There are two _bt_skip() prototypes in nbtree.h (actually, there is a btskip() and a _bt_skip()). I understand that the former is a public wrapper of the latter, but I find the naming a little bit confusing. Maybe rename _bt_skip() to something that is a little bit more suggestive of its purpose. * Suggest running pgindent on the patch. [1] https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index -- Peter Geoghegan
On Sat, Nov 2, 2019 at 11:56 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > v27-0001-Index-skip-scan.patch > > Some random thoughts on this: And now some more: * I'm confused about this code in _bt_skip(): > /* > * Advance backward but read forward. At this moment we are at the next > * distinct key at the beginning of the series. In case the scan has just > * started, we can read forward without doing anything else. Otherwise find > * the previous distinct key and the beginning of its series and read forward > * from there. To do so, go back one step, perform binary search to find > * the first item in the series and let _bt_readpage do everything else. > */ > else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir)) > { > if (!scanstart) > { > _bt_drop_lock_and_maybe_pin(scan, &so->currPos); > offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf); > > /* One step back to find a previous value */ > _bt_readpage(scan, dir, offnum); Why is it okay to call _bt_drop_lock_and_maybe_pin() like this? It looks like that will drop the lock (but not the pin) on the same buffer that you binary search with _bt_binsrch() (since the local variable "buf" is also the buf in "so->currPos"). * It also seems a bit odd that you assume that the scan is "scan->xs_want_itup", but then check that condition many times. Why bother? * Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all, rather than just unlocking the buffer directly? We'll only drop the pin for a scan that is "!scan->xs_want_itup", which is never the case within _bt_skip(). I think that the macros and stuff that manage pins and buffer locks in nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some value in trying to be consistent with existing nbtsearch.c code in ways that aren't strictly necessary.
* Not sure why you need this code after throwing an error: > else > { > elog(ERROR, "Could not read closest index tuples: %d", offnum); > pfree(so->skipScanKey); > so->skipScanKey = NULL; > return false; > } [1] https://www.postgresql.org/message-id/flat/CAH2-Wz=m674-RKQdCG+jCD9QGzN1Kcg-FOdYw4-j+5_PfcHbpQ@mail.gmail.com -- Peter Geoghegan
> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > v27-0001-Index-skip-scan.patch > > Some random thoughts on this: Thanks a lot for the comments! > * Is _bt_scankey_within_page() really doing the right thing within empty pages? > > It looks like you're accidentally using the high key when the leaf > page is empty with forward scans (assuming that the leaf page isn't > rightmost). You'll need to think about empty pages for both forward > and backward direction scans there. Yes, you're right, that's an issue I need to fix. > Actually, using the high key in some cases may be desirable, once the > details are worked out -- the high key is actually very helpful with > low cardinality indexes. If you populate an index with retail > insertions (i.e. don't just do a CREATE INDEX after the table is > populated), and use low cardinality data in the indexed columns then > you'll see this effect. Can you please elaborate a bit more? I see that using the high key will help forward index scans access the minimum number of leaf pages, but I'm not following how it is connected to _bt_scankey_within_page? Or is this commentary related in general to the whole implementation? > * The extra scankeys that you are using in most of the new nbtsearch.c > code is an insertion scankey -- not a search style scankey. I think > that you should try to be a bit clearer on that distinction in > comments. It is already confusing now, but at least there is only zero > or one insertion scankeys per scan (for the initial positioning). > > * There are two _bt_skip() prototypes in nbtree.h (actually, there is > a btskip() and a _bt_skip()). I understand that the former is a public > wrapper of the latter, but I find the naming a little bit confusing. > Maybe rename _bt_skip() to something that is a little bit more > suggestive of its purpose. > > * Suggest running pgindent on the patch.
Sure, I'll incorporate the mentioned improvements into the next patch
version (hopefully soon).

> And now some more:
>
> * I'm confused about this code in _bt_skip():

Yeah, it shouldn't be there, but rather before _bt_next, which expects an
unlocked buffer. Will fix.

> * It also seems a bit odd that you assume that the scan is
> "scan->xs_want_itup", but then check that condition many times. Why
> bother?
>
> * Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
> rather than just unlocking the buffer directly? We'll only drop the
> pin for a scan that is "!scan->xs_want_itup", which is never the case
> within _bt_skip().
>
> I think that the macros and stuff that manage pins and buffer locks in
> nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
> value in trying to be consistent with existing nbtsearch.c code in
> ways that aren't strictly necessary.

Yep, I've seen this thread, but tried to be consistent with the
surrounding core style. Probably it indeed doesn't make sense.

> * Not sure why you need this code after throwing an error:
>
> > else
> > {
> >     elog(ERROR, "Could not read closest index tuples: %d", offnum);
> >     pfree(so->skipScanKey);
> >     so->skipScanKey = NULL;
> >     return false;
> > }

Unfortunately this is just a leftover from a previous version. Sorry for
that, will get rid of it.
> On Sun, Nov 03, 2019 at 05:45:47PM +0100, Dmitry Dolgov wrote:
> > * The extra scankeys that you are using in most of the new nbtsearch.c
> > code is an insertion scankey -- not a search style scankey. I think
> > that you should try to be a bit clearer on that distinction in
> > comments. It is already confusing now, but at least there is only zero
> > or one insertion scankeys per scan (for the initial positioning).
> >
> > * There are two _bt_skip() prototypes in nbtree.h (actually, there is
> > a btskip() and a _bt_skip()). I understand that the former is a public
> > wrapper of the latter, but I find the naming a little bit confusing.
> > Maybe rename _bt_skip() to something that is a little bit more
> > suggestive of its purpose.
> >
> > * Suggest running pgindent on the patch.
>
> Sure, I'll incorporate the mentioned improvements into the next patch
> version (hopefully soon).

Here is the new version, which addresses the mentioned issues.

> > * Is _bt_scankey_within_page() really doing the right thing within empty pages?
> >
> > It looks like you're accidentally using the high key when the leaf
> > page is empty with forward scans (assuming that the leaf page isn't
> > rightmost). You'll need to think about empty pages for both forward
> > and backward direction scans there.
>
> Yes, you're right, that's an issue I need to fix.

If I haven't misunderstood something, for the purpose of this function it
makes sense to return false in the case of an empty page. That's what
I've added to the patch.

> > Actually, using the high key in some cases may be desirable, once the
> > details are worked out -- the high key is actually very helpful with
> > low cardinality indexes. If you populate an index with retail
> > insertions (i.e. don't just do a CREATE INDEX after the table is
> > populated), and use low cardinality data in the indexed columns then
> > you'll see this effect.
>
> Can you please elaborate a bit more?
> I see that using the high key will help forward index scans to access
> the minimum number of leaf pages, but I'm not following how it is
> connected to _bt_scankey_within_page? Or is this commentary related in
> general to the whole implementation?

This question is still open.
Attachment
Hi,

I've looked at the patch again - in general it seems in pretty good
shape, all the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.

The patch also tweaks the wording of some bits in the docs and comments
that I found unclear. It would be nice if a native speaker could take a
look.

A couple more comments:

1) pathkey_is_unique

The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and how it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that a pathkey is "unique" when its EC does not
match the EC of another pathkey in the list. But if that's the case,
then the function name is wrong - it does exactly the opposite (i.e. it
returns 'true' when the pathkey is *not* unique).

2) explain

I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

    /* try not to be too chatty about this in text mode */
    if (es->format != EXPLAIN_FORMAT_TEXT ||
        (es->verbose && ((Join *) plan)->inner_unique))
        ExplainPropertyBool("Inner Unique",
                            ((Join *) plan)->inner_unique,
                            es);
    break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using a non-text output format.

3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with an
index on (a,b,c) the skip scan will only kick in for queries using

    DISTINCT a
    DISTINCT a,b
    DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort the result of DISTINCT in any particular way, except that we
don't consider the other orderings "interesting", so we don't really
consider using the index (so no chance of using the skip scan).
That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
   Output: a, b, c
   Skip scan: true
   Heap Fetches: 1331
 Planning Time: 0.106 ms
 Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
   Output: a, c, b
   Group Key: t.a, t.c, t.b
   ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
         Output: a, b, c
 Planning Time: 0.076 ms
 Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
   Output: a, c, b
   Skip scan: true
   Heap Fetches: 1331
 Planning Time: 0.053 ms
 Execution Time: 12.632 ms
(6 rows)

This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult it would be to fix this here (not necessarily in v1).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
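[Editorial note] To make the ordering limitation above concrete, here is a small Python sketch. This is purely illustrative, not PostgreSQL code; all names (`INDEX_COLS`, `matches_positionally`, `matches_as_set`) are made up. It contrasts the position-sensitive prefix match the planner effectively performs today with the set-based prefix match that would also be semantically safe, since DISTINCT places no constraint on output order.

```python
# Hypothetical illustration (not PostgreSQL code): the patch currently
# matches DISTINCT expressions against the index columns positionally,
# while semantically the DISTINCT result order is unconstrained, so a
# set-based prefix match would also be safe.

INDEX_COLS = ["a", "b", "c"]  # index on t (a, b, c)

def matches_positionally(distinct_cols):
    """Order-sensitive check: DISTINCT must be a literal index prefix."""
    return distinct_cols == INDEX_COLS[:len(distinct_cols)]

def matches_as_set(distinct_cols):
    """Order-insensitive check: DISTINCT columns form some index prefix."""
    return set(distinct_cols) == set(INDEX_COLS[:len(distinct_cols)])

print(matches_positionally(["a", "b"]))       # True  -> skip scan kicks in
print(matches_positionally(["a", "c", "b"]))  # False -> falls back to seq scan
print(matches_as_set(["a", "c", "b"]))        # True  -> index could still serve it
```

The third case is exactly Tomas's `DISTINCT a,c,b` example: the columns cover an index prefix as a set, but the positional check rejects it.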
Hi Tomas,

On 11/10/19 4:18 PM, Tomas Vondra wrote:
> I've looked at the patch again - in general it seems in pretty good
> shape, all the issues I found are mostly minor.
>
> Firstly, I'd like to point out that not all of the things I complained
> about in my 2019/06/23 review got addressed. Those were mostly related
> to formatting and code style, and the attached patch fixes some (but
> maybe not all) of them.

Sorry about that !

> The patch also tweaks the wording of some bits in the docs and comments
> that I found unclear. It would be nice if a native speaker could take a
> look.
>
> A couple more comments:
>
> 1) pathkey_is_unique
>
> The one additional issue I found is pathkey_is_unique - it's not really
> explained what "unique" means and how it's different from "redundant"
> (which has quite a long explanation before pathkey_is_redundant).
>
> My understanding is that a pathkey is "unique" when its EC does not
> match the EC of another pathkey in the list. But if that's the case,
> then the function name is wrong - it does exactly the opposite (i.e. it
> returns 'true' when the pathkey is *not* unique).

Yeah, you are correct - I forgot to move that part from the _uniquekey
version of the patch.

> 2) explain
>
> I wonder if we should print the "Skip scan" info always, or similarly to
> "Inner Unique" which does this:
>
>     /* try not to be too chatty about this in text mode */
>     if (es->format != EXPLAIN_FORMAT_TEXT ||
>         (es->verbose && ((Join *) plan)->inner_unique))
>         ExplainPropertyBool("Inner Unique",
>                             ((Join *) plan)->inner_unique,
>                             es);
>     break;
>
> I'd do the same thing for skip scan - print it only in verbose mode, or
> when using a non-text output format.

I think it is of benefit to see if skip scan kicks in, but I used your
"Skip scan" suggestion.

> 3) There's an annoying limitation that for this to kick in, the order of
> expressions in the DISTINCT clause has to match the index, i.e.
> with an index on (a,b,c) the skip scan will only kick in for queries
> using
>
>     DISTINCT a
>     DISTINCT a,b
>     DISTINCT a,b,c
>
> but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
> to sort the result of DISTINCT in any particular way, except that we
> don't consider the other orderings "interesting" so we don't really
> consider using the index (so no chance of using the skip scan).
>
> That leads to pretty annoying speedups/slowdowns due to seemingly
> irrelevant changes:
>
> -- everything great, a,b,c matches an index
> test=# explain (analyze, verbose) select distinct a,b,c from t;
>                               QUERY PLAN
> -------------------------------------------------------------------
>  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
>  rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
>    Output: a, b, c
>    Skip scan: true
>    Heap Fetches: 1331
>  Planning Time: 0.106 ms
>  Execution Time: 10.843 ms
> (6 rows)
>
> -- slow, mismatch with index
> test=# explain (analyze, verbose) select distinct a,c,b from t;
>                               QUERY PLAN
> -------------------------------------------------------------------
>  HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual
>  time=802.067..802.612 rows=1331 loops=1)
>    Output: a, c, b
>    Group Key: t.a, t.c, t.b
>    ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000
>        width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
>          Output: a, b, c
>  Planning Time: 0.076 ms
>  Execution Time: 803.078 ms
> (7 rows)
>
> -- fast again, the extra ordering allows using the index again
> test=# explain (analyze, verbose) select distinct a,c,b from t order by
> a,b,c;
>                               QUERY PLAN
> -------------------------------------------------------------------
>  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
>  rows=1330
> width=12) (actual time=0.035..12.120 rows=1331 loops=1)
>    Output: a, c, b
>    Skip scan: true
>    Heap Fetches: 1331
>  Planning Time: 0.053 ms
>  Execution Time: 12.632 ms
> (6 rows)
>
> This is a more generic issue, not specific to this patch, of course. I
> think we saw it with the incremental sort patch, IIRC. I wonder how
> difficult it would be to fix this here (not necessarily in v1).

Yeah, I see it as separate from this patch as well. But it is definitely
something that should be revisited.

Thanks for your patch ! v29 using UniqueKey attached.

Best regards,
Jesper
Attachment
Hi,

On 11/11/19 1:24 PM, Jesper Pedersen wrote:
> v29 using UniqueKey attached.

Just a small update to the UniqueKey patch to hopefully keep the CFbot
happy.

Feedback, especially on the planner changes, would be greatly
appreciated.

Best regards,
Jesper
Attachment
Hi all,

I reviewed the latest version of the patch. Overall some good
improvements, I think. Please find my feedback below.

- I think I mentioned this before - it's not that big of a deal, but it
just looks weird and inconsistent to me:

create table t2 as (select a, b, c, 10 d from generate_series(1,5) a,
generate_series(1,100) b, generate_series(1,10000) c);
create index on t2 (a,b,c desc);

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
   Skip scan: true
   Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Unique  (cost=0.43..8361.56 rows=500 width=12)
   ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
         Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a DISTINCT ON (params) and having equality conditions for all
params, it falls back to the regular index scan even though there's no
reason not to use the skip scan here. It's much faster to write "b
between 5 and 5" now rather than writing "b=5". I understand this was a
limitation of the unique-keys patch at the moment, which could be
addressed in the future. I think for the sake of consistency it would
make sense if this eventually gets addressed.

- nodeIndexScan.c, line 126
This sets xs_want_itup to true in all cases (even for non skip-scans).
I don't think this is acceptable given the side-effects this has (the
page will never be unpinned in between returned tuples in
_bt_drop_lock_and_maybe_pin).

- nbtsearch.c, _bt_skip, line 1440
_bt_update_skip_scankeys(scan, indexRel); This function is called twice
now - once in the else {} and immediately after that outside of the
else. The second call can be removed, I think.

- nbtsearch.c, _bt_skip, line 1597
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks
incorrect?

- nbtsearch.c, _bt_skip, line 1440
if (BTScanPosIsValid(so->currPos) &&
    _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without
having a read lock on it?

- nbtsearch.c, line 1634
if (_bt_readpage(scan, indexdir, offnum)) ...
else
    error()

Is it really guaranteed that a match can be found on the page itself?
Isn't it possible that an extra index condition, not part of the scan
key, makes none of the keys on the page match?

- nbtsearch.c in general
Most of the code seems to rely quite heavily on the fact that
xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the
buffer pin. Have you considered that compacting of a page may still
happen even if you hold the pin? [1] I've been trying to come up with
cases in which this may break the patch, but I haven't been able to
produce such a scenario - so it may be fine. But it would be good to
consider again. One thing I was thinking of was a scenario where page
splits and/or compacting would happen in between returning tuples.
Could this break the _bt_scankey_within_page check such that it thinks
the scan key is within the current page, while it actually isn't?
Mainly for backward and/or cursor scans. Forward scans shouldn't be a
problem, I think. Apologies for being a bit vague, as I don't have a
clear example ready of when it would go wrong.
It may well be fine, but it was one of the things on my mind.

[1] https://postgrespro.com/list/id/1566683972147.11682@Optiver.com

-Floris
Hi Floris,

On 1/15/20 8:33 AM, Floris Van Nee wrote:
> I reviewed the latest version of the patch. Overall some good
> improvements, I think. Please find my feedback below.

Thanks for your review !

> - I think I mentioned this before - it's not that big of a deal, but it
> just looks weird and inconsistent to me:
>
> create table t2 as (select a, b, c, 10 d from generate_series(1,5) a,
> generate_series(1,100) b, generate_series(1,10000) c);
> create index on t2 (a,b,c desc);
>
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
>    Skip scan: true
>    Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
> (3 rows)
>
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
>                                         QUERY PLAN
> -----------------------------------------------------------------------------------------
>  Unique  (cost=0.43..8361.56 rows=500 width=12)
>    ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
>          Index Cond: ((a = 2) AND (b = 5))
> (3 rows)
>
> When doing a DISTINCT ON (params) and having equality conditions for
> all params, it falls back to the regular index scan even though there's
> no reason not to use the skip scan here. It's much faster to write "b
> between 5 and 5" now rather than writing "b=5". I understand this was a
> limitation of the unique-keys patch at the moment, which could be
> addressed in the future. I think for the sake of consistency it would
> make sense if this eventually gets addressed.

Agreed, that is an improvement that should be made. I would like David's
view on this since it relates to the UniqueKey patch.

> - nodeIndexScan.c, line 126
> This sets xs_want_itup to true in all cases (even for non skip-scans).
> I don't think this is acceptable given the side-effects this has (the
> page will never be unpinned in between returned tuples in
> _bt_drop_lock_and_maybe_pin).

Correct - fixed.

> - nbtsearch.c, _bt_skip, line 1440
> _bt_update_skip_scankeys(scan, indexRel); This function is called twice
> now - once in the else {} and immediately after that outside of the
> else. The second call can be removed, I think.

Yes, removed the "else" call site.

> - nbtsearch.c, _bt_skip, line 1597
> LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
>
> This is an UNLOCK followed by a read of the unlocked page. That looks
> incorrect?

Yes, that needed to be changed.

> - nbtsearch.c, _bt_skip, line 1440
> if (BTScanPosIsValid(so->currPos) &&
>     _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
>
> Is it allowed to look at the high key / low key of the page without
> having a read lock on it?

In case of a split the page will still contain a high key and a low key,
so this should be ok.

> - nbtsearch.c, line 1634
> if (_bt_readpage(scan, indexdir, offnum)) ...
> else
>     error()
>
> Is it really guaranteed that a match can be found on the page itself?
> Isn't it possible that an extra index condition, not part of the scan
> key, makes none of the keys on the page match?

The logic for this has been changed.

> - nbtsearch.c in general
> Most of the code seems to rely quite heavily on the fact that
> xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the
> buffer pin. Have you considered that compacting of a page may still
> happen even if you hold the pin? [1] I've been trying to come up with
> cases in which this may break the patch, but I haven't been able to
> produce such a scenario - so it may be fine. But it would be good to
> consider again. One thing I was thinking of was a scenario where page
> splits and/or compacting would happen in between returning tuples.
> Could this break the _bt_scankey_within_page check such that it thinks
> the scan key is within the current page, while it actually isn't?
> Mainly for backward and/or cursor scans. Forward scans shouldn't be a
> problem, I think. Apologies for being a bit vague, as I don't have a
> clear example ready of when it would go wrong. It may well be fine,
> but it was one of the things on my mind.

There is a BT_READ lock in place when finding the correct leaf page, or
searching within the leaf page itself. _bt_vacuum_one_page deletes only
LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do
you have some feedback for this ?

Please find attached the updated patches that Dmitry and I made.

Thanks again !

Best regards,
Jesper
Attachment
On Mon, Jan 20, 2020 at 11:01 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> > - nbtsearch.c, _bt_skip, line 1440
> > if (BTScanPosIsValid(so->currPos) &&
> >     _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> >
> > Is it allowed to look at the high key / low key of the page without
> > having a read lock on it?
>
> In case of a split the page will still contain a high key and a low
> key, so this should be ok.

This is definitely not okay.

> > - nbtsearch.c in general
> > Most of the code seems to rely quite heavily on the fact that
> > xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the
> > buffer pin. Have you considered that compacting of a page may still
> > happen even if you hold the pin? [1] I've been trying to come up with
> > cases in which this may break the patch, but I haven't been able to
> > produce such a scenario - so it may be fine.

Try making _bt_findinsertloc() call _bt_vacuum_one_page() whenever the
page is P_HAS_GARBAGE(), regardless of whether or not the page is about
to split. That will still be correct, while having a much better chance
of breaking the patch during stress-testing.

Relying on a buffer pin to prevent the B-Tree structure itself from
changing in any important way seems likely to be broken already. Even if
it isn't, it sounds fragile.

A leaf page doesn't really have anything called a low key. It usually
has a current first "data item"/non-pivot tuple, which is an inherently
unstable thing. Also, it has a very loose relationship with the high key
of the left sibling page, which is the closest thing to a low key that
exists (often they'll have almost the same key values, but that is not
guaranteed at all). While I haven't studied the patch, the logic within
_bt_scankey_within_page() seems fishy to me for that reason.

> There is a BT_READ lock in place when finding the correct leaf page,
> or searching within the leaf page itself.
> _bt_vacuum_one_page deletes only LP_DEAD tuples, but those are already
> ignored in _bt_readpage. Peter, do you have some feedback for this ?

It sounds like the design of the patch relies on doing something other
than stopping a scan "between" pages, in the sense that is outlined in
the commit message of commit 09cb5c0e. If so, then that's a serious flaw
in its design.

--
Peter Geoghegan
On Mon, Jan 20, 2020 at 1:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 20, 2020 at 11:01 AM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > > - nbtsearch.c, _bt_skip, line 1440
> > > if (BTScanPosIsValid(so->currPos) &&
> > >     _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> > >
> > > Is it allowed to look at the high key / low key of the page without
> > > having a read lock on it?
> >
> > In case of a split the page will still contain a high key and a low
> > key, so this should be ok.
>
> This is definitely not okay.

I suggest that you find a way to add assertions to code like
_bt_readpage() that verify that we do in fact have the buffer content
lock. Actually, there is an existing assertion here that covers the pin,
but not the buffer content lock:

static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
{
    <declare variables>
    ...

    /*
     * We must have the buffer pinned and locked, but the usual macro can't be
     * used here; this function is what makes it good for currPos.
     */
    Assert(BufferIsValid(so->currPos.buf));

You can add another assertion that calls a new utility function in
bufmgr.c. That can use the same logic as this existing assertion in
FlushOneBuffer():

    Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

We haven't needed assertions like this so far because it's usually clear
whether or not a buffer lock is held (plus the bufmgr.c assertions help
on their own). The fact that it isn't clear whether or not a buffer lock
will be held by the caller here suggests a problem. Even still, having
some guard rails in the form of these assertions could be helpful. Also,
it seems like _bt_scankey_within_page() should have a similar set of
assertions.

BTW, there is a paper that describes optimizations like loose index scan
and skip scan together, in fairly general terms: "Efficient Search of
Multidimensional B-Trees".
Loose index scans are given the name "MDAM duplicate elimination" in the
paper. See:

http://vldb.org/conf/1995/P710.PDF

Goetz Graefe told me about the paper. It seems like the closest thing
that exists to a taxonomy or conceptual framework for these techniques.

--
Peter Geoghegan
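[Editorial note] The "MDAM duplicate elimination" idea the paper and this patch build on can be sketched in a few lines of standalone Python. This is a toy model, not PostgreSQL code: a sorted list stands in for the index, and `bisect_right` stands in for re-descending the B-Tree to the first entry past the current value, which is why the scan touches work proportional to the number of distinct values rather than the number of rows.

```python
from bisect import bisect_right

def distinct_via_skip(index_entries):
    """Toy loose index scan: repeatedly binary-search past the current
    value instead of visiting every entry, mimicking how a B-Tree skip
    scan re-descends from the root to find the next distinct key."""
    result = []
    pos = 0
    while pos < len(index_entries):
        value = index_entries[pos]
        result.append(value)
        # "Skip": jump straight past all duplicates of the current value.
        pos = bisect_right(index_entries, value, pos)
    return result

# The thread's 10-million-row example with i % 3 has just three distinct
# values; a small stand-in shows the same effect.
entries = sorted(i % 3 for i in range(30))
print(distinct_via_skip(entries))  # [0, 1, 2]
```

With three distinct values the loop body runs three times regardless of table size, mirroring the "Heap Fetches: 3" in the plan at the top of the thread.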
> On Mon, Jan 20, 2020 at 05:05:33PM -0800, Peter Geoghegan wrote:
>
> I suggest that you find a way to add assertions to code like
> _bt_readpage() that verify that we do in fact have the buffer content
> lock. Actually, there is an existing assertion here that covers the
> pin, but not the buffer content lock:
>
> static bool
> _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> {
>     <declare variables>
>     ...
>
>     /*
>      * We must have the buffer pinned and locked, but the usual macro can't be
>      * used here; this function is what makes it good for currPos.
>      */
>     Assert(BufferIsValid(so->currPos.buf));
>
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
>     Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it's usually
> clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own). The fact that it isn't clear whether or
> not a buffer lock will be held by the caller here suggests a problem.
> Even still, having some guard rails in the form of these assertions
> could be helpful. Also, it seems like _bt_scankey_within_page() should
> have a similar set of assertions.

Thanks for the suggestion. Agreed, we will add such guards. It seems
that in general I need to go through the locking in the patch one more
time, since there are some gaps I didn't notice/didn't know about
before.

> BTW, there is a paper that describes optimizations like loose index
> scan and skip scan together, in fairly general terms: "Efficient
> Search of Multidimensional B-Trees". Loose index scans are given the
> name "MDAM duplicate elimination" in the paper. See:
>
> http://vldb.org/conf/1995/P710.PDF
>
> Goetz Graefe told me about the paper. It seems like the closest thing
> that exists to a taxonomy or conceptual framework for these
> techniques.
Yes, I've read this paper, as it's indeed the only reference I found
about this topic in the literature. But unfortunately it's not much,
and (at least on a first read) it gives only an overview of the idea.
> On Mon, Jan 20, 2020 at 01:19:30PM -0800, Peter Geoghegan wrote:

Thanks for the comments. I'm trying to clarify your conclusions for
myself, so a couple of questions.

> > > - nbtsearch.c in general
> > > Most of the code seems to rely quite heavily on the fact that
> > > xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release
> > > the buffer pin. Have you considered that compacting of a page may
> > > still happen even if you hold the pin? [1] I've been trying to
> > > come up with cases in which this may break the patch, but I
> > > haven't been able to produce such a scenario - so it may be fine.
>
> Try making _bt_findinsertloc() call _bt_vacuum_one_page() whenever the
> page is P_HAS_GARBAGE(), regardless of whether or not the page is
> about to split. That will still be correct, while having a much better
> chance of breaking the patch during stress-testing.
>
> Relying on a buffer pin to prevent the B-Tree structure itself from
> changing in any important way seems likely to be broken already. Even
> if it isn't, it sounds fragile.

Except for checking the low/high key (which should be done with a lock),
I believe the current implementation follows the same pattern I see
quite often, namely:

* get a lock on a page of interest and test its values (if we can find
  the next distinct value right on the next one without going down the
  tree)

* if not, unlock the current page, search within the tree with
  _bt_search (which locks a resulting new page) and examine values on
  the new page, doing _bt_steppage when necessary

Is there an obvious problem with this approach when it comes to page
structure modification?

> A leaf page doesn't really have anything called a low key. It usually
> has a current first "data item"/non-pivot tuple, which is an
> inherently unstable thing.

Would this inherent instability be resolved for this particular case by
having a lock on the page while checking the first data item, or is
there something else I need to take into account?
> > There is a BT_READ lock in place when finding the correct leaf page,
> > or searching within the leaf page itself. _bt_vacuum_one_page
> > deletes only LP_DEAD tuples, but those are already ignored in
> > _bt_readpage. Peter, do you have some feedback for this ?
>
> It sounds like the design of the patch relies on doing something other
> than stopping a scan "between" pages, in the sense that is outlined in
> the commit message of commit 09cb5c0e. If so, then that's a serious
> flaw in its design.

Could you please elaborate on why it sounds like that? If I understand
correctly, to stop a scan only "between" pages one needs to use only
_bt_readpage/_bt_steppage? Other than that there is no magic with the
scan position in the patch, so I'm not sure if I'm missing something
here.
Hi Peter,

Thanks for your feedback; Dmitry has followed up with some additional
questions.

On 1/20/20 8:05 PM, Peter Geoghegan wrote:
>> This is definitely not okay.
>
> I suggest that you find a way to add assertions to code like
> _bt_readpage() that verify that we do in fact have the buffer content
> lock.

If you apply the attached patch on master it will fail the test suite;
did you mean something else ?

Best regards,
Jesper
Attachment
> Could you please elaborate on why it sounds like that? If I understand
> correctly, to stop a scan only "between" pages one needs to use only
> _bt_readpage/_bt_steppage? Other than that there is no magic with the
> scan position in the patch, so I'm not sure if I'm missing something
> here.

Anyone please correct me if I'm wrong, but I think one case where the
current patch relies on some data from the page it has locked before is
in checking this hi/lo key. I think it's possible for the following
sequence to happen. Suppose we have a very simple one-leaf-page btree
containing four elements:

leaf page 1 = [2,4,6,8]

We do a backwards index skip scan on this and have just returned our
first tuple (8). The buffer is left pinned but unlocked. Now, someone
else comes in and inserts a tuple (value 5) into this page, but suppose
the page happens to be full. So a page split occurs. As far as I know, a
page split could happen at any random element in the page. One of the
situations we could be left with is:

leaf page 1 = [2,4]
leaf page 2 = [5,6,8]

However, our scan is still pointing to leaf page 1. For non-skip scans
this is not a problem, as we already read all matching elements in our
local buffer and we'll return those. But the skip scan currently:

a) checks the lo-key of the page to see if the next prefix can be found
   on leaf page 1
b) finds out that this is actually true
c) does a search on the page and returns value=4 (while it should have
   returned value=6)

Peter, is my understanding about the btree internals correct so far?

Now that I look at the patch again, I fear there currently may also be
such a dependency in the "Advance forward but read backward" case. It
saves the offset number of a tuple in a variable, then does a _bt_search
(releasing the lock and pin on the page).
At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum suchthat the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore.Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even thoughthey relate to the same tuple. -Floris
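To make the failure mode concrete, here is a toy Python sketch of the split scenario above (the page model, names, and bounds are all invented for illustration; this is not PostgreSQL code). A bounds check that consults only the page's low key keeps the scan on the stale page and returns 4 where 6 was expected, while checking both boundaries notices that the key moved to a right sibling:

```python
import bisect

# Toy leaf page: sorted keys plus the keyspace bounds it covers. The layout
# and all names here are invented; this only illustrates the race.
class LeafPage:
    def __init__(self, low, high, items):
        self.low, self.high, self.items = low, high, items

def within_page_low_only(page, key):
    # Mimics a check against only the low key: "could key live on this page?"
    return key > page.low

def within_page_both(page, key):
    # Checking both boundaries notices keys that moved to a right sibling.
    return page.low < key <= page.high

def search_leq(page, key):
    # Largest item <= key on this page (what a backward probe returns).
    i = bisect.bisect_right(page.items, key)
    return page.items[i - 1] if i else None

# Before: one leaf holding [2,4,6,8]. A concurrent insert of 5 splits it:
# this page keeps [2,4], a new right sibling gets [5,6,8].
old_leaf = LeafPage(low=0, high=9, items=[2, 4, 6, 8])
after_split = LeafPage(low=0, high=4, items=[2, 4])

# The backward skip scan already returned 8 and now probes below it.
assert within_page_low_only(after_split, 6)   # wrongly says "stay here"
assert search_leq(after_split, 6) == 4        # ...and returns 4, not 6
assert not within_page_both(after_split, 6)   # both bounds catch the split
assert search_leq(old_leaf, 6) == 6
```

This is also why processing a whole page at a time sidesteps the problem: nothing is re-derived from the page's bounds after the lock is released.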
> On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote: > > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before is in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is: > Leaf page 1 = [2,4] > Leaf page 2 = [5,6,8] > However, our scan is still pointing to leaf page 1.

If we have just returned a tuple, the next action would be either to check the next page for another key or to search down the tree. Maybe I'm missing something in your scenario, but the latter will land us on a required page (we do not point to any leaf here), and before the former there is a check for the high/low key. Is there anything else missing?

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward" case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

Interesting point. The original idea here was to check that we have not returned to the same position after jumping, so maybe instead of offsets we can check the tuple we found.
On Tue, Jan 21, 2020 at 9:06 AM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > If you apply the attached patch on master it will fail the test suite; > did you mean something else ? Yeah, this is exactly what I had in mind for the _bt_readpage() assertion. As I said, it isn't a great sign that this kind of assertion is even necessary in index access method code (code like bufmgr.c is another matter). Usually it's just obvious that a buffer lock is held. I can't really blame this patch for that, though. You could say the same thing about the existing "buffer pin held" _bt_readpage() assertion. It's good that it verifies what is actually a fragile assumption, even though I'd prefer to not make a fragile assumption. -- Peter Geoghegan
On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee <florisvannee@optiver.com> wrote: > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before is in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is: > Leaf page 1 = [2,4] > Leaf page 2 = [5,6,8] > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently: > a) checks the lo-key of the page to see if the next prefix can be found on leaf page 1 > b) finds out that this is actually true > c) does a search on the page and returns value=4 (while it should have returned value=6) > > Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind when I expressed a general concern about "stopping between pages". Processing a whole page at a time is a crucial part of how _bt_readpage() currently deals with concurrent page splits. Holding a buffer pin on a leaf page is only effective as an interlock against VACUUM completely removing a tuple, which could matter with non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward" case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

If skip scan is restricted to heapkeyspace indexes (i.e. those created on Postgres 12+), then it might be reasonable to save an index tuple, and relocate it within the same page using a fresh binary search that uses a scankey derived from the same index tuple -- without unsetting scantid/the heap TID scankey attribute. I suppose that you'll need to "find your place again" after releasing the buffer lock on a leaf page for a time. Also, I think that this will only be safe with MVCC scans, because otherwise the page could be concurrently deleted by VACUUM. -- Peter Geoghegan
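Peter's relocation idea might be sketched like this in toy Python (an invented model, not nbtree code; the real version would build an insertion scankey from the saved tuple and keep scantid set). The point is to remember the (key, heap TID) pair itself rather than an offset number, and find it again with a fresh binary search, which stays correct even when offsets shift:

```python
import bisect

# Toy page: a sorted list of (key, heap_tid) tuples, with heap TID acting as
# the tiebreaker column, as in heapkeyspace (v12+) indexes. Invented model.
def relocate(page_items, saved_tuple):
    # Fresh binary search for the saved tuple itself, TID included, instead
    # of trusting a previously remembered offset number.
    i = bisect.bisect_left(page_items, saved_tuple)
    if i < len(page_items) and page_items[i] == saved_tuple:
        return i          # found again, possibly at a different offset
    return None           # the tuple is gone; the caller must restart

page = [(2, 7), (4, 1), (6, 3), (8, 5)]
saved = (6, 3)
assert page.index(saved) == 2

# Concurrent compaction shifts everything: the remembered offset is stale,
# but relocating by (key, TID) still finds the same tuple.
page = [(4, 1), (6, 3), (8, 5)]
assert relocate(page, saved) == 1
assert relocate([(4, 1), (8, 5)], saved) is None
```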
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg@bowt.ie> wrote: > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits. Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? -- Peter Geoghegan
> Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once?

The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this? -Floris
> On Wed, Jan 22, 2020 at 05:24:43PM +0000, Floris Van Nee wrote: > > > > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before is in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] > > > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is: > > > Leaf page 1 = [2,4] > > > Leaf page 2 = [5,6,8] > > > However, our scan is still pointing to leaf page 1. > > > In case if we just returned a tuple, the next action would be either > >> check the next page for another key or search down to the tree. Maybe > > But it won't look at the 'next page for another key', but rather at the 'same page for another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

Yep, it would look only at the same page. Not sure what you mean by "another key"; if the current key is not found within the current page at the first stage, we restart from the root.

> > I'm missing something in your scenario, but the latter will land us on a > > required page (we do not point to any leaf here), and before the former > > there is a check for high/low key. Is there anything else missing? > > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that. > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned. > > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already. > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but by using the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.

Oh, that's what you mean. Yes, I was somehow tricked by the name of this function and didn't notice that it checks only one boundary, so in the case of a backward scan it returns the wrong result. I think in the situation you've described it would actually not find any item on the current page and restart from the root, but nevertheless we need to check for both keys in _bt_scankey_within_page.
> On Wed, Jan 22, 2020 at 09:08:59PM +0000, Floris Van Nee wrote: > > Note in particular that index scans cannot return the same index tuple twice -- processing a page at a time ensures that that cannot happen. > > > > Can a loose index scan return the same tuple (i.e. a tuple with the same heap TID) to the executor more than once? > > > > The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem.

Yes, it shouldn't happen.
On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee <florisvannee@optiver.com> wrote: > The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a problem. Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that isn't able to handle this?

I have no reason to believe that the executor has a problem with index scans that return a tuple more than once, aside from the very obvious: in general, that will often be wrong. It might not be wrong when the scan happens to be input to a unique node anyway, or something like that. I'm not particularly concerned about it. Just wanted to be clear on our assumptions for loose index scans -- if loose index scans were allowed to return a tuple more than once, then that would at least have to be considered in the wider context of the executor (but apparently they're not, so no need to worry about it). This may have been mentioned somewhere already. If it was, then I must have missed it. -- Peter Geoghegan
On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've describe it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page. I suggest reading the nbtree README file's description of backwards scans. Read the paragraph that begins with 'We support the notion of an ordered "scan" of an index...'. I also suggest that you read a bit of the stuff in the large section on page deletion. Certainly read the paragraph that begins with 'Moving left in a backward scan is complicated because...'. It's important to grok why it's okay that we don't "couple" or "crab" buffer locks as we descend the tree with Lehman & Yao's design -- we can get away with having *no* interlock against page splits (e.g., pin, buffer lock) when we are "between" levels of the tree. This is safe, since the page that we land on must still be "substantively the same page", no matter how much time passes. That is, it must at least cover the leftmost portion of the keyspace covered by the original version of the page that we saw that we needed to descend to within the parent page. The worst that can happen is that we have to recover from a concurrent page split by moving right one or more times. (Actually, page deletion can change the contents of a page entirely, but that's not really an exception to the general rule -- page deletion is careful about recycling pages that an in flight index scan might land on.) Lehman & Yao don't have backwards scans (or left links, or page deletion). Unlike nbtree. This is why the usual Lehman & Yao guarantees don't quite work with backward scans. 
We must therefore compensate as described by the README file (basically, we check and re-check for races, possibly returning to the original page when we think that we might have overlooked something and need to make sure). It's an exception to the general rule, you could say. -- Peter Geoghegan
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee > <florisvannee@optiver.com> wrote: > > As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is: > > Leaf page 1 = [2,4] > > Leaf page 2 = [5,6,8] > > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently: > > a) checks the lo-key of the page to see if the next prefix can be found on leaf page 1 > > b) finds out that this is actually true > > c) does a search on the page and returns value=4 (while it should have returned value=6) > > > > Peter, is my understanding about the btree internals correct so far? > > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a "low key". Let us once again start with a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] -- just like in Floris' original page split scenario. Let us also say that page 1 has a left sibling page -- page 0. Page 0 happens to have a high key with the integer value 0. So you could *kind of* claim that the "low key" of page 1 is the integer value 0 (page 1 values must be > 0) -- *not* the integer value 2 (the so-called "low key" here is neither > 2, nor >= 2). More formally, an invariant exists that says that all values on page 1 must be *strictly* greater than the integer value 0.
However, this formal invariant thing is hard or impossible to rely on when we actually reach page 1 and want to know about its lower bound -- since there is no "low key" pivot tuple on page 1 (we can only speak of a "low key" as an abstract concept, or something that works transitively from the parent -- there is only a physical high key pivot tuple on page 1 itself).

Suppose further that page 0 is now empty, apart from its "all values on page are <= 0" high key (page 0 must have had a few negative integer values in its tuples at some point, but not anymore). VACUUM will delete the page, *changing the effective low key* of page 1 in the process. The lower bound from the shared parent page will move lower/left as a consequence of the deletion of page 0. nbtree page deletion makes the "keyspace move right, not left". So the "conceptual low key" of page 1 just went down from 0 to -5 (say), without there being any practical way for a skip scan reading page 1 to notice the change (the left sibling of page 0, page -1, has a high key of <= -5, say). Not only is it possible for somebody to insert the value 1 in page 1 -- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed to page 0 is still there -- all that page deletion changed about this tuple is its downlink, which now points to page 1 instead of page 0. Confusingly, page deletion removes the pivot tuple of the right sibling page from the parent -- *not* the pivot tuple of the empty page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in truncated pivot tuples, and the heap TID tiebreaker column -- in reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has a huge comment that reasons about a race involving page deletion.
In general, page deletion is by far the biggest source of complexity when reasoning about the key space. -- Peter Geoghegan
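A toy Python model of the keyspace effect Peter describes (invented structures; real page deletion also involves half-dead pages, downlink fixup, and WAL): deleting the empty page 0 leaves its parent pivot in place but redirects the downlink to page 1, so keys that used to route to page 0's keyspace now land on page 1; page 1's effective low key drops without page 1 itself changing.

```python
# Toy parent page: (low_bound, child) pairs sorted by bound; a child holds
# keys strictly greater than its bound, up to the next separator. All names
# and the routing function are invented for illustration.
def route(parent, key):
    child = None
    for bound, c in parent:
        if key > bound:
            child = c
        else:
            break
    return child

NEG_INF = float("-inf")

# page -1 covers (-inf, -5], page 0 covers (-5, 0], page 1 covers (0, ...).
parent = [(NEG_INF, "page-1"), (-5, "page0"), (0, "page1")]
assert route(parent, -3) == "page0"
assert route(parent, 1) == "page1"

# Page 0 is empty and gets deleted: its pivot stays, but the downlink now
# points to page 1, and page 1's own pivot is removed from the parent.
parent_after = [(NEG_INF, "page-1"), (-5, "page1")]
assert route(parent_after, -3) == "page1"   # page 1's effective low key dropped
assert route(parent_after, 1) == "page1"    # and it still covers its old keys
```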
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote: > > > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that. > > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned. > > > > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already. > > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but by using the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice. > > Oh, that's what you mean. Yes, I was somehow tricked by the name of this > function and didn't notice that it checks only one boundary, so in case > of backward scan it returns wrong result. I think in the situation > you've described it would actually not find any item on the current page > and restart from the root, but nevertheless we need to check for both > keys in _bt_scankey_within_page.

Thanks again everyone for the commentary and clarifications. Here is a version where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip comments, previously we were moving left to check the next page, to avoid significant issues in case ndistinct was underestimated and we need to skip too often. To make this work safely in the presence of splits we need to remember the original page and move right again until we find a page whose right link points to it. It's not clear whether it's worth increasing complexity for this sort of "edge case" with ndistinct estimation while moving left, so at least for now we ignore this in the implementation and just start from the root immediately.

The offset-based code for moving forward/reading backward was replaced with remembering a start index tuple and attempting to find it on the new page. Also, a missing page lock before _bt_scankey_within_page was added, and _bt_scankey_within_page now checks both page boundaries.
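The move-right recovery mentioned above (remember the original page, then step right from the stale left pointer until some page's right link points back at it; similar in spirit to what nbtree's _bt_walk_left() does) can be sketched in toy Python (invented structures, not the real code):

```python
# Toy sibling chain: blkno -> {"right": next_blkno or None}. Invented model
# of a leaf level where concurrent splits may insert new right siblings.
def find_true_left_sibling(pages, remembered_left, original):
    # Start from the (possibly stale) remembered left neighbor and follow
    # right links until we reach the page whose right link points at the
    # original page; that page is the current true left sibling.
    blk = remembered_left
    while blk is not None:
        if pages[blk]["right"] == original:
            return blk
        blk = pages[blk]["right"]
    return None

# Before: 10 -> 20. A concurrent split of page 10 inserts page 15 between
# them (10 -> 15 -> 20), so the remembered left sibling 10 is stale.
pages = {10: {"right": 15}, 15: {"right": 20}, 20: {"right": None}}
assert find_true_left_sibling(pages, remembered_left=10, original=20) == 15
```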
Attachment
Hi Dmitry,

Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

drop table if exists t;
create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a, generate_series(1,1000) b;
create index on t (a,b,c,d);

-- correct
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(5 rows)

 a |  b   | c | d
---+------+---+----
 1 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 3 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 5 | 1000 | 0 | 10
(5 rows)

-- now delete some rows
postgres=# delete from t where a=3;
DELETE 1000

-- and rerun: error is thrown
postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a |  b   | c | d
---+------+---+----
 5 | 1000 | 0 | 10
 4 | 1000 | 0 | 10
 2 | 1000 | 0 | 10
 1 | 1000 | 0 | 10
(4 rows)

ERROR:  lock buffer_content is not held
ROLLBACK

A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:
-- recreate table again and execute the delete as above

postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a, b; fetch forward all from c; fetch backward all from c; commit;
BEGIN
DECLARE CURSOR
 a | b | c | d
---+---+---+----
 1 | 1 | 1 | 10
 2 | 1 | 1 | 10
 4 | 1 | 1 | 10
 5 | 1 | 1 | 10
(4 rows)

 a |  b  | c | d
---+-----+---+----
 5 |   1 | 1 | 10
 4 |   1 | 1 | 10
 2 | 827 | 1 | 10
 1 |   1 | 1 | 10
(4 rows)

COMMIT

And lastly, you'll also get incorrect results if you do the delete slightly differently:
-- leave one row where a=3 and b=1000
postgres=# delete from t where a=3 and b<=999;
-- the cursor query above won't show any of the a=3 rows even though they should

-Floris
Oh, interesting, thank you. I believe I know what happened; there is one unnecessary locking part that eventually gives only problems, plus one direct access to page items without _bt_readpage. Will post a new version soon.

On Mon, Jan 27, 2020 at 3:00 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> Hi Dmitry,
>
> Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:
>
> drop table if exists t;
> create table t as select a,b,b%2 as c,10 as d from generate_series(1,5) a, generate_series(1,1000) b;
> create index on t (a,b,c,d);
>
> -- correct
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (5 rows)
>
>  a |  b   | c | d
> ---+------+---+----
>  1 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  3 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  5 | 1000 | 0 | 10
> (5 rows)
>
> -- now delete some rows
> postgres=# delete from t where a=3;
> DELETE 1000
>
> -- and rerun: error is thrown
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a desc, b desc; fetch forward all from c; fetch backward all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a |  b   | c | d
> ---+------+---+----
>  5 | 1000 | 0 | 10
>  4 | 1000 | 0 | 10
>  2 | 1000 | 0 | 10
>  1 | 1000 | 0 | 10
> (4 rows)
>
> ERROR:  lock buffer_content is not held
> ROLLBACK
>
> A slightly different situation arises when executing the cursor with an ORDER BY a, b instead of the ORDER BY a DESC, b DESC:
> -- recreate table again and execute the delete as above
>
> postgres=# begin; declare c scroll cursor for select distinct on (a) a,b,c,d from t order by a, b; fetch forward all from c; fetch backward all from c; commit;
> BEGIN
> DECLARE CURSOR
>  a | b | c | d
> ---+---+---+----
>  1 | 1 | 1 | 10
>  2 | 1 | 1 | 10
>  4 | 1 | 1 | 10
>  5 | 1 | 1 | 10
> (4 rows)
>
>  a |  b  | c | d
> ---+-----+---+----
>  5 |   1 | 1 | 10
>  4 |   1 | 1 | 10
>  2 | 827 | 1 | 10
>  1 |   1 | 1 | 10
> (4 rows)
>
> COMMIT
>
> And lastly, you'll also get incorrect results if you do the delete slightly differently:
> -- leave one row where a=3 and b=1000
> postgres=# delete from t where a=3 and b<=999;
> -- the cursor query above won't show any of the a=3 rows even though they should
>
> -Floris
>
> On Mon, Jan 27, 2020 at 02:00:39PM +0000, Floris Van Nee wrote: > > Thanks for the new patch! I tested it and managed to find a case that causes some issues. Here's how to reproduce:

So, after a bit of investigation I found out the issue (it was actually there even in the previous version). In the one case of moving forward and reading backward, exactly the scenarios you've described above, the current implementation was not ignoring deleted tuples. My first idea to fix this was to use _bt_readpage when necessary and put a couple of _bt_killitems calls where we leave a page while jumping, so that deleted tuples will be ignored.

To demonstrate it visually, let's say we want to go backward on a cursor over an ORDER BY a DESC, b DESC query, i.e. return: (1,100), (2,100), (3,100) etc. To achieve that we jump from (1,1) to (1,100), from (2,1) to (2,100) and so on. If some values are deleted, we need to read backward. E.g. if (3,100) is deleted, we need to return (3,99).

    +---------------+---------------+---------------+---------------+
    |               |               |               |               |
    | 1,1 ... 1,100 | 2,1 ... 2,100 | 3,1 ... 3,100 | 4,1 ... 4,100 |
    |               |               |               |               |
    +---------------+---------------+---------------+---------------+
      |           ^   |           ^   |           ^   |           ^
      |           |   |           |   |           |   |           |
      +-----------+   +-----------+   +-----------+   +-----------+

If it happens that a whole value series is deleted, we return to the previous value and need to detect such a situation. E.g. if all the values from (3,1) to (3,100) were deleted, we will return to (2,100).

    +---------------+---------------+               +---------------+
    |               |               |               |               |
    | 1,1 ... 1,100 | 2,1 ... 2,100 |<--------------+ 4,1 ... 4,100 |
    |               |               |               |               |
    +---------------+---------------+               +---------------+
      |           ^   |           ^                   |           ^
      |           |   |           |                   |           |
      +-----------+   +-----------+                   +-----------+
                  ^                                   |
                  +-----------------------------------+

This is all implemented inside _bt_skip.

Unfortunately, as I see it now, the idea of relying on the ability to skip dead index tuples without checking a heap tuple is not reliable, since an index tuple will be added into killedItems and can be marked as dead only when not a single transaction can see it anymore. Potentially there are two possible solutions:

* Adjust the code in nodeIndexonlyscan.c to perform a proper visibility check and understand whether we have returned back. Obviously this will make the patch more invasive.

* Reduce the scope of the patch, and simply do not apply jumping in this case. This means less functionality but hopefully still brings some value.

At this point Jesper and I are inclined to go with the second option. But maybe I'm missing something; are there any other suggestions?
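The jumping-and-reading-backward scheme in the diagrams can be sketched in toy Python (an invented in-memory model; the real logic lives in _bt_skip, and the sketch's dead_in_heap set stands in for the heap visibility information the index alone does not have). It illustrates why a per-tuple visibility check is unavoidable:

```python
import bisect

def backward_skip_scan(index, dead_in_heap):
    # Toy backward skip scan over sorted (a, b) tuples: return the highest
    # *live* b for each a, with a descending. `dead_in_heap` stands in for
    # tuples deleted in the heap but not yet marked dead in the index; the
    # scan must consult it for every candidate tuple before returning it.
    out, returned_a = [], set()
    i = len(index)
    while i > 0:
        a, b = index[i - 1]
        if (a, b) in dead_in_heap:
            i -= 1            # dead: read backward within the series
        elif a in returned_a:
            i -= 1            # whole series was dead; don't return a twice
        else:
            out.append((a, b))
            returned_a.add(a)
            # jump: land just before the first tuple of this a-series
            i = bisect.bisect_left(index, (a,))
    return out

idx = [(a, b) for a in (1, 2, 3, 4) for b in range(1, 5)]
# (3,4) deleted: reading backward returns (3,3) instead.
assert backward_skip_scan(idx, {(3, 4)}) == [(4, 4), (3, 3), (2, 4), (1, 4)]
# The whole a=3 series deleted: the scan falls back to (2,4) once.
assert backward_skip_scan(idx, {(3, b) for b in range(1, 5)}) == [(4, 4), (2, 4), (1, 4)]
```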
> this point me and Jesper inclined to go with the second option. But maybe > I'm missing something, are there any other suggestions?

Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better not to skip in situations like this. I think it'd make most sense to ensure that any plan for these 'prepare/fetch' queries doesn't use skip, but rather a materialize node, right? -Floris
> On Tue, Feb 04, 2020 at 08:34:09PM +0000, Floris Van Nee wrote: > > > this point me and Jesper inclined to go with the second option. But maybe > > I'm missing something, are there any other suggestions? > > Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better not to skip in situations like this. I think it'd make most sense to ensure that any plan for these 'prepare/fetch' queries doesn't use skip, but rather a materialize node, right?

Yes, sort of; without a skip scan it would be just an index only scan with unique on top. Actually, it's not immediately clear how to achieve this, since at the moment, when the planner is deciding whether to consider an index skip scan, there is no information about either the direction or whether we're dealing with a cursor. Maybe we can somehow signal to the decision logic that the root was a DeclareCursorStmt by e.g. introducing a new field in the query structure (or abusing an existing one; since DeclareCursorStmt is being processed by standard_ProcessUtility, just for a test I've tried to use utilityStmt of a nested statement, hoping that it's unused, and it didn't break tests yet).
At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in > > On Tue, Feb 04, 2020 at 08:34:09PM +0000, Floris Van Nee wrote: > > > > > this point me and Jesper inclined to go with the second option. But maybe > > > I'm missing something, are there any other suggestions? > > > > Unfortunately I figured this would need a more invasive fix. I tend to agree that it'd be better not to skip in situations like this. I think it'd make most sense to ensure that any plan for these 'prepare/fetch' queries doesn't use skip, but rather a materialize node, right? > > Yes, sort of; without a skip scan it would be just an index only scan > with unique on top. Actually it's not immediately clear how to achieve > this, since at the moment, when the planner is deciding whether to consider > an index skip scan, there is no information about either the direction or > whether we're dealing with a cursor. Maybe we can somehow signal to the decision > logic that the root was a DeclareCursorStmt by e.g. introducing a new > field to the query structure (or abusing an existing one, since > DeclareCursorStmt is being processed by standard_ProcessUtility, just > for a test I've tried to use utilityStmt of a nested statement hoping > that it's unused and it didn't break tests yet).

Umm. I think it's the wrong direction. While defining a cursor, default scrollability is decided based on whether the query allows backward scans or not. That is, the definition of backward-scan'ability is not just whether it can scan from the end toward the beginning, but whether it can go back and forth freely or not. By that definition, the *current* skip scan does not support backward scans. If we want to allow descending order-by in a query, we should support scrollable cursors, too.

We could add an additional parameter "in_cursor" to ExecSupportBackwardScan and let skip scan return false if in_cursor is true, but I'm not sure it's acceptable.

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
> On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
>
> Umm. I think that's the wrong direction. While defining a cursor, the
> default scrollability is decided based on whether the query allows
> backward scan or not. That is, the definition of backward-scan'ability is
> not just whether it can scan from the end toward the beginning, but
> whether it can go back and forth freely or not. By that definition, the
> *current* skip scan does not support backward scan. If we want to allow a
> descending order-by in a query, we should support scrollable cursors,
> too.
>
> We could add an additional parameter "in_cursor" to
> ExecSupportBackwardScan and let skip scan return false if in_cursor is
> true, but I'm not sure that's acceptable.

I also was thinking about whether it's possible to use
ExecSupportBackwardScan here, but skip scan is just a mode of an
index/index only scan. Which means that ExecSupportBackwardScan would also
need to know somehow whether this mode is being used, and then, since this
function is called after it's already been decided to use skip scan in the
resulting plan, somehow correct the plan (exclude skipping and try to find
the next best path?) - do I understand your suggestion correctly?
At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
> I also was thinking about whether it's possible to use
> ExecSupportBackwardScan here, but skip scan is just a mode of an
> index/index only scan. Which means that ExecSupportBackwardScan would also
> need to know somehow whether this mode is being used, and then, since this
> function is called after it's already been decided to use skip scan in the
> resulting plan, somehow correct the plan (exclude skipping and try to find
> the next best path?) - do I understand your suggestion correctly?

I hadn't thought it through that far, but a bit of checking told me that
IndexSupportsBackwardScan returns a fixed flag for the AM. It seems that
things are not that simple.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry, I forgot to write a more significant thing.

On 2020/02/06 21:22, Kyotaro Horiguchi wrote:
> At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
>> I also was thinking about whether it's possible to use
>> ExecSupportBackwardScan here, but skip scan is just a mode of an
>> index/index only scan. Which means that ExecSupportBackwardScan would also
>> need to know somehow whether this mode is being used, and then, since this
>> function is called after it's already been decided to use skip scan in the
>> resulting plan, somehow correct the plan (exclude skipping and try to find
>> the next best path?) - do I understand your suggestion correctly?

No. I thought of the opposite thing. I meant that
IndexSupportsBackwardScan should return false if the Index(Only)Scan is
going to do a skip scan. But I found that the function has access to
neither the plan node nor the executor node. So I wrote the following:

> I hadn't thought it through that far, but a bit of checking told me that
> IndexSupportsBackwardScan returns a fixed flag for the AM. It seems that
> things are not that simple.

regards.

--
Kyotaro Horiguchi
> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote:
>
> I hadn't thought it through that far, but a bit of checking told me that
> IndexSupportsBackwardScan returns a fixed flag for the AM. It seems that
> things are not that simple.

Yes, I've mentioned that already in one of the previous emails :) The
simplest way I see to achieve what we want is to do something like in the
attached modified version with a new hasDeclaredCursor field. It's not a
final version though, but posted just for discussion, so feel free to
suggest any improvements or alternatives.
On Fri, Feb 07, 2020 at 05:25:43PM +0100, Dmitry Dolgov wrote:
>Yes, I've mentioned that already in one of the previous emails :) The
>simplest way I see to achieve what we want is to do something like in the
>attached modified version with a new hasDeclaredCursor field. It's not a
>final version though, but posted just for discussion, so feel free to
>suggest any improvements or alternatives.

IMO the proper fix for this case (moving forward, reading backwards) is
simply making it work by properly checking deleted tuples etc. Not sure
why that would be so much more complex (haven't tried implementing it)?

I think making this depend on things like declared cursors etc. is going
to be tricky, may easily be more complex than checking deleted tuples, and
the behavior may be quite surprising.
regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

I've done some testing and benchmarking of the v31 patch, looking for
regressions, costing issues etc. Essentially, I've run a bunch of SELECT
DISTINCT queries on data sets of various sizes, numbers of distinct values
etc. The results are fairly large, so I've uploaded them to github

https://github.com/tvondra/skip-scan-test

There are four benchmark groups, depending on how the data are generated,
the availability of extended stats, and whether the columns are
independent:

1) skipscan - just indexes, columns are independent
2) skipscan-with-stats - indexes and extended stats, independent columns
3) skipscan-correlated - just indexes, correlated columns
4) skipscan-correlated-with-stats - indexes and extended stats, correlated columns

The github repository contains *.ods spreadsheets comparing duration with
the regular query plan (no skip scan) and with skip scan. In general,
there are pretty massive speedups, often by about two orders of magnitude.

There are a couple of regressions, where the plan with skipscan enabled is
~10x slower. But this seems to happen only in high-cardinality cases where
we misestimate the number of groups. Consider a table with two independent
columns

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
  md5((10000*random())::int::text),
  md5((10000*random())::int::text)
FROM generate_series(1,1000000) s(i);

CREATE INDEX ON t(a,b);

ANALYZE;

which then behaves like this:

test=# select * from (select distinct a,b from t) foo offset 10000000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 10000000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index
only scan + unique). The reason why this happens is pretty simple - to
estimate the number of groups we multiply the ndistinct estimates for the
two columns (which both have n_distinct = 10000), but then we cap the
estimate to 10% of the table. But when the columns are independent with
high cardinalities that under-estimates the actual value, making the cost
for skip scan much lower than it should be.

I don't think this is an issue the skipscan patch needs to fix, though.
Firstly, the regressed cases are a tiny minority. Secondly, we already
have a way to improve the root cause - creating extended stats with
ndistinct coefficients generally makes the problem go away.

One interesting observation however is that this regression only happened
with text columns, not with int or bigint. My assumption is that this is
due to text comparisons being much more expensive. Not sure if there is
something we could do to deal with this - reduce the number of comparisons
or something?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
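[To make the misestimate described above concrete, here is a toy model of
the capped group-count estimate (multiply the per-column ndistinct values,
then cap at 10% of the table). It is an illustration of the behaviour
discussed in the thread only, not the planner's actual estimation code.]

```python
# Toy model of the group-count estimate discussed above: multiply the
# per-column ndistinct estimates, then cap the product at 10% of the
# table size. Illustration only -- not PostgreSQL's actual code.
def estimated_groups(ndistincts, ntuples, cap_fraction=0.1):
    est = 1.0
    for nd in ndistincts:
        est *= nd
    return min(est, cap_fraction * ntuples)

# Two independent columns, n_distinct = 10000 each, in a 1M-row table:
# the raw product is 10^8, so the 10% cap produces 100000 groups -- but
# with independent high-cardinality columns nearly all 1M rows are
# distinct, so the skip scan looks roughly 10x cheaper than it really is.
print(estimated_groups([10000, 10000], 1_000_000))  # 100000.0
```

[Extended statistics with ndistinct coefficients, as mentioned in the
email above, sidestep this by giving the planner a joint ndistinct
estimate for (a, b) rather than the capped per-column product.]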
OK,

A couple more comments based on a quick review of the patch, particularly
the part related to planning:

1) create_skipscan_unique_path has one assert commented out. Either it's
something we want to enforce, or we should remove it.

    /*Assert(distinctPrefixKeys <= list_length(pathnode->path.pathkeys));*/

2) I wonder if the current costing model is overly optimistic. We simply
copy the startup cost from the IndexPath, which seems fine. But for total
cost we do this:

    pathnode->path.total_cost = basepath->startup_cost * numGroups;

which seems a bit too simplistic. The startup cost is pretty much just the
cost to find the first item in the index, but surely we need to do more to
find the next group - we need to do comparisons to skip some of the items,
etc. If we think that's unnecessary, we need to explain it in a comment or
something.

3) I don't think we should make planning dependent on hasDeclareCursor.

4) I'm not quite sure how sensible it is to create a new IndexPath in
create_skipscan_unique_path. On the one hand it works, but I don't think
any other path is constructed like this, so I wonder if we're missing
something. Perhaps it'd be better to just add a new path node on top of
the IndexPath, and then handle this in create_plan. We already do
something similar for Bitmap Index Scans, where we create a different
executor node from an IndexPath depending on the parent node.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
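[To put rough numbers on the concern in point 2, here is a sketch
comparing the quoted formula with a model that also charges for per-group
comparisons. All numbers and both function names are made up for
illustration; this is not planner code.]

```python
CPU_OPERATOR_COST = 0.0025  # illustrative value of the cpu_operator_cost GUC

def patch_total_cost(startup_cost, num_groups):
    # the formula quoted above: total_cost = startup_cost * numGroups
    return startup_cost * num_groups

def refined_total_cost(startup_cost, num_groups, cmps_per_group):
    # hypothetical refinement: additionally charge for the key comparisons
    # needed to skip over the remaining items of each group
    return num_groups * (startup_cost + cmps_per_group * CPU_OPERATOR_COST)

descent_cost = 0.43   # illustrative cost of one index descent
groups = 10000

print(round(patch_total_cost(descent_cost, groups), 2))         # 4300.0
print(round(refined_total_cost(descent_cost, groups, 100), 2))  # 6800.0
```

[With expensive comparisons, e.g. collation-aware text, the per-group term
can dominate, which may be related to the text-only regressions seen in
the benchmarks.]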
> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote:
>
> I've done some testing and benchmarking of the v31 patch, looking for
> regressions, costing issues etc. Essentially, I've run a bunch of SELECT
> DISTINCT queries on data sets of various sizes, numbers of distinct
> values etc. The results are fairly large, so I've uploaded them to github
>
> https://github.com/tvondra/skip-scan-test

Thanks a lot for testing!

> There are a couple of regressions, where the plan with skipscan enabled
> is ~10x slower. But this seems to happen only in high-cardinality cases
> where we misestimate the number of groups. [...]
>
> So in this case the skip scan is ~15x slower than the usual plan (index
> only scan + unique). The reason why this happens is pretty simple - to
> estimate the number of groups we multiply the ndistinct estimates for
> the two columns (which both have n_distinct = 10000), but then we cap
> the estimate to 10% of the table. But when the columns are independent
> with high cardinalities that under-estimates the actual value, making
> the cost for skip scan much lower than it should be.

The current implementation checks if we can find the next value on the
same page, as a shortcut instead of a tree traversal, to improve such
situations, but I can easily imagine that it's still not enough in some
extreme cases.

> I don't think this is an issue the skipscan patch needs to fix, though.
> Firstly, the regressed cases are a tiny minority. Secondly, we already
> have a way to improve the root cause - creating extended stats with
> ndistinct coefficients generally makes the problem go away.

Yes, I agree.

> One interesting observation however is that this regression only
> happened with text columns, not with int or bigint. My assumption is
> that this is due to text comparisons being much more expensive. Not sure
> if there is something we could do to deal with this - reduce the number
> of comparisons or something?

Hm, interesting. I need to check that we do not do any unnecessary
comparisons.

> On Sat, Feb 08, 2020 at 02:11:59PM +0100, Tomas Vondra wrote:
> > Yes, I've mentioned that already in one of the previous emails :) The
> > simplest way I see to achieve what we want is to do something like in
> > the attached modified version with a new hasDeclaredCursor field. It's
> > not a final version though, but posted just for discussion, so feel
> > free to suggest any improvements or alternatives.
>
> IMO the proper fix for this case (moving forward, reading backwards) is
> simply making it work by properly checking deleted tuples etc. Not sure
> why that would be so much more complex (haven't tried implementing it)?

It's probably not that complex by itself, but it requires changing how the
responsibilities are isolated. At the moment the implementation leaves
jumping over the tree fully to _bt_skip, and heap visibility checks only
to IndexOnlyNext. To check deleted tuples properly we need to either
verify the corresponding heap tuple's visibility inside _bt_skip (as I've
mentioned in one of the previous emails, checking if an index tuple is
dead is not enough), or teach the code in IndexOnlyNext to understand that
_bt_skip can lead to returning the same tuple while moving forward &
reading backward. Do you think it still makes sense to go this way?
On Sat, Feb 08, 2020 at 04:24:40PM +0100, Dmitry Dolgov wrote:
>> On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote:
>>
>> There are a couple of regressions, where the plan with skipscan enabled
>> is ~10x slower. But this seems to happen only in high-cardinality cases
>> where we misestimate the number of groups. [...] But when the columns
>> are independent with high cardinalities that under-estimates the actual
>> value, making the cost for skip scan much lower than it should be.
>
>The current implementation checks if we can find the next value on the
>same page, as a shortcut instead of a tree traversal, to improve such
>situations, but I can easily imagine that it's still not enough in some
>extreme cases.

Yeah. I'm not sure there's room for further improvements. The regressed
cases were subject to the 10% cap, and with ndistinct being more than 10%
of the table, we can probably find many distinct keys on each index page -
we know that the values change every ~10 rows.

>> One interesting observation however is that this regression only
>> happened with text columns, not with int or bigint. My assumption is
>> that this is due to text comparisons being much more expensive. Not sure
>> if there is something we could do to deal with this - reduce the number
>> of comparisons or something?
>
>Hm, interesting. I need to check that we do not do any unnecessary
>comparisons.
>
>It's probably not that complex by itself, but it requires changing how
>the responsibilities are isolated. At the moment the implementation
>leaves jumping over the tree fully to _bt_skip, and heap visibility
>checks only to IndexOnlyNext. To check deleted tuples properly we need to
>either verify the corresponding heap tuple's visibility inside _bt_skip
>(as I've mentioned in one of the previous emails, checking if an index
>tuple is dead is not enough), or teach the code in IndexOnlyNext to
>understand that _bt_skip can lead to returning the same tuple while
>moving forward & reading backward. Do you think it still makes sense to
>go this way?

Not sure. I have to think about this first.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Feb 8, 2020 at 10:24 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Feb 08, 2020 at 03:22:17PM +0100, Tomas Vondra wrote:
> > So in this case the skip scan is ~15x slower than the usual plan (index
> > only scan + unique). The reason why this happens is pretty simple - to
> > estimate the number of groups we multiply the ndistinct estimates for
> > the two columns (which both have n_distinct = 10000), but then we cap
> > the estimate to 10% of the table. But when the columns are independent
> > with high cardinalities that under-estimates the actual value, making
> > the cost for skip scan much lower than it should be.
>
> The current implementation checks if we can find the next value on the
> same page, as a shortcut instead of a tree traversal, to improve such
> situations, but I can easily imagine that it's still not enough in some
> extreme cases.

This is almost certainly rehashing already covered ground, but since I
doubt it's been discussed recently, would you be able to summarize that
choice (not to always get the next tuple by scanning from the top of the
tree again) and the performance/complexity tradeoffs?

Thanks,
James
> On Sat, Feb 08, 2020 at 01:31:02PM -0500, James Coleman wrote:
>
> This is almost certainly rehashing already covered ground, but since I
> doubt it's been discussed recently, would you be able to summarize that
> choice (not to always get the next tuple by scanning from the top of the
> tree again) and the performance/complexity tradeoffs?

Yeah, this part of the discussion happened some time ago already. The idea
[1] is to protect ourselves, at least partially, from incorrect ndistinct
estimations. Simply jumping over an index means that even if the next key
we're searching for is on the same page as the previous one, we still end
up doing a search from the root of the tree, which is of course less
efficient than just checking the current page before jumping further. The
performance tradeoff in this case is simple: we make the regular use case
slightly slower, but can perform better in the worst-case scenarios. The
complexity tradeoff was never discussed, but I guess everyone assumed it's
relatively straightforward to check the current page and return if
something was found before jumping.

[1]: https://www.postgresql.org/message-id/CA%2BTgmoY7QTHhzLWZupNSyyqFRBfMgYocg3R-6g%3DDRgT4-KBGqg%40mail.gmail.com
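[The strategy summarized above can be sketched over a plain sorted list,
with fixed-size chunks standing in for B-tree leaf pages and a binary
search standing in for a descent from the root. This is a toy illustration
of the tradeoff, not the nbtree code.]

```python
import bisect

def distinct_keys(keys, page_size=4):
    """Collect distinct values of a sorted list, counting "root descents".

    To find the next distinct key: 1) cheap shortcut - scan the remainder
    of the current "page"; 2) only if it isn't there, search from the top
    (bisect here, a descent from the btree root in the real thing).
    """
    result = []
    root_descents = 0
    i, n = 0, len(keys)
    while i < n:
        cur = keys[i]
        result.append(cur)
        # 1) look for a different key within the current page
        page_end = min((i // page_size + 1) * page_size, n)
        j = next((k for k in range(i + 1, page_end) if keys[k] != cur), None)
        if j is not None:
            i = j
            continue
        # 2) not on this page: jump to the first key greater than cur
        root_descents += 1
        i = bisect.bisect_right(keys, cur)
    return result, root_descents

# Few distinct values: the jumps pay off, only one descent per group.
print(distinct_keys([1] * 10 + [2] * 10 + [3] * 2))  # ([1, 2, 3], 3)
```

[When nearly every value is distinct, as in the regression case, step 1
finds the next key almost every time, so the scan degenerates into an
ordinary in-order walk plus one extra comparison per item, which may be
where the expensive text comparisons hurt.]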
Thank you very much for the benchmarking!

A bit of a different topic from the latest branch..

At Sat, 8 Feb 2020 14:11:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
> >Yes, I've mentioned that already in one of the previous emails :) The
> >simplest way I see to achieve what we want is to do something like in
> >the attached modified version with a new hasDeclaredCursor field. It's
> >not a final version though, but posted just for discussion, so feel
> >free to suggest any improvements or alternatives.
>
> IMO the proper fix for this case (moving forward, reading backwards) is
> simply making it work by properly checking deleted tuples etc. Not sure
> why that would be so much more complex (haven't tried implementing it)?

I also don't think it's that complex by itself. But I suspect that it
might be a bit harder starting from the current shape.

The first attached file (renamed to .txt so as not to confuse the cfbots)
is a small patch that makes sure _bt_readpage is called under the proper
condition as written in its comment, that is, the caller must have pinned
and read-locked so->currPos.buf. This patch reveals many instances of
breakage of that contract. The second is a crude fix for the breakages,
but the result seems far from neat. I think we need a rethink, taking
modification of the support functions into consideration.

> I think making this depend on things like declared cursors etc. is going
> to be tricky, may easily be more complex than checking deleted tuples,
> and the behavior may be quite surprising.

Sure.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

From de129e5a261ed43f002c1684dc9d6575f3880b16 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 6 Feb 2020 14:31:36 +0900
Subject: [PATCH 1/2] debug aid

---
 src/backend/access/nbtree/nbtsearch.c |  1 +
 src/backend/storage/buffer/bufmgr.c   | 13 +++++++++++++
 src/include/storage/bufmgr.h          |  1 +
 3 files changed, 15 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c5f5d228f2..5cd97d8bb5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1785,6 +1785,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	 * used here; this function is what makes it good for currPos.
 	 */
 	Assert(BufferIsValid(so->currPos.buf));
+	Assert(BufferLockAndPinHeldByMe(so->currPos.buf));
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..08a75a6846 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1553,6 +1553,19 @@ ReleaseAndReadBuffer(Buffer buffer,
 	return ReadBuffer(relation, blockNum);
 }
 
+/* tmp function for debugging */
+bool
+BufferLockAndPinHeldByMe(Buffer buffer)
+{
+	BufferDesc *b = GetBufferDescriptor(buffer - 1);
+
+	if (BufferIsPinned(buffer) &&
+		LWLockHeldByMe(BufferDescriptorGetContentLock(b)))
+		return true;
+
+	return false;
+}
+
 /*
  * PinBuffer -- make buffer unavailable for replacement.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..8e5fc639a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,7 @@ extern void MarkBufferDirty(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 								   BlockNumber blockNum);
+extern bool BufferLockAndPinHeldByMe(Buffer buffer);
 
 extern void InitBufferPool(void);
 extern void InitBufferPoolAccess(void);
-- 
2.18.2

From 912bad2ec8c66ccd01cebf1f69233b004c633243 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 6 Feb 2020 19:09:09 +0900
Subject: [PATCH 2/2] crude fix

---
 src/backend/access/nbtree/nbtsearch.c | 43 +++++++++++++++++----------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 5cd97d8bb5..1f18b38ca5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1619,6 +1619,9 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
 	nextOffset = startOffset = ItemPointerGetOffsetNumber(&scan->xs_itup->t_tid);
 
+	if (nextOffset != startOffset)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
 	while (nextOffset == startOffset)
 	{
 		IndexTuple	itup;
@@ -1653,7 +1656,7 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
 			offnum = OffsetNumberPrev(offnum);
 
 			/* Check if _bt_readpage returns already found item */
-			if (!_bt_readpage(scan, indexdir, offnum))
+			if (!_bt_readpage(scan, dir, offnum))
 			{
 				/*
 				 * There's no actually-matching data on this page. Try to
@@ -1668,6 +1671,8 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
 					return false;
 				}
 			}
+			else
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
 
 			currItem = &so->currPos.items[so->currPos.lastItem];
 			itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
@@ -1721,24 +1726,30 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir,
 	}
 
 	/* Now read the data */
-	if (!_bt_readpage(scan, indexdir, offnum))
+	if (!(ScanDirectionIsForward(dir) &&
+		  ScanDirectionIsBackward(indexdir)) ||
+		scanstart)
 	{
-		/*
-		 * There's no actually-matching data on this page. Try to advance to
-		 * the next page. Return false if there's no matching data at all.
-		 */
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-		if (!_bt_steppage(scan, dir))
+		if (!_bt_readpage(scan, dir, offnum))
 		{
-			pfree(so->skipScanKey);
-			so->skipScanKey = NULL;
-			return false;
+			/*
+			 * There's no actually-matching data on this page. Try to advance
+			 * to the next page. Return false if there's no matching data at
+			 * all.
+			 */
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			if (!_bt_steppage(scan, dir))
+			{
+				pfree(so->skipScanKey);
+				so->skipScanKey = NULL;
+				return false;
+			}
+		}
+		else
+		{
+			/* Drop the lock, and maybe the pin, on the current page */
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
-	}
-	else
-	{
-		/* Drop the lock, and maybe the pin, on the current page */
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-	}
 	}
 
 	/* And set IndexTuple */
-- 
2.18.2
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote:
>
> The first attached file (renamed to .txt so as not to confuse the
> cfbots) is a small patch that makes sure _bt_readpage is called under
> the proper condition as written in its comment, that is, the caller must
> have pinned and read-locked so->currPos.buf. This patch reveals many
> instances of breakage of that contract.

Thanks! On top of which patch version can one apply it? I'm asking
because I believe I've addressed similar issues in the last version, and
the last proposed diff (after resolving some conflicts) breaks tests for
me, so I'm not sure if I'm missing something.

At the same time, if you and Tomas strongly agree that it actually makes
sense to make the moving forward/reading backward case work correctly
with dead tuples, I'll take a shot and try to teach the code around
_bt_skip to do what is required for that. I can merge your changes there
and we can see what the result would be.
> On Fri, Feb 14, 2020 at 01:18:20PM +0100, Dmitry Dolgov wrote:
>
> Thanks! On top of which patch version can one apply it? I'm asking
> because I believe I've addressed similar issues in the last version, and
> the last proposed diff (after resolving some conflicts) breaks tests for
> me, so I'm not sure if I'm missing something.
>
> At the same time, if you and Tomas strongly agree that it actually makes
> sense to make the moving forward/reading backward case work correctly
> with dead tuples, I'll take a shot and try to teach the code around
> _bt_skip to do what is required for that. I can merge your changes there
> and we can see what the result would be.

Here is something similar to what I had in mind. In this version of the
patch, IndexOnlyNext now verifies whether, due to visibility checks, we
have returned to the same position as before while reading in the
direction opposite to the advancing direction (similar to what is
implemented inside _bt_skip for the situation when some distinct keys are
eliminated due to scankey conditions). It's actually not as invasive as I
feared, but still pretty hacky. I'm not sure if it's ok to compare the
resulting heaptid in this situation, but all the mentioned tests are
passing. Also, this version doesn't incorporate any of the planner
feedback from Tomas yet; my intention is just to check whether this could
be the right direction.
On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. 
Also, should create_grouping_paths() be getting the same code? Jesper's UniqueKey patch seems to set query_uniquekeys when there's a GROUP BY with no aggregates. So it looks like he has intended that something like:

SELECT x FROM t GROUP BY x;

should work the same way as

SELECT DISTINCT x FROM t;

but the 0002 patch does not make this work. Has that just been overlooked?

There are also some weird-looking assumptions in create_distinct_paths() that an EquivalenceMember can only be a Var. I think you're only saved from crashing there because a ProjectionPath will be created atop the IndexPath to evaluate expressions, in which case you're not seeing the IndexPath. This results in the optimisation not working in cases like:

postgres=# create table t (a int); create index on t ((a+1)); explain select distinct a+1 from t;
CREATE TABLE
CREATE INDEX
                        QUERY PLAN
-----------------------------------------------------------
 HashAggregate  (cost=48.25..50.75 rows=200 width=4)
   Group Key: (a + 1)
   ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Using unique paths as I mentioned above should see that fixed.

David
On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: >On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> Here is something similar to what I had in mind. > >(changing to this email address for future emails) > >Hi, > >I've been looking over v32 of the patch and have a few comments >regarding the planner changes. > >I think the changes in create_distinct_paths() need more work. The >way I think this should work is that create_distinct_paths() gets to >know exactly nothing about what path types support the elimination of >duplicate values. The Path should carry the UniqueKeys so that can be >determined. In create_distinct_paths() you should just be able to make >use of those paths, which should already have been created when >creating index paths for the rel due to PlannerInfo's query_uniquekeys >having been set. > +1 to code this in a generic way, using query_uniquekeys (if possible) >The reason it must be done this way is that when the RelOptInfo that >we're performing the DISTINCT on is a joinrel, then we're not going to >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort >of Join path instead. I understand you're not yet supporting doing >this optimisation when joins are involved, but it should be coded in >such a way that it'll work when we do. (It's probably also a separate >question as to whether we should have this only work when there are no >joins. I don't think I personally object to it for stage 1, but >perhaps someone else might think differently.) > I don't follow. Can you elaborate more? AFAICS skip-scan is essentially a capability of an (index) AM. I don't see how we could ever do that for joinrels? We can do that at the scan level, below a join, but that's what this patch already supports, I think. When you say "supporting this optimisation" with joins, do you mean doing skip-scan for join inputs, or on top of the join? 
>For storing these new paths with UniqueKeys, I'm not sure exactly if >we can just add_path() such paths into the RelOptInfo's pathlist. >What we don't want to do is accidentally make use of paths which >eliminate duplicate values when we don't want that behaviour. If we >did store these paths in RelOptInfo->pathlist then we'd need to go and >modify a bunch of places to ignore such paths. set_cheapest() would >have to do something special for them too, which makes me think >pathlist is the incorrect place. Parallel query added >partial_pathlist, so perhaps we need unique_pathlist to make this >work. > Hmmm, good point. Do we actually produce incorrect plans with the current patch, using skip-scan path when we should not? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 7 Mar 2020 at 03:49, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > >The reason it must be done this way is that when the RelOptInfo that > >we're performing the DISTINCT on is a joinrel, then we're not going to > >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort > >of Join path instead. I understand you're not yet supporting doing > >this optimisation when joins are involved, but it should be coded in > >such a way that it'll work when we do. (It's probably also a separate > >question as to whether we should have this only work when there are no > >joins. I don't think I personally object to it for stage 1, but > >perhaps someone else might think differently.) > > > > I don't follow. Can you elaborate more? > > AFAICS skip-scan is essentially a capability of an (index) AM. I don't > see how we could ever do that for joinrels? We can do that at the scan > level, below a join, but that's what this patch already supports, I > think. When you say "supporting this optimisation" with joins, do you > mean doing skip-scan for join inputs, or on top of the join? The skip index scan Path would still be created at the base rel level, but the join path on the join relation would have one of the sub-paths of the join as an index skip scan. An example query that could make use of this is: SELECT * FROM some_table WHERE a IN(SELECT indexed_col_with_few_distinct_values FROM big_table); In this case, we might want to create a Skip Scan path on big_table using the index on the "indexed_col_with_few_distinct_values", then Hash Join to "some_table". That class of query is likely stage 2 or 3 of this work, but we need to lay foundations that'll support it. As for not having IndexScan paths in joinrels. Yes, of course, but that's exactly why create_distinct_paths() cannot work the way the patch currently codes it. 
The patch does:

+ /*
+  * XXX: In case of index scan quals evaluation happens
+  * after ExecScanFetch, which means skip results could be
+  * filtered out. Consider the following query:
+  *
+  * select distinct (a, b) a, b, c from t where c < 100;
+  *
+  * Skip scan returns one tuple for each distinct set of (a,
+  * b) with an arbitrary one of c, so if the chosen c does
+  * not match the qual and there is any c that matches the
+  * qual, we miss that tuple.
+  */
+ if (path->pathtype == T_IndexScan &&

which will never work for join relations, since they'll only have paths for Loop/Merge/Hash type joins.

The key here is to determine which skip scan paths we should create when we're building the normal index paths, then see if we can make use of those when planning joins. Subsequently, we'll then see if we can make use of the resulting join paths during create_distinct_paths(). Doing it this way will allow us to use skip scans in queries such as:

SELECT DISTINCT t1.z FROM t1 INNER JOIN t2 ON t1.a = t2.unique_col;

We'll first create the skip scan paths on t1, then when creating the join paths we'll create additional join paths which use the skip scan path. Because t2.unique_col is unique, each t1 row will have at most 1 join partner, so the join path will have the same unique keys as the skip scan path. That'll allow us to use the join path which has the skip scan on whichever side of the join the t1 relation ends up.

All create_distinct_paths() should be doing is looking for paths that are already implicitly unique on the distinct clause and considering using those in a cost-based way. It shouldn't be making such paths itself.

> > For storing these new paths with UniqueKeys, I'm not sure exactly if
> > we can just add_path() such paths into the RelOptInfo's pathlist.
> > What we don't want to do is accidentally make use of paths which
> > eliminate duplicate values when we don't want that behaviour.
> > If we did store these paths in RelOptInfo->pathlist then we'd need
> > to go and modify a bunch of places to ignore such paths.
> > set_cheapest() would have to do something special for them too, which
> > makes me think pathlist is the incorrect place. Parallel query added
> > partial_pathlist, so perhaps we need unique_pathlist to make this
> > work.
>
> Hmmm, good point. Do we actually produce incorrect plans with the
> current patch, using skip-scan path when we should not?

I don't think so. The patch is only creating skip scan paths on the base rel when we discover it's valid to do so. That's not the way it should work though. How the patch currently works would be similar to initially only creating a SeqScan path for a query such as:

SELECT * FROM tab ORDER BY a;

but then, during create_ordered_paths(), going and creating some IndexPath to scan the btree index on tab.a because we suddenly realise that it'll be good to use that for the ORDER BY. The planner does not work that way. We always create all the paths that we think will be useful during set_base_rel_pathlists(). We then make use of only existing paths in the upper planner. See build_index_paths() in particular:

    /* see if we can generate ordering operators for query_pathkeys */
    match_pathkeys_to_index(index, root->query_pathkeys,
                            &orderbyclauses,
                            &orderbyclausecols);

We'll need something similar to that but for the query_uniquekeys, and to ensure we build the skip scan paths when we think they'll be useful, doing so during the call to set_base_rel_pathlists(). Later, in stage 2 or 3, we can go build skip scan paths when there are semi/anti joins that could make use of them. Making that work will just be some plumbing work in build_index_paths() and making use of those paths during add_paths_to_joinrel().

David
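To illustrate the kind of check being discussed, here is a hedged sketch of what an analogue of match_pathkeys_to_index() for unique keys might verify: a btree can only skip on a leading prefix, so the query's unique keys must cover a prefix of the index columns. The function name and data shapes below are made up for illustration; this is not PostgreSQL code.

```python
# Hypothetical helper; PostgreSQL's real check would live in indxpath.c.
def index_satisfies_uniquekeys(index_cols, query_uniquekeys):
    """True if the query's unique keys cover a leading prefix of the
    index columns, the precondition for a btree prefix skip."""
    n = len(query_uniquekeys)
    return set(index_cols[:n]) == set(query_uniquekeys)

ok = index_satisfies_uniquekeys(['a', 'b', 'c'], ['a', 'b'])   # a prefix
bad = index_satisfies_uniquekeys(['a', 'b', 'c'], ['b'])       # not a prefix
```

For a DISTINCT over (a, b), an index on (a, b, c) qualifies; an index whose leading column is not among the unique keys does not.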
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:
>
> I've been looking over v32 of the patch and have a few comments
> regarding the planner changes.

Thanks for the comments!

> I think the changes in create_distinct_paths() need more work. The
> way I think this should work is that create_distinct_paths() gets to
> know exactly nothing about what path types support the elimination of
> duplicate values. The Path should carry the UniqueKeys so that can be
> determined. In create_distinct_paths() you should just be able to make
> use of those paths, which should already have been created when
> creating index paths for the rel due to PlannerInfo's query_uniquekeys
> having been set.

Just for me to clarify: the idea is to "move" information about what path types support skipping into UniqueKeys (derived from PlannerInfo's query_uniquekeys), but other checks (e.g. whether the index am supports that) are still performed in create_distinct_paths?

> Also, should create_grouping_paths() be getting the same code?
> Jesper's UniqueKey patch seems to set query_uniquekeys when there's a
> GROUP BY with no aggregates. So it looks like he has intended that
> something like:
>
> SELECT x FROM t GROUP BY x;
>
> should work the same way as
>
> SELECT DISTINCT x FROM t;
>
> but the 0002 patch does not make this work. Has that just been overlooked?

I believe it wasn't overlooked in the 0002 patch, but rather added just in case in 0001. I guess there are no theoretical problems in implementing it, but since we wanted to keep the scope of the patch under control and concentrate on the existing functionality, it probably makes sense to plan it as one of the next steps?

> There's also some weird looking assumptions that an EquivalenceMember
> can only be a Var in create_distinct_paths().
> This results in the optimisation not working in cases like:
>
> postgres=# create table t (a int); create index on t ((a+1)); explain
> select distinct a+1 from t;
> CREATE TABLE
> CREATE INDEX
>                         QUERY PLAN
> -----------------------------------------------------------
>  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
>    Group Key: (a + 1)
>    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Yes, I need to fix it.

> Using unique paths as I mentioned above should see that fixed.

I'm a bit confused by this statement; how exactly would unique paths fix this?
On Mon, 9 Mar 2020 at 03:21, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > I've been looking over v32 of the patch and have a few comments > > regarding the planner changes. > > Thanks for the commentaries! > > > I think the changes in create_distinct_paths() need more work. The > > way I think this should work is that create_distinct_paths() gets to > > know exactly nothing about what path types support the elimination of > > duplicate values. The Path should carry the UniqueKeys so that can be > > determined. In create_distinct_paths() you should just be able to make > > use of those paths, which should already have been created when > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > having been set. > > Just for me to clarify. The idea is to "move" information about what > path types support skipping into UniqueKeys (derived from PlannerInfo's > query_uniquekeys), but other checks (e.g. if index am supports that) > still perform in create_distinct_paths? create_distinct_paths() shouldn't know any details specific to the pathtype that it's using or considering using. All the details should just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be no IsA(path, ...). Please have a look over the details in my reply to Tomas. I hope that reply has enough information in it, but please reply there if I've missed something. > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. 
> > This results in the optimisation not working in cases like:
> >
> > postgres=# create table t (a int); create index on t ((a+1)); explain
> > select distinct a+1 from t;
> > CREATE TABLE
> > CREATE INDEX
> >                         QUERY PLAN
> > -----------------------------------------------------------
> >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> >    Group Key: (a + 1)
> >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)
>
> Yes, I need to fix it.
>
> > Using unique paths as I mentioned above should see that fixed.
>
> I'm a bit confused about this statement, how exactly unique paths should
> fix this?

The path's uniquekeys would mention that it's unique on (a+1). You'd compare the uniquekeys of the path to the DISTINCT clause and see that the uniquekeys are a subset of the DISTINCT clause, and therefore the DISTINCT is a no-op. If that uniquekey path is cheaper than the cheapest_total_path + <cost of uniquification method>, then you should pick the unique path, otherwise use the cheapest_total_path and uniquify that.

I think the UniqueKeys may need to be changed from using EquivalenceClasses to use Exprs instead.
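The cost-based choice described above can be sketched in a few lines. This is an illustrative model only; the Path shape, the cost numbers, and cheapest_distinct_path are hypothetical stand-ins, not planner code.

```python
from dataclasses import dataclass

@dataclass
class Path:
    total_cost: float
    uniquekeys: frozenset = frozenset()  # exprs this path is known unique on

def cheapest_distinct_path(paths, distinct_exprs, uniquify_cost):
    """Pick the cheapest way to produce DISTINCT output over distinct_exprs."""
    distinct_exprs = frozenset(distinct_exprs)
    best_cost, best_path = None, None
    for p in paths:
        if p.uniquekeys and p.uniquekeys <= distinct_exprs:
            cost = p.total_cost  # uniquekeys subset of DISTINCT: no-op
        else:
            cost = p.total_cost + uniquify_cost  # needs Unique/HashAggregate
        if best_cost is None or cost < best_cost:
            best_cost, best_path = cost, p
    return best_cost, best_path

paths = [Path(total_cost=100.0),                             # plain scan
         Path(total_cost=5.0, uniquekeys=frozenset({'a'}))]  # skip scan
cost, chosen = cheapest_distinct_path(paths, {'a', 'b'}, uniquify_cost=50.0)
```

Here the skip scan path wins outright because being unique on a alone already implies uniqueness on (a, b), while the plain scan has to pay the uniquification cost on top.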
> On Mon, Mar 09, 2020 at 10:27:26AM +1300, David Rowley wrote: > > > > I think the changes in create_distinct_paths() need more work. The > > > way I think this should work is that create_distinct_paths() gets to > > > know exactly nothing about what path types support the elimination of > > > duplicate values. The Path should carry the UniqueKeys so that can be > > > determined. In create_distinct_paths() you should just be able to make > > > use of those paths, which should already have been created when > > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > > having been set. > > > > Just for me to clarify. The idea is to "move" information about what > > path types support skipping into UniqueKeys (derived from PlannerInfo's > > query_uniquekeys), but other checks (e.g. if index am supports that) > > still perform in create_distinct_paths? > > create_distinct_paths() shouldn't know any details specific to the > pathtype that it's using or considering using. All the details should > just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be > no IsA(path, ...). Please have a look over the details in my reply to > Tomas. I hope that reply has enough information in it, but please > reply there if I've missed something. Yes, I've read this reply, just wanted to ask here, since I had other questions as well. Speaking of which: > > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > > There's also some weird looking assumptions that an EquivalenceMember > > > can only be a Var in create_distinct_paths(). I think you're only > > > saved from crashing there because a ProjectionPath will be created > > > atop of the IndexPath to evaluate expressions, in which case you're > > > not seeing the IndexPath. 
I'm probably missing something, so to eliminate any misunderstanding from my side:

> > > This results in the optimisation not working in cases like:
> > >
> > > postgres=# create table t (a int); create index on t ((a+1)); explain
> > > select distinct a+1 from t;
> > > CREATE TABLE
> > > CREATE INDEX
> > >                         QUERY PLAN
> > > -----------------------------------------------------------
> > >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> > >    Group Key: (a + 1)
> > >    ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

In this particular example skipping is not applied because, as you've mentioned, we're dealing with a ProjectionPath (not an IndexScan / IndexOnlyScan). Which means we're not even reaching the code with EquivalenceMember, so I'm still not sure how they are connected. Assuming we implement it in a way where we do not know what kind of path type we have in create_distinct_paths, then it can also work for a ProjectionPath or anything else (if UniqueKeys are present). But then EquivalenceMembers are still used only to figure out the correct distinctPrefixKeys and do not affect whether or not skipping is applied. What am I missing?
On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? I'm not sure I fully understand the question correctly, but let me explain further. In the 0001 patch, standard_qp_callback sets the query_uniquekeys depending on the DISTINCT / GROUP BY clause. When building index paths in build_index_paths(), the 0002 patch should be looking at the root->query_uniquekeys to see if it can build any index paths that suit those keys. Such paths should be tagged with the uniquekeys they satisfy, basically, exactly the same as how pathkeys work. Many create_*_path functions will need to be modified to carry forward their uniquekeys. For example, create_projection_path(), create_limit_path() don't do anything which would cause the created path to violate the unique keys. This way when you get down to create_distinct_paths(), paths other than IndexPath may have uniquekeys. You'll be able to check which existing paths satisfy the unique keys required by the DISTINCT / GROUP BY and select those paths instead of having to create any HashAggregate / Unique paths. Does that answer the question?
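As a rough model of the uniquekeys propagation described above (all names hypothetical, not the actual create_*_path signatures): path constructors that cannot introduce duplicate rows would simply copy the child's uniquekeys upward, exactly as pathkeys are carried forward today.

```python
from dataclasses import dataclass

@dataclass
class Path:
    kind: str
    uniquekeys: frozenset = frozenset()

def create_projection_path(subpath):
    # a projection neither adds nor removes rows, so distinctness survives
    return Path('Projection', uniquekeys=subpath.uniquekeys)

def create_limit_path(subpath):
    # LIMIT only discards rows; it cannot introduce duplicates
    return Path('Limit', uniquekeys=subpath.uniquekeys)

skip = Path('IndexSkipScan', uniquekeys=frozenset({'a+1'}))
top = create_limit_path(create_projection_path(skip))
```

By the time create_distinct_paths() sees `top`, it no longer matters that the skip scan is buried two levels down; the uniquekeys alone say the DISTINCT is already satisfied.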
On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Assuming we'll implement it in a way that we do not know about what kind
> of path type is that in create_distinct_path, then it can also work for
> ProjectionPath or anything else (if UniqueKeys are present). But then
> still EquivalenceMember are used only to figure out correct
> distinctPrefixKeys and do not affect whether or not skipping is applied.
> What do I miss?

Part of the puzzle seems to me to be this part of the response:

> I think the UniqueKeys may need to be changed from using
> EquivalenceClasses to use Exprs instead.

But I can't say I'm being overly helpful by pointing that out, since I don't have my head in the code enough to understand how you'd accomplish that :)

James
> On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote:
> On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Assuming we'll implement it in a way that we do not know about what kind
> > of path type is that in create_distinct_path, then it can also work for
> > ProjectionPath or anything else (if UniqueKeys are present). But then
> > still EquivalenceMember are used only to figure out correct
> > distinctPrefixKeys and do not affect whether or not skipping is applied.
> > What do I miss?
>
> I'm not sure I fully understand the question correctly, but let me
> explain further.
>
> In the 0001 patch, standard_qp_callback sets the query_uniquekeys
> depending on the DISTINCT / GROUP BY clause. When building index
> paths in build_index_paths(), the 0002 patch should be looking at the
> root->query_uniquekeys to see if it can build any index paths that
> suit those keys. Such paths should be tagged with the uniquekeys they
> satisfy, basically, exactly the same as how pathkeys work. Many
> create_*_path functions will need to be modified to carry forward
> their uniquekeys. For example, create_projection_path(),
> create_limit_path() don't do anything which would cause the created
> path to violate the unique keys. This way when you get down to
> create_distinct_paths(), paths other than IndexPath may have
> uniquekeys. You'll be able to check which existing paths satisfy the
> unique keys required by the DISTINCT / GROUP BY and select those paths
> instead of having to create any HashAggregate / Unique paths.
>
> Does that answer the question?

Hmm... I'm afraid not; this was already clear. But it looks like I've misinterpreted one part:

> There's also some weird looking assumptions that an EquivalenceMember
> can only be a Var in create_distinct_paths().
> I think you're only saved from crashing there because a ProjectionPath
> will be created atop of the IndexPath to evaluate expressions, in which
> case you're not seeing the IndexPath. This results in the optimisation
> not working in cases like:

I've read it as "an assumption that an EquivalenceMember can only be a Var" results in "the optimisation not working in cases like this". But you meant that ignoring a ProjectionPath with an IndexPath inside results in this optimisation not working, right? If so, then everything is clear, and my apologies; maybe I need to finally fix my sleep schedule :)
On Wed, 11 Mar 2020 at 01:38, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. This results in the optimisation not > > working in cases like: > > I've read it as "an assumption that an EquivalenceMember can only be a > Var" results in "the optimisation not working in cases like this". But > you've meant that ignoring a ProjectionPath with an IndexPath inside > results in this optimisation not working, right? If so, then everything > is clear, and my apologies, maybe I need to finally fix my sleep > schedule :) Yes, I was complaining that a ProjectionPath breaks the optimisation and I don't believe there's any reason that it should. I believe the way to make that work correctly requires paying attention to the Path's uniquekeys rather than what type of path it is.
On Wed, 11 Mar 2020 at 16:44, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> I think the UniqueKeys may need to be changed from using
>> EquivalenceClasses to use Exprs instead.
>
> When I try to understand why UniqueKeys needs EquivalenceClasses, I
> see your comments here. I feel that FuncExpr can't be used as a
> UniquePath even if we can create a unique index on f(a) and f->strict ==
> true. The reason is that even if we know a is not null and f->strict ==
> true, it is still possible that f(a) == null, and a unique index allows
> more than 1 null value. So shall we move further to use varattno
> instead of Expr? If so, we can also use a list of Bitmapsets to
> represent multiple unique paths of a single RelOptInfo.

We do need some method to determine if NULL values are possible. At the base relation level that can probably be done by checking NOT NULL constraints and strict base quals. At higher levels, we can use strict join quals as proofs.

As for a Bitmapset of varattnos, that would certainly work well at the base relation level when there are no unique expression indexes, but it's not so simple with join relations, where the varattnos only mean something when you know which base relation they come from. I'm not saying that Lists of Exprs are ideal, but I think trying to optimise some code that does not yet exist is premature.

There was some other talk in [1] about how we might make checking whether a List contains a given Node more efficient. That could be advantageous in a few places in the query planner, and it might be useful for this too.

[1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com
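A toy illustration of the join-level point above, with purely hypothetical structures (nothing like PostgreSQL's actual Bitmapset): a bare set of attribute numbers loses its meaning once two base relations meet in a join, while relation-qualified keys stay distinguishable.

```python
# "attribute 1 is unique" expressed as a bare attno set, per relation
attnos_t1 = frozenset({1})
attnos_t2 = frozenset({1})
# at the join level the two sets are indistinguishable
ambiguous = (attnos_t1 == attnos_t2)

# the same fact with the owning relation attached
keys_t1 = frozenset({('t1', 1)})
keys_t2 = frozenset({('t2', 1)})
# relation-qualified keys still tell the two apart
unambiguous = (keys_t1 != keys_t2)
```

This is the essence of why Exprs (or anything carrying the relation) work for joinrels where plain varattno bitmaps do not.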
Hello hackers,

Recently I've put some effort into extending the functionality of this patch. So far, we've been trying to keep the scope of this patch relatively small, covering DISTINCT clauses only. The advantage of this approach was that it keeps the impact on the indexam api to a minimum. However, given the problems we've been facing in getting the implementation to work correctly in all cases, I started wondering if this implementation was the right direction to go in. My main worry is that the current indexam api for skipping is not suited to other future use cases of skipping, but also that we're already struggling with it now to get it to work correctly in all edge cases.

In the approach taken so far, the amskip function is defined taking two ScanDirection parameters. The function amgettuple is left unchanged. However, I think we need amgettuple to take two ScanDirection parameters as well (or to create a separate function amgetskiptuple). This patch proposes that. Currently, I've just added 'skip' functions to the indexam api for beginscan and gettuple. Maybe it'd be better to just modify the existing functions to take an extra parameter instead. Any thoughts on this?

The result is a patch that can apply skipping in many more cases than previous patches. For example: filtering on columns that are not part of the index, properly handling visibility checks without moving these into the nbtree code, and skipping not only on the prefix but also on extra conditions that become available (eg. prefix a=1 and we happen to have a WHERE clause with b=200, which we can now use to skip all the way to a=1 AND b=200). There's a fair amount of changes in the nbtree code to support this.

Patch 0001 is Jesper's unique keys patch.
Patch 0002 modifies executor-level code to support skip scans and enables it for DISTINCT queries.
Patch 0003 just provides a very basic planner hack that enables the skip scan for practically all index scans (index, index only and bitmap index).
This is not what we would normally want, but this way I could easily test the skip scan code. It's so large because I modified all the test cases' expected results, which now include an extra line 'Skip scan: All'. The actual code changed is only a few lines in this patch.

The planner part of the code still needs work. The planner code in this patch is similar to the previous patch. David's comments about projection paths haven't been addressed yet. Also, there's no proper way of hooking up the index scan for regular (non-DISTINCT) queries yet. That's why I hacked up patch 0003 just to test stuff.

I'd welcome any help on these patches. If someone with more planner knowledge than me is willing to do part of the planner code, please feel free to do so. I believe supporting this will speed up a large number of queries for all kinds of users. It can be a really powerful feature.

Tomas, would you be willing to repeat the performance tests you did earlier? I believe this version will perform better than the previous patch for the cases where you noticed the 10-20x slow-down. There will obviously still be a performance penalty for cases where the planner picks a skip scan that is not well suited, but I think it'll be smaller.
-Floris

-----

To give a few simple examples:

Initialization:

-- table t1 has 100 unique values for a
-- and 10000 b values for each a
-- very suitable for skip scan
create table t1 as
  select a, b, b % 5 as c, random() as d
  from generate_series(1, 100) a, generate_series(1, 10000) b;
create index on t1 (a, b, c);

-- table t2 has 10000 unique values for a
-- and 100 b values for each a
-- this is not very suitable for skip scan
-- because the next matching value is always either
-- on the current page or on the next page
create table t2 as
  select a, b, b % 5 as c, random() as d
  from generate_series(1, 10000) a, generate_series(1, 100) b;
create index on t2 (a, b, c);

analyze t1;
analyze t2;

-- First 'Execution Time' line is this patched version (0001+0002+0003)
-- (without including 0003, the non-DISTINCT queries would be equal to master)
-- Second 'Execution Time' line is master
-- Third 'Execution Time' line is the previous skip scan patch version
-- Just ran a couple of times to give an indication
-- of the order of magnitude, not a full benchmark.
select distinct on (a) * from t1;
Execution Time: 1.407 ms   (current patch)
Execution Time: 480.704 ms (master)
Execution Time: 1.711 ms   (previous patch)

select distinct on (a) * from t1 where b > 50;
Execution Time: 1.432 ms
Execution Time: 481.530 ms
Execution Time: 481.206 ms

select distinct on (a) * from t1 where b > 9990;
Execution Time: 1.074 ms
Execution Time: 33.937 ms
Execution Time: 33.115 ms

select distinct on (a) * from t1 where d > 0.5;
Execution Time: 0.811 ms
Execution Time: 446.549 ms
Execution Time: 436.091 ms

select * from t1 where b=50;
Execution Time: 1.111 ms
Execution Time: 33.242 ms
Execution Time: 36.555 ms

select * from t1 where b between 50 and 75 and d > 0.5;
Execution Time: 2.370 ms
Execution Time: 60.744 ms
Execution Time: 62.820 ms

select * from t1 where b in (100, 200);
Execution Time: 2.464 ms
Execution Time: 252.224 ms
Execution Time: 244.872 ms

select * from t1 where b in (select distinct a from t1);
Execution Time: 91.000 ms
Execution Time: 842.969 ms
Execution Time: 386.871 ms

select distinct on (a) * from t2;
Execution Time: 47.155 ms
Execution Time: 714.102 ms
Execution Time: 56.327 ms

select distinct on (a) * from t2 where b > 5;
Execution Time: 60.100 ms
Execution Time: 709.960 ms
Execution Time: 727.949 ms

select distinct on (a) * from t2 where b > 95;
Execution Time: 55.420 ms
Execution Time: 71.007 ms
Execution Time: 69.229 ms

select distinct on (a) * from t2 where d > 0.5;
Execution Time: 49.254 ms
Execution Time: 719.820 ms
Execution Time: 705.991 ms

-- slight performance degradation here compared to a regular index scan
-- due to unfavorable data distribution
select * from t2 where b=50;
Execution Time: 47.603 ms
Execution Time: 37.327 ms
Execution Time: 40.448 ms

select * from t2 where b between 50 and 75 and d > 0.5;
Execution Time: 244.546 ms
Execution Time: 228.579 ms
Execution Time: 227.541 ms

select * from t2 where b in (100, 200);
Execution Time: 64.021 ms
Execution Time: 242.905 ms
Execution Time: 258.864 ms

select * from t2 where b in (select distinct a from t2);
Execution Time: 758.350 ms
Execution Time: 1271.230 ms
Execution Time: 761.311 ms

I wrote a few things about the method here as well: https://github.com/fvannee/postgres/wiki/Index-Skip-Scan
The code can be found on GitHub there as well, in the branch 'skip-scan'.
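To make the t1-vs-t2 intuition concrete, here is a toy Python model of a skip scan over a sorted index. This is not PostgreSQL code; the function name and the (scaled-down) data sizes are made up for illustration:

```python
import bisect

def skip_scan_prefixes(keys):
    """Toy skip scan over a sorted list of (a, b) tuples standing in for a
    btree index on (a, b): after emitting a prefix value, binary-search
    ("skip") to the first key with a larger prefix instead of stepping
    over every tuple with the same prefix."""
    result, searches, pos = [], 0, 0
    while pos < len(keys):
        a = keys[pos][0]
        result.append(a)
        # first key strictly greater than any (a, b): the next distinct prefix
        pos = bisect.bisect_right(keys, (a, float("inf")))
        searches += 1
    return result, searches

# scaled-down t1: few distinct prefixes, many rows per prefix
t1 = [(a, b) for a in range(100) for b in range(1000)]
# scaled-down t2: many distinct prefixes, few rows per prefix
t2 = [(a, b) for a in range(1000) for b in range(100)]

print(skip_scan_prefixes(t1)[1])  # 100 searches for 100,000 tuples
print(skip_scan_prefixes(t2)[1])  # 1000 searches for the same row count
```

In this model the cost of a skip scan is one search per distinct prefix, so it wins big on t1-shaped data and approaches a plain scan on t2-shaped data, which matches the pattern in the timings above.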
Attachment
It seems that the documentation build was broken. I've fixed it in the attached patch. I'm unsure which version number to give this patch (to continue with numbers from previous skip scan patches, or to start numbering from scratch again). It's a rather big change, so one could argue it's mostly a separate patch. I guess it mostly depends on how close the original versions were to be committable. Thoughts?

-Floris
Attachment
Hi Floris,

On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> create index on t1 (a,b,c);
> select * from t1 where b in (100, 200);
> Execution Time: 2.464 ms
> Execution Time: 252.224 ms
> Execution Time: 244.872 ms

Wow. This is very cool work and I'm sure it will become a major headline feature of PG14 if the requisite planner brains can be sorted out.

On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> I'm unsure which version number to give this patch (to continue with numbers from previous skip scan patches, or to start numbering from scratch again). It's a rather big change, so one could argue it's mostly a separate patch. I guess it mostly depends on how close the original versions were to be committable. Thoughts?

I don't know, but from the sidelines, it'd be nice to see the unique path part go into PG13, where IIUC it can power the "useless unique removal" patch.
> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
>
> Yes, I was complaining that a ProjectionPath breaks the optimisation
> and I don't believe there's any reason that it should.
>
> I believe the way to make that work correctly requires paying
> attention to the Path's uniquekeys rather than what type of path it
> is.

Thanks for the suggestion. As a result of the discussion I've modified the patch; does it look similar to what you had in mind?

In this version, if all conditions are met and there are corresponding unique keys, a new index skip scan path will be added to unique_pathlists. In case the requested distinct clauses match the unique keys, create_distinct_paths can choose this path without needing to know what kind of path it is. Also, unique_keys are passed through ProjectionPath, so the optimization for the example mentioned earlier in this thread should now work (I've added one test for that).

I haven't changed anything about the UniqueKey structure itself (one of the suggestions was about Expr instead of EquivalenceClass), but I believe we need to figure out how the two existing implementations of this idea (in this patch and in [1]) can be connected.

[1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
Attachment
> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote:
>
> There was a dedicated thread [1] where David explained his idea in
> detail, and you can also check the messages around that message for
> the context. Hope it helps.

Thanks for pointing out this thread! Somehow I've missed it, and now it looks like we need to make some efforts to align the patches for index skip scan and distinctClause elimination.
On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
> >
> > Yes, I was complaining that a ProjectionPath breaks the optimisation
> > and I don't believe there's any reason that it should.
> >
> > I believe the way to make that work correctly requires paying
> > attention to the Path's uniquekeys rather than what type of path it
> > is.
>
> Thanks for the suggestion. As a result of the discussion I've modified
> the patch, does it look similar to what you had in mind?
>
> In this version if all conditions are met and there are corresponding
> unique keys, a new index skip scan path will be added to
> unique_pathlists. In case the requested distinct clauses match the
> unique keys, create_distinct_paths can choose this path without needing
> to know what kind of path it is. Also unique_keys are passed through
> ProjectionPath, so optimization for the example mentioned in this thread
> before now should work (I've added one test for that).
>
> I haven't changed anything about UniqueKey structure itself (one of the
> suggestions was about Expr instead of EquivalenceClass), but I believe
> we need anyway to figure out how the two existing implementations (in this
> patch and from [1]) of this idea can be connected.
> > [1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

---
 src/backend/nodes/outfuncs.c          | 14 ++++++
 src/backend/nodes/print.c             | 39 +++++++++++++++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41 ++++++++++++++++
 src/backend/optimizer/path/pathkeys.c | 71 ++++++++++++++++++++++-----
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +++++++++++++-
 src/backend/optimizer/util/pathnode.c | 46 +++++++++++++----
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++++++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +++++
 15 files changed, 272 insertions(+), 23 deletions(-)

Seems like you forgot to add the uniquekey.c file in the v33-0001-Unique-key.patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote: > > Seems like you forgot to add the uniquekey.c file in the > v33-0001-Unique-key.patch. Oh, you're right, thanks. Here is the corrected patch.
Attachment
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
> >
> > Seems like you forgot to add the uniquekey.c file in the
> > v33-0001-Unique-key.patch.
>
> Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how DISTINCT will work with the "skip scan" if we have some filter? I mean, every time we select the unique row based on the prefix key, that row might get rejected by an external filter, right? So I tried an example to check this.

postgres[50006]=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Indexes:
    "idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# set enable_indexskipscan = off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan = on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan. This should behave like we have a filter on a non-index field.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
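The wrong result above can be reproduced outside PostgreSQL with a toy Python model of the two evaluation orders (hypothetical function names, not the patch's code): skipping to one tuple per prefix before the filter runs, versus filtering first as DISTINCT semantics require:

```python
def skip_scan_distinct(rows, pred):
    """Model of the buggy plan: jump to the FIRST tuple of each distinct
    prefix, then apply the filter only to that one tuple."""
    out, seen = [], set()
    for a, b in rows:          # rows are sorted, like an index on (a, b)
        if a in seen:
            continue           # "skip" the rest of this prefix unexamined
        seen.add(a)
        if pred(b):
            out.append(a)
    return out

def filter_then_distinct(rows, pred):
    """Correct semantics: apply the filter to every row, then dedupe."""
    return sorted({a for a, b in rows if pred(b)})

# same shape as Dilip's example: a in (1, 2), b in 1..200, filter b % 100 = 0
rows = sorted((a, b) for a in (1, 2) for b in range(1, 201))
pred = lambda b: b % 100 == 0

print(skip_scan_distinct(rows, pred))    # [] -- rows skipped before the filter sees them
print(filter_then_distinct(rows, pred))  # [1, 2]
```

The first tuple of each prefix has b = 1, which fails the filter, and the skip discards the qualifying b = 100 and b = 200 tuples without ever evaluating them; that is exactly the 0-rows-vs-2-rows discrepancy in the psql session above.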
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
>
> I was just wondering how DISTINCT will work with the "skip scan"
> if we have some filter? I mean every time we select the unique row
> based on the prefix key and that might get rejected by an external
> filter right?

Not exactly. In the case of an index-only scan, we skip to the first unique position, and then use already existing functionality (_bt_readpage with stepping to the next pages) to filter out those records that do not pass the condition. There are even a couple of tests in the patch for this. In the case of an index scan, when there are some conditions, the current implementation does not consider skipping.

> So I tried an example to check this.

Can you tell on which version of the patch you were testing?
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how DISTINCT will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of an index-only scan, we skip to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree, but that will work only if we have a valid index clause; the "b%100=0" condition will not create an index clause, right? However, if we change the query to select distinct(a) from t where b=100, then it works fine, because this condition will create an index clause.

> There are even a couple of tests
> in the patch for this. In the case of an index scan, when there are some
> conditions, the current implementation does not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > >
> > > I was just wondering how DISTINCT will work with the "skip scan"
> > > if we have some filter? I mean every time we select the unique row
> > > based on the prefix key and that might get rejected by an external
> > > filter right?

Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan, for example, that only columns that exist in the index are used. However, this check is not sufficient, as your example shows. There are a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b != 0 is another one, I think). This leads to incorrect query results.

This patch would need some logic in the planner to never choose the skip scan in these cases. A better long-term solution is to adapt the rest of the executor to work correctly in the presence of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case).

In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor.

-Floris
On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > > >
> > > > I was just wondering how DISTINCT will work with the "skip scan"
> > > > if we have some filter? I mean every time we select the unique row
> > > > based on the prefix key and that might get rejected by an external
> > > > filter right?
>
> Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan, for example, that only columns that exist in the index are used. However, this check is not sufficient, as your example shows. There are a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b != 0 is another one, I think). This leads to incorrect query results.

Right

> This patch would need some logic in the planner to never choose the skip scan in these cases. A better long-term solution is to adapt the rest of the executor to work correctly in the presence of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case).

I agree

> In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor.

Okay, so I think we can merge those fixes into Dmitry's patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> >
> > There's a number of ways we can force a 'filter' rather than an
> > 'index condition'.

Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I would appreciate it if in the future you could make it more visible that the changes you suggest contain some fixes. E.g. it wasn't clear to me from your previous email that that's the case, and it doesn't make sense to pull in different directions when we're trying to achieve the same goal :)

> > In the patch I posted a week ago these cases are all handled
> > correctly, as it introduces this extra logic in the Executor.
>
> Okay, so I think we can merge those fixes into Dmitry's patch set.

I'll definitely take a look at the suggested changes in the filtering part.
> Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I
> would appreciate it if in the future you could make it more visible that
> the changes you suggest contain some fixes. E.g. it wasn't clear to me
> from your previous email that that's the case, and it doesn't make sense
> to pull in different directions when we're trying to achieve the same goal :)

I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.

> > > In the patch I posted a week ago these cases are all handled
> > > correctly, as it introduces this extra logic in the Executor.
> >
> > Okay, so I think we can merge those fixes into Dmitry's patch set.
>
> I'll definitely take a look at the suggested changes in the filtering part.

It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code.

I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term, I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge of the committer process is limited though. That's why I've just posted mine so far, in the hope of collecting some feedback, also on how we should continue with the process.

-Floris
> On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
>
> > > Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I
> > > would appreciate it if in the future you could make it more visible that
> > > the changes you suggest contain some fixes. E.g. it wasn't clear to me
> > > from your previous email that that's the case, and it doesn't make sense
> > > to pull in different directions when we're trying to achieve the same goal :)
>
> I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.

Oh, then fortunately I've got a wrong impression, sorry and thanks for the clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, so I think we can merge those fixes into Dmitry's patch set.
> >
> > I'll definitely take a look at the suggested changes in the filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this situation, which, if I'm not mistaken, could be solved by inspecting qual conditions to figure out whether they're covered by the index - something like in the attachments (this implementation is actually too restrictive in this sense and will not allow e.g. expressions, that's why I haven't bumped the patch set version for it - soon I'll post an extended version).
Other than that, to summarize the current open points for future readers (this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more use cases (hopefully it was covered by v33, but I still need a confirmation from David, like blinking twice or something).

* Suspicious performance difference between different types of workload, mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions that are not covered by the index, to make skipping more flexible (one of the potential next steps in the future, as suggested by Floris).
Attachment
> * Suspicious performance difference between different types of workload,
> mentioned by Tomas (unfortunately I had no chance yet to investigate).

His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. The first call to _bt_readpage will already compare every tuple on the page and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done, the elements on the page (not the workspace) are compared to find the next one. Then another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. The workspace now contains [2,3,4,5]. For the next tuple we'll end up with [3,4,5], etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone.

-Floris
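A toy Python count of what this restart behaviour costs on the 5-tuple example page (a made-up helper, not the nbtree code):

```python
def comparisons_with_restart(page):
    """Each skip restarts a _bt_readpage-style pass from the new offset,
    re-comparing every remaining item on the page. With one restart per
    item found, item k is compared on every pass up to and including k."""
    total = 0
    for start in range(len(page)):   # one restart per distinct value found
        total += len(page) - start   # items compared in this pass
    return total

# page [1,2,3,4,5]: a single pass would cost 5 comparisons,
# restarting costs 5+4+3+2+1 = 15, i.e. n*(n+1)/2 overall
print(comparisons_with_restart([1, 2, 3, 4, 5]))  # 15
```

So the per-page work grows quadratically in the number of items when the planner misestimates and every "skip" only advances by one tuple, which would explain the 10-20x slow-downs Tomas observed.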
On Wed, Apr 8, 2020 at 1:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
> >
> > > > Hm, I wasn't aware of this one, thanks for bringing it up. Btw, Floris, I
> > > > would appreciate it if in the future you could make it more visible that
> > > > the changes you suggest contain some fixes. E.g. it wasn't clear to me
> > > > from your previous email that that's the case, and it doesn't make sense
> > > > to pull in different directions when we're trying to achieve the same goal :)
> >
> > I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.
>
> Oh, then fortunately I've got a wrong impression, sorry and thanks for
> the clarification :)
>
> > > > > > In the patch I posted a week ago these cases are all handled
> > > > > > correctly, as it introduces this extra logic in the Executor.
> > > > >
> > > > > Okay, so I think we can merge those fixes into Dmitry's patch set.
> > > >
> > > > I'll definitely take a look at the suggested changes in the filtering part.
> >
> > It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code.
>
> I was actually thinking more about just preventing skip scan in this
> situation, which, if I'm not mistaken, could be solved by inspecting
> qual conditions to figure out whether they're covered by the index -
> something like in the attachments (this implementation is actually too
> restrictive in this sense and will not allow e.g. expressions, that's
> why I haven't bumped the patch set version for it - soon I'll post an
> extended version).

Some more comments...

+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ _bt_update_skip_scankeys(scan, indexRel);
+
.......
+ /*
+  * We haven't found scan key within the current page, so let's scan from
+  * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+  * number
+  */
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+                    &buf, BT_READ, scan->xs_snapshot);

Why do we need to set so->skipScanKey->nextkey = ScanDirectionIsForward(dir) multiple times? I think it is fine to just set it once?

+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+                        Buffer buf, ScanDirection dir)
+{
+    OffsetNumber low, high;
+    Page page = BufferGetPage(buf);
+    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+    low = P_FIRSTDATAKEY(opaque);
+    high = PageGetMaxOffsetNumber(page);
+
+    if (unlikely(high < low))
+        return false;
+
+    return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
+            _bt_compare(scan->indexRelation, key, page, high) < 1);
+}

I think the high key condition should be changed to _bt_compare(scan->indexRelation, key, page, high) < 0. Because if the prefix qual is equal to the high key, then there is also no point in searching on the current page, so we can directly skip.

+ /* Check if an index skip scan is possible. */
+ can_skip = enable_indexskipscan & index->amcanskip;
+
+ /*
+  * Skip scan is not supported when there are qual conditions, which are not
+  * covered by index. The reason for that is that those conditions are
+  * evaluated later, already after skipping was applied.
+  *
+  * TODO: This implementation is too restrictive, and doesn't allow e.g.
+  * index expressions. For that we need to examine index_clauses too.
+  */
+ if (root->parse->jointree != NULL)
+ {
+     ListCell *lc;
+
+     foreach(lc, (List *) root->parse->jointree->quals)
+     {
+         Node *expr, *qual = (Node *) lfirst(lc);
+         Var *var;

I think we can avoid checking for the expression if can_skip is already false.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Other than that, to summarize the current open points for future readers
> (this thread somehow became quite big):
>
> * Making UniqueKeys usage more generic to allow using skip scan for more
> use cases (hopefully it was covered by v33, but I still need a
> confirmation from David, like blinking twice or something).

I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1].

I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any changes which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it.

[1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=oHatCMd7pTUKkdq0B72_08nbR3Hw@mail.gmail.com
Sorry for the late reply.

> On Tue, Apr 14, 2020 at 09:19:22PM +1200, David Rowley wrote:
>
> I've not yet looked at the latest patch, but I did put some thoughts
> into an email on the other thread that's been discussing UniqueKeys
> [1].
>
> I'm keen to hear thoughts on the plan I mentioned over there. Likely
> it would be best to discuss the specifics of what additional features
> we need to add to UniqueKeys for skip scans over here, but discuss any
> changes which affect both patches over there. We certainly can't have
> two separate implementations of UniqueKeys, so I believe the skip
> scans UniqueKeys patch should most likely be based on the one in [1]
> or some descendant of it.
>
> [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=oHatCMd7pTUKkdq0B72_08nbR3Hw@mail.gmail.com

Yes, I've come to the same conclusion, although I have my concerns about having such a dependency between patches. I will look at the suggested patches soon.
> On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
>
> Some more comments...

Thanks for reviewing. Since this patch took much longer than I expected, it's useful to have someone look at it with fresh eyes.

> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + _bt_update_skip_scankeys(scan, indexRel);
> +
> .......
> + /*
> +  * We haven't found scan key within the current page, so let's scan from
> +  * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> +  * number
> +  */
> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> +                    &buf, BT_READ, scan->xs_snapshot);
>
> Why do we need to set so->skipScanKey->nextkey =
> ScanDirectionIsForward(dir); multiple times? I think it is fine to
> just set it once?

I believe it was necessary for previous implementations, but in the current version we can avoid this, you're right.

> +static inline bool
> +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> +                        Buffer buf, ScanDirection dir)
> +{
> +    OffsetNumber low, high;
> +    Page page = BufferGetPage(buf);
> +    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> +    low = P_FIRSTDATAKEY(opaque);
> +    high = PageGetMaxOffsetNumber(page);
> +
> +    if (unlikely(high < low))
> +        return false;
> +
> +    return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> +            _bt_compare(scan->indexRelation, key, page, high) < 1);
> +}
>
> I think the high key condition should be changed to
> _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> prefix qual is equal to the high key then also there is no point in
> searching on the current page so we can directly skip.

From nbtree/README and the comments to functions like _bt_split I've got the impression that the high key could be equal to the last item on the leaf page, so there is a point in searching. Is that incorrect?

> + /* Check if an index skip scan is possible. */
> + can_skip = enable_indexskipscan & index->amcanskip;
> +
> + /*
> +  * Skip scan is not supported when there are qual conditions, which are not
> +  * covered by index. The reason for that is that those conditions are
> +  * evaluated later, already after skipping was applied.
> +  *
> +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
> +  * index expressions. For that we need to examine index_clauses too.
> +  */
> + if (root->parse->jointree != NULL)
> + {
> +     ListCell *lc;
> +
> +     foreach(lc, (List *) root->parse->jointree->quals)
> +     {
> +         Node *expr, *qual = (Node *) lfirst(lc);
> +         Var *var;
>
> I think we can avoid checking for the expression if can_skip is already false.

Yes, that makes sense. I'll include your suggestions in the next rebased version I'm preparing.
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
> >
> > Some more comments...
>
> Thanks for reviewing. Since this patch took much longer than I expected,
> it's useful to have someone look at it with fresh eyes.
>
> > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> > + _bt_update_skip_scankeys(scan, indexRel);
> > +
> > .......
> > + /*
> > +  * We haven't found scan key within the current page, so let's scan from
> > +  * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> > +  * number
> > +  */
> > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> > + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> > +                    &buf, BT_READ, scan->xs_snapshot);
> >
> > Why do we need to set so->skipScanKey->nextkey =
> > ScanDirectionIsForward(dir); multiple times? I think it is fine to
> > just set it once?
>
> I believe it was necessary for previous implementations, but in the
> current version we can avoid this, you're right.
>
> > +static inline bool
> > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > +                        Buffer buf, ScanDirection dir)
> > +{
> > +    OffsetNumber low, high;
> > +    Page page = BufferGetPage(buf);
> > +    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > +
> > +    low = P_FIRSTDATAKEY(opaque);
> > +    high = PageGetMaxOffsetNumber(page);
> > +
> > +    if (unlikely(high < low))
> > +        return false;
> > +
> > +    return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > +            _bt_compare(scan->indexRelation, key, page, high) < 1);
> > +}
> >
> > I think the high key condition should be changed to
> > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > prefix qual is equal to the high key then also there is no point in
> > searching on the current page so we can directly skip.
>
> From nbtree/README and the comments to functions like _bt_split I've got the
> impression that the high key could be equal to the last item on the leaf
> page, so there is a point in searching. Is that incorrect?

But IIUC, here we want to decide whether we will get the next key on the current page or not, right? So if our key (the last tuple's key) is equal to the high key, the max key on this page is the same as what we already got in the last tuple, so why would we want to look at this page? It will not give us a new key. So ideally, we should only be looking at this page if our last tuple's key is smaller than the high key. Am I missing something?

> > + /* Check if an index skip scan is possible. */
> > + can_skip = enable_indexskipscan & index->amcanskip;
> > +
> > + /*
> > +  * Skip scan is not supported when there are qual conditions, which are not
> > +  * covered by index. The reason for that is that those conditions are
> > +  * evaluated later, already after skipping was applied.
> > +  *
> > +  * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > +  * index expressions. For that we need to examine index_clauses too.
> > +  */
> > + if (root->parse->jointree != NULL)
> > + {
> > +     ListCell *lc;
> > +
> > +     foreach(lc, (List *) root->parse->jointree->quals)
> > +     {
> > +         Node *expr, *qual = (Node *) lfirst(lc);
> > +         Var *var;
> >
> > I think we can avoid checking for the expression if can_skip is already false.
>
> Yes, that makes sense. I'll include your suggestions in the next
> rebased version I'm preparing.

Ok.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
>
> > > +static inline bool
> > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > + Buffer buf, ScanDirection dir)
> > > +{
> > > + OffsetNumber low, high;
> > > + Page page = BufferGetPage(buf);
> > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > +
> > > + low = P_FIRSTDATAKEY(opaque);
> > > + high = PageGetMaxOffsetNumber(page);
> > > +
> > > + if (unlikely(high < low))
> > > + return false;
> > > +
> > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > +}
> > >
> > > I think the high key condition should be changed to
> > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > > prefix qual is equal to the high key then also there is no point in
> > > searching on the current page so we can directly skip.
> >
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?
>
> But IIUC, here we want to decide whether we will get the next key in
> the current page or not?

In general this function does what it says: it checks whether or not the
provided scankey could be found within the page. All the logic about
finding a proper next key to fetch is implemented at the call site, and
within this function we want to test whatever was passed in. Does that
answer the question?
On Mon, May 11, 2020 at 4:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
> >
> > > > +static inline bool
> > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > > + Buffer buf, ScanDirection dir)
> > > > +{
> > > > + OffsetNumber low, high;
> > > > + Page page = BufferGetPage(buf);
> > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > > +
> > > > + low = P_FIRSTDATAKEY(opaque);
> > > > + high = PageGetMaxOffsetNumber(page);
> > > > +
> > > > + if (unlikely(high < low))
> > > > + return false;
> > > > +
> > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > > +}
> > > >
> > > > I think the high key condition should be changed to
> > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if
> > > > prefix qual is equal to the high key then also there is no point in
> > > > searching on the current page so we can directly skip.
> > >
> > > From nbtree/README and comments to functions like _bt_split I've got an
> > > impression that the high key could be equal to the last item on the leaf
> > > page, so there is a point in searching. Is that incorrect?
> >
> > But IIUC, here we want to decide whether we will get the next key in
> > the current page or not?
>
> In general this function does what it says: it checks whether or not the
> provided scankey could be found within the page. All the logic about
> finding a proper next key to fetch is implemented at the call site, and
> within this function we want to test whatever was passed in. Does that
> answer the question?

Ok, I agree that the function is doing what it is expected to do. But
then I have a problem with this call site.

+ /* Check if the next unique key can be found within the current page.
+ * Since we do not lock the current page between jumps, it's possible
+ * that it was split since the last time we saw it. This is fine in
+ * case of scanning forward, since pages split to the right and we are
+ * still on the leftmost page. In case of scanning backwards it's
+ * possible to lose some pages and we need to remember the previous
+ * page, and then follow the right link from the current page until we
+ * find the original one.
+ *
+ * Since the whole idea of checking the current page is to protect
+ * ourselves and to make the statistics mismatch case (when there are
+ * too many distinct values for jumping) more performant, it's not
+ * clear if the complexity of this solution in case of backward scan is
+ * justified, so for now just avoid it.
+ */
+ if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir))
+ {
+ LockBuffer(so->currPos.buf, BT_READ);
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {

Here we expect to learn whether the "next" unique key can be found on
this page, but we are using a function which tells us whether the
"current" key can be found on this page. I think in boundary cases
where the high key is equal to the current key, this function will
return true (which is expected from this function), and based on that
we will simply scan the current page, and IMHO that cost could be
avoided, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 20, 2020 at 5:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
> Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it's usually
> clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own).

Just in case anybody missed it, I am working on a patch that makes
nbtree use Valgrind instrumentation to detect pages accessed without a
buffer content lock held:

https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com

There is also one component that detects when any buffer is accessed
without a buffer pin.

--
Peter Geoghegan
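As a rough illustration of the suggested style of check (a standalone toy, not bufmgr.c's actual data structures or LWLock machinery; `ToyBuffer` and the function names are invented for this sketch): a small assertion utility verifies that the caller holds the content lock before a page is examined, so unlocked accesses fail loudly in assert-enabled builds:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buffer descriptor: just tracks whether our backend holds the
 * content lock (real code would consult the LWLock machinery). */
typedef struct ToyBuffer
{
    bool content_lock_held;
    int  page_data;
} ToyBuffer;

/* Assertion utility in the spirit of the suggestion: verify the lock
 * is held before allowing page access. */
static void toy_assert_lock_held(const ToyBuffer *buf)
{
    assert(buf->content_lock_held);
}

static int toy_read_page(const ToyBuffer *buf)
{
    toy_assert_lock_held(buf);  /* catches unlocked access in cassert builds */
    return buf->page_data;
}
```

The point of centralizing the check in one helper, as suggested upthread, is that every page-access path can assert the same invariant instead of relying on each call site being obviously correct.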
> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
>
> + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> + {
>
> Here we expect whether the "next" unique key can be found on this page
> or not, but we are using the function which suggested whether the
> "current" key can be found on this page or not. I think in boundary
> cases where the high key is equal to the current key, this function
> will return true (which is expected from this function), and based on
> that we will simply scan the current page and IMHO that cost could be
> avoided no?
Yes, it looks like you're right, there is indeed an unnecessary extra
scan happening. To avoid that we can check the key->nextkey and adjust
the higher boundary correspondingly. Will also add this into the next
rebased patch, thanks!
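The adjusted check can be sketched under the same kind of toy model (sorted int arrays standing in for leaf pages; the names are hypothetical, not the patch's actual code): when the caller is looking for the next distinct key, the upper boundary comparison becomes strict, so a page whose last item equals the current key is skipped without being scanned:

```c
#include <stdbool.h>

/* Hypothetical model: decide whether the NEXT distinct key after
 * curkey can appear on this page. Because items are sorted, that is
 * the case exactly when the page's last item is greater than curkey;
 * the strict '<' is what avoids the needless extra scan when the
 * page maximum equals the current key. */
static bool model_next_key_within_page(int curkey, const int *items, int nitems)
{
    if (nitems < 1)
        return false;
    return curkey < items[nitems - 1];
}
```

In the boundary case Dilip raised, where the page maximum equals the current key, this version returns false and the scan jumps straight from the root instead of re-reading the page.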
> On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
>
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> > would appreciate if in the future you can make it more visible that changes you
> > suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> > that that's the case, and it doesn't make sense to pull into different direction
> > when we're trying to achieve the same goal :)
>
> I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.
Oh, then fortunately I just got the wrong impression, sorry, and thanks
for the clarification :)
> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code.
I was actually thinking more about just preventing skip scan in this
situation, which, if I'm not mistaken, could be solved by inspecting
qual conditions to figure out whether they're covered by the index -
something like in the attachments (this implementation is actually too
restrictive in this sense and will not allow e.g. expressions; that's
why I haven't bumped the patch set version for it - soon I'll post an
extended version).
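The idea can be sketched as a simple containment test (a toy model with made-up names, not the planner's real data structures such as RestrictInfo or IndexOptInfo): skip scan stays enabled only if every variable referenced by the quals is among the index's key columns, since a qual on a non-indexed column is evaluated after skipping and could otherwise wrongly discard rows:

```c
#include <stdbool.h>

/* Toy check: are all qual variables (identified here by attribute
 * number) covered by the index's key columns? If any is not, skipping
 * must be disabled, because the later qual evaluation would run on an
 * already-thinned stream of tuples. */
static bool model_quals_covered_by_index(const int *qual_attnos, int nquals,
                                         const int *index_attnos, int nkeys)
{
    for (int i = 0; i < nquals; i++)
    {
        bool found = false;

        for (int j = 0; j < nkeys; j++)
        {
            if (qual_attnos[i] == index_attnos[j])
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;   /* qual references a column outside the index */
    }
    return true;
}
```

This mirrors the restrictive approach described above: it disallows anything not plainly covered, which is also why plain column references pass but index expressions would not without extra work.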
Other than that, to summarize the current open points for future readers
(this thread has somehow become quite big):
* Making UniqueKeys usage more generic to allow using skip scan for more
use cases (hopefully it was covered by the v33, but I still need a
confirmation from David, like blinking twice or something).
* Suspicious performance difference between different type of workload,
mentioned by Tomas (unfortunately I had no chance yet to investigate).
* Thinking about supporting conditions, that are not covered by the index,
to make skipping more flexible (one of the potential next steps in the
future, as suggested by Floris).
> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote:
>
> > Other than that to summarize current open points for future readers
> > (this thread somehow became quite big):
> >
> > * Making UniqueKeys usage more generic to allow using skip scan for more
> > use cases (hopefully it was covered by the v33, but I still need a
> > confirmation from David, like blinking twice or something).
> >
> > * Suspicious performance difference between different type of workload,
> > mentioned by Tomas (unfortunately I had no chance yet to investigate).
> >
> > * Thinking about supporting conditions, that are not covered by the index,
> > to make skipping more flexible (one of the potential next steps in the
> > future, as suggested by Floris).
> >
>
> Looks this is the latest patch, which commit it is based on? Thanks
I have a rebased version, if that's what you're asking about. I haven't
posted it yet, mostly because I'm in the middle of adapting it to the
UniqueKeys from the other thread. Would it be ok for you to wait a bit
until I post the finished version?