Thread: Index Skip Scan

Re: Index Skip Scan

From

Alexander Korotkov

Date:

18 June 2018, 20:31:48

Hi!

On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

Great, I glad to see you working in this!

> However, as Robert Haas noted in the thread there are issues with the
> patch as is, especially in relationship to the amcanbackward functionality.
>
> A couple of questions to begin with.
>
> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
> a new node (T_IndexSkipScan) be created ? If latter, then there likely
> will be functionality that needs to be refactored into shared code
> between the nodes.

Is skip scan only possible for index-only scan?  I guess, that no.  We
could also make plain index scan to behave like a skip scan.  And it
should be useful for accelerating DISTINCT ON clause.  Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan.  So, I don't think I like index scan nodes to
multiply this way, and it would be probably better to keep number
nodes smaller.  But I don't insist on that, and I would like to hear
other opinions too.

> Which is the best way to deal with the amcanbackward functionality ? Do
> people see another alternative to Robert's idea of adding a flag to the
> scan.

Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient.  So, it seems to not worth
implementing, at least in the initial version.  And then the question
should how index access method report that it supports both skip scan
and backward scan, but not both together?  What about turning
amcanbackward into a function which takes (bool skipscan) argument?
Therefore, this function will return whether backward scan is
supported depending of whether skip scan is used.

> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best. The patch is based on master/4c8156.

Please, register it on commitfest.  If even there wouldn't be enough
of time for this patch on July commitfest, it's no problem to move it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Index Skip Scan

From

Andrew Dunstan

Date:

18 June 2018, 23:20:07


On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> Hi all,
>
> I would like to start a discussion on Index Skip Scan referred to as 
> Loose Index Scan in the wiki [1].


awesome


>
>
>
> I wasn't planning on making this a patch submission for the July 
> CommitFest due to the reasons mentioned above, but can do so if people 
> thinks it is best. T


New large features are not appropriate for the July CF. September should 
be your goal.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Alexander Korotkov

Date:

19 June 2018, 00:06:59

On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> > I wasn't planning on making this a patch submission for the July
> > CommitFest due to the reasons mentioned above, but can do so if people
> > thinks it is best. T
>
> New large features are not appropriate for the July CF. September should
> be your goal.

Assuming this, should we have possibility to register patch to
September CF from now?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

19 June 2018, 13:01:24

> On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan?  I guess, that no.  We
> could also make plain index scan to behave like a skip scan.  And it
> should be useful for accelerating DISTINCT ON clause.  Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan.  So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller.  But I don't insist on that, and I would like to hear
> other opinions too.

In one of patches I'm working on I had similar situation, when I wanted to
split one node into two similar nodes (before I just extended it) and logically
it made perfect sense. But it turned out to be quite useless and the advantage
I've got wasn't worth it - and just to mention, those nodes had more differences
than in this patch. So I agree that probably it would be better to keep using
IndexOnlyScan.

> On 19 June 2018 at 03:40, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
>> Assuming this, should we have possibility to register patch to
>> September CF from now?
>
> There cannot be two commit fests marked as open at the same time as
> Magnus mentions here:
> https://www.postgresql.org/message-id/CABUevEx1k+axZcV2t3wEYf5uLg72YbKSch_hUhFnZq+-KSoJsA@mail.gmail.com
>
> In this case, could you wait that the next CF is marked as in progress and
> that the one of September is opened?

Yep, since the next CF will start shortly that's the easiest thing to do.

Re: Index Skip Scan

From

Jesper Pedersen

Date:

19 June 2018, 20:00:36

Hi Alexander,

On 06/18/2018 01:31 PM, Alexander Korotkov wrote:
> <jesper.pedersen@redhat.com> wrote:
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
> 
> Is skip scan only possible for index-only scan?  I guess, that no.  We
> could also make plain index scan to behave like a skip scan.  And it
> should be useful for accelerating DISTINCT ON clause.  Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan.  So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller.  But I don't insist on that, and I would like to hear
> other opinions too.
> 

Yes, there are likely other use-cases for Index Skip Scan apart from the 
simplest form. Which sort of suggests that having dedicated nodes would 
be better in the long run.

My goal is to cover the simplest form, which can be handled by extending 
  the T_IndexOnlyScan node, or by having common functions that both use. 
We can always improve the functionality with future patches.

>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
> 
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient.  So, it seems to not worth
> implementing, at least in the initial version. > And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together?  What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.
> 

The feedback from Robert Haas seems to suggest that it was a requirement 
for the patch to be considered.

>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
> 
> Please, register it on commitfest.  If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.
> 

Based on the feedback from Andrew and Michael I won't register this 
thread until the September CF.

Thanks for your feedback !

Best regards,
  Jesper

Re: Index Skip Scan

From

Jesper Pedersen

Date:

19 June 2018, 20:06:08

Hi Dmitry,

On 06/19/2018 06:01 AM, Dmitry Dolgov wrote:
>> On 18 June 2018 at 19:31, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>> Is skip scan only possible for index-only scan?  I guess, that no.  We
>> could also make plain index scan to behave like a skip scan.  And it
>> should be useful for accelerating DISTINCT ON clause.  Thus, we might
>> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
>> IndexOnlySkipScan.  So, I don't think I like index scan nodes to
>> multiply this way, and it would be probably better to keep number
>> nodes smaller.  But I don't insist on that, and I would like to hear
>> other opinions too.
> 
> In one of patches I'm working on I had similar situation, when I wanted to
> split one node into two similar nodes (before I just extended it) and logically
> it made perfect sense. But it turned out to be quite useless and the advantage
> I've got wasn't worth it - and just to mention, those nodes had more differences
> than in this patch. So I agree that probably it would be better to keep using
> IndexOnlyScan.
> 

I looked at this today, and creating a new node (T_IndexOnlySkipScan) 
would make the patch more complex.

The question is if the patch should create such a node such that future 
patches didn't have to deal with refactoring to a new node to cover 
additional functionality.

Thanks for your feedback !

Best regards,
  Jesper

Re: Index Skip Scan

From

Bhushan Uparkar

Date:

16 August 2018, 08:44:58

Hello Jesper,

I was reviewing index-skip patch example and have a comment on it. Example query “select distinct b from t1” is
equivalentto “select b from t1 group by b”. When I tried the 2nd form of query it came up with different plan, is it
possiblethat index skip scan can address it as well?  


postgres=# explain verbose select  b from t1 group by b;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Group  (cost=97331.29..97332.01 rows=3 width=4)
   Output: b
   Group Key: t1.b
   ->  Gather Merge  (cost=97331.29..97331.99 rows=6 width=4)
         Output: b
         Workers Planned: 2
         ->  Sort  (cost=96331.27..96331.27 rows=3 width=4)
               Output: b
               Sort Key: t1.b
               ->  Partial HashAggregate  (cost=96331.21..96331.24 rows=3 width=4)
                     Output: b
                     Group Key: t1.b
                     ->  Parallel Seq Scan on public.t1  (cost=0.00..85914.57 rows=4166657 width=4)
                           Output: a, b
(14 rows)

Time: 1.167 ms

— And here is the original example
postgres=# explain verbose SELECT DISTINCT b FROM t1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4)
   Output: b
(2 rows)

Time: 0.987 ms



> On Jun 18, 2018, at 10:31 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> Hi!
>
> On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> I would like to start a discussion on Index Skip Scan referred to as
>> Loose Index Scan in the wiki [1].
>
> Great, I glad to see you working in this!
>
>> However, as Robert Haas noted in the thread there are issues with the
>> patch as is, especially in relationship to the amcanbackward functionality.
>>
>> A couple of questions to begin with.
>>
>> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
>> a new node (T_IndexSkipScan) be created ? If latter, then there likely
>> will be functionality that needs to be refactored into shared code
>> between the nodes.
>
> Is skip scan only possible for index-only scan?  I guess, that no.  We
> could also make plain index scan to behave like a skip scan.  And it
> should be useful for accelerating DISTINCT ON clause.  Thus, we might
> have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
> IndexOnlySkipScan.  So, I don't think I like index scan nodes to
> multiply this way, and it would be probably better to keep number
> nodes smaller.  But I don't insist on that, and I would like to hear
> other opinions too.
>
>> Which is the best way to deal with the amcanbackward functionality ? Do
>> people see another alternative to Robert's idea of adding a flag to the
>> scan.
>
> Supporting amcanbackward seems to be basically possible, but rather
> complicated and not very efficient.  So, it seems to not worth
> implementing, at least in the initial version.  And then the question
> should how index access method report that it supports both skip scan
> and backward scan, but not both together?  What about turning
> amcanbackward into a function which takes (bool skipscan) argument?
> Therefore, this function will return whether backward scan is
> supported depending of whether skip scan is used.
>
>> I wasn't planning on making this a patch submission for the July
>> CommitFest due to the reasons mentioned above, but can do so if people
>> thinks it is best. The patch is based on master/4c8156.
>
> Please, register it on commitfest.  If even there wouldn't be enough
> of time for this patch on July commitfest, it's no problem to move it.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>

Re: Index Skip Scan

From

Thomas Munro

Date:

16 August 2018, 09:22:25

On Thu, Aug 16, 2018 at 5:44 PM, Bhushan Uparkar
<bhushan.uparkar@gmail.com> wrote:
> I was reviewing index-skip patch example and have a comment on it. Example query “select distinct b from t1” is
equivalentto “select b from t1 group by b”. When I tried the 2nd form of query it came up with different plan, is it
possiblethat index skip scan can address it as well?

Yeah, there are a few tricks you can do with "index skip scans"
(Oracle name, or as IBM calls them, "index jump scans"... I was
slightly tempted to suggest we call ours "index hop scans"...). For
example:

* groups and certain aggregates (MIN() and MAX() of suffix index
columns within each group)
* index scans where the scan key doesn't include the leading columns
(but you expect there to be sufficiently few values)
* merge joins (possibly the trickiest and maybe out of range)

You're right that a very simple GROUP BY can be equivalent to a
DISTINCT query, but I'm not sure if it's worth recognising that
directly or trying to implement the more general grouping trick that
can handle MIN/MAX, and whether that should be the same executor
node... The idea of starting with DISTINCT was just that it's
comparatively easy. We should certainly try to look ahead and bear
those features in mind when figuring out the interfaces though. Would
the indexam skip(scan, direction, prefix_size) operation I proposed be
sufficient? Is there a better way?

I'm glad to see this topic come back!

--
Thomas Munro
http://www.enterprisedb.com

Re: Index Skip Scan

From

Jesper Pedersen

Date:

16 August 2018, 21:23:10

Hi Bhushan,

On 08/16/2018 01:44 AM, Bhushan Uparkar wrote:
> I was reviewing index-skip patch example and have a comment on it.

Thanks for your interest, and feedback on this patch !

> Example query “select distinct b from t1” is equivalent to “select b from t1 group by b”. When I tried the 2nd form >
ofquery it came up with different plan, is it possible that index skip scan can address it as well?

> 

Like Thomas commented down-thread my goal is to keep this contribution 
as simple as possible in order to get to something that can be 
committed. Improvements can follow in future CommitFests, which may end 
up in the same release.

However, as stated in my original mail my goal is the simplest form of 
Index Skip Scan (or whatever we call it). I welcome any help with the patch.

Best regards,
  Jesper

Re: Index Skip Scan

From

Jesper Pedersen

Date:

16 August 2018, 21:28:45

Hi Thomas,

On 08/16/2018 02:22 AM, Thomas Munro wrote:
> The idea of starting with DISTINCT was just that it's
> comparatively easy.  We should certainly try to look ahead and bear
> those features in mind when figuring out the interfaces though.  Would
> the indexam skip(scan, direction, prefix_size) operation I proposed be
> sufficient?  Is there a better way?
> 

Yeah, I'm hoping that a Committer can provide some feedback on the 
direction that this patch needs to take.

One thing to consider is the pluggable storage patch, which is a lot 
more important than this patch. I don't want this patch to get in the 
way of that work, so it may have to wait a bit in order to see any new 
potential requirements.

> I'm glad to see this topic come back!
> 

You did the work, and yes hopefully we can get closer to this subject in 
12 :)

Best regards,
  Jesper

Re: Index Skip Scan

From

Stephen Frost

Date:

16 August 2018, 21:36:02

Greetings,

* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
> On 08/16/2018 02:22 AM, Thomas Munro wrote:
> >The idea of starting with DISTINCT was just that it's
> >comparatively easy.  We should certainly try to look ahead and bear
> >those features in mind when figuring out the interfaces though.  Would
> >the indexam skip(scan, direction, prefix_size) operation I proposed be
> >sufficient?  Is there a better way?
>
> Yeah, I'm hoping that a Committer can provide some feedback on the direction
> that this patch needs to take.

Thomas is one these days. :)

At least on first glance, that indexam seems to make sense to me, but
I've not spent a lot of time thinking about it.  Might be interesting to
ask Peter G about it though.

> One thing to consider is the pluggable storage patch, which is a lot more
> important than this patch. I don't want this patch to get in the way of that
> work, so it may have to wait a bit in order to see any new potential
> requirements.

Not sure where this came from, but I don't think it's particularly good
to be suggesting that one feature is more important than another or that
we need to have one wait for another as this seems to imply.  I'd
certainly really like to see PG finally have skipping scans, for one
thing, and it seems like with some effort that might be able to happen
for v12.  I'm not convinced that we're going to get pluggable storage to
happen in v12 and I don't really agree with recommending that people
hold off on making improvements to things because it's coming.

Thanks!

Stephen

Attachment

signature.asc

Re: Index Skip Scan

From

Jesper Pedersen

Date:

16 August 2018, 21:50:34

Hi Stephen,

On 08/16/2018 02:36 PM, Stephen Frost wrote:
>> Yeah, I'm hoping that a Committer can provide some feedback on the direction
>> that this patch needs to take.
> 
> Thomas is one these days. :)
>

I know :)  However, there are some open questions from Thomas' original 
submission that still needs to be ironed out.

> At least on first glance, that indexam seems to make sense to me, but
> I've not spent a lot of time thinking about it.  Might be interesting to
> ask Peter G about it though.
> 

Yes, or Anastasia who also have done a lot of work on nbtree/.

>> One thing to consider is the pluggable storage patch, which is a lot more
>> important than this patch. I don't want this patch to get in the way of that
>> work, so it may have to wait a bit in order to see any new potential
>> requirements.
> 
> Not sure where this came from, but I don't think it's particularly good
> to be suggesting that one feature is more important than another or that
> we need to have one wait for another as this seems to imply.  I'd
> certainly really like to see PG finally have skipping scans, for one
> thing, and it seems like with some effort that might be able to happen
> for v12.  I'm not convinced that we're going to get pluggable storage to
> happen in v12 and I don't really agree with recommending that people
> hold off on making improvements to things because it's coming.
> 

My point was that I know this patch needs work, so any feedback that get 
it closer to a solution will help.

Pluggable storage may / may not add new requirements, but it is up to 
the people working on that, some of which are Committers, to take time 
"off" to provide feedback for this patch in order steer me in right 
direction.

Work can happen in parallel, and I'm definitely not recommending that 
people hold off on any patches that they want to provide feedback for, 
or submit for a CommitFest.

Best regards,
  Jesper

Re: Index Skip Scan

From

Stephen Frost

Date:

16 August 2018, 22:05:30

Greetings,

* Jesper Pedersen (jesper.pedersen@redhat.com) wrote:
> On 08/16/2018 02:36 PM, Stephen Frost wrote:
> >Not sure where this came from, but I don't think it's particularly good
> >to be suggesting that one feature is more important than another or that
> >we need to have one wait for another as this seems to imply.  I'd
> >certainly really like to see PG finally have skipping scans, for one
> >thing, and it seems like with some effort that might be able to happen
> >for v12.  I'm not convinced that we're going to get pluggable storage to
> >happen in v12 and I don't really agree with recommending that people
> >hold off on making improvements to things because it's coming.
>
> My point was that I know this patch needs work, so any feedback that get it
> closer to a solution will help.
>
> Pluggable storage may / may not add new requirements, but it is up to the
> people working on that, some of which are Committers, to take time "off" to
> provide feedback for this patch in order steer me in right direction.

I don't think it's really necessary for this work to be suffering under
some concern that pluggable storage will make it have to change.  Sure,
it might, but it also very well might not.  For my 2c, anyway, this
seems likely to get into the tree before pluggable storage does and it's
pretty unlikely to be the only thing that that work will need to be
prepared to address when it happens.

> Work can happen in parallel, and I'm definitely not recommending that people
> hold off on any patches that they want to provide feedback for, or submit
> for a CommitFest.

Yes, work can happen in parallel, and I don't really think there needs
to be a concern about some other patch set when it comes to getting this
patch committed.

Thanks!

Stephen

Attachment

signature.asc

Re: Index Skip Scan

From

Peter Geoghegan

Date:

16 August 2018, 22:48:34

On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Yeah, there are a few tricks you can do with "index skip scans"
> (Oracle name, or as IBM calls them, "index jump scans"... I was
> slightly tempted to suggest we call ours "index hop scans"...).

Hopscotch scans?

> * groups and certain aggregates (MIN() and MAX() of suffix index
> columns within each group)
> * index scans where the scan key doesn't include the leading columns
> (but you expect there to be sufficiently few values)
> * merge joins (possibly the trickiest and maybe out of range)

FWIW, I suspect that we're going to have the biggest problems in the
optimizer. It's not as if ndistinct is in any way reliable. That may
matter more on average than it has with other path types.

-- 
Peter Geoghegan

Re: Index Skip Scan

From

Andres Freund

Date:

17 August 2018, 00:44:19

On August 16, 2018 8:28:45 PM GMT+02:00, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>One thing to consider is the pluggable storage patch, which is a lot
>more important than this patch. I don't want this patch to get in the
>way of that work, so it may have to wait a bit in order to see any new
>potential requirements.

I don't think this would be a meaningful, relative to the size of the patch sets, amount of conflict between the two. So I don't think we have to consider relative importance (which I don't think is that easy to assess in this case).

Fwiw, I've a significantly further revised version of the tableam patch that I plan to send in a few days. Ported the current zheap patch as a separate AM which helped weed out a lot of issues.

Andres

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Index Skip Scan

From

Thomas Munro

Date:

17 August 2018, 02:10:56

On Fri, Aug 17, 2018 at 7:48 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> * groups and certain aggregates (MIN() and MAX() of suffix index
>> columns within each group)
>> * index scans where the scan key doesn't include the leading columns
>> (but you expect there to be sufficiently few values)
>> * merge joins (possibly the trickiest and maybe out of range)
>
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.

Can you give an example of problematic ndistinct underestimation?

I suppose you might be able to defend against that in the executor: if
you find that you've done an unexpectedly high number of skips, you
could fall back to regular next-tuple mode.  Unfortunately that's
require the parent plan node to tolerate non-unique results.

I noticed that the current patch doesn't care about restrictions on
the range (SELECT DISTINCT a FROM t WHERE a BETWEEN 500 and 600), but
that causes it to overestimate the number of btree searches, which is
a less serious problem (it might not chose a skip scan when it would
have been better).

-- 
Thomas Munro
http://www.enterprisedb.com

Re: Index Skip Scan

From

Jesper Pedersen

Date:

17 August 2018, 20:15:46

Hi Peter,

On 08/16/2018 03:48 PM, Peter Geoghegan wrote:
> On Wed, Aug 15, 2018 at 11:22 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> * groups and certain aggregates (MIN() and MAX() of suffix index
>> columns within each group)
>> * index scans where the scan key doesn't include the leading columns
>> (but you expect there to be sufficiently few values)
>> * merge joins (possibly the trickiest and maybe out of range)
> 
> FWIW, I suspect that we're going to have the biggest problems in the
> optimizer. It's not as if ndistinct is in any way reliable. That may
> matter more on average than it has with other path types.
> 

Thanks for sharing this; it is very useful to know.

Best regards,
  Jesper

Re: Index Skip Scan

From

Peter Geoghegan

Date:

17 August 2018, 20:52:05

On Thu, Aug 16, 2018 at 4:10 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Can you give an example of problematic ndistinct underestimation?

Yes. See https://postgr.es/m/CAKuK5J12QokFh88tQz-oJMSiBg2QyjM7K7HLnbYi3Ze+Y5BtWQ@mail.gmail.com,
for example. That's a complaint about an underestimation specifically.

This seems to come up about once every 3 years, at least from my
perspective. I'm always surprised that ndistinct doesn't get
implicated in bad query plans more frequently.

> I suppose you might be able to defend against that in the executor: if
> you find that you've done an unexpectedly high number of skips, you
> could fall back to regular next-tuple mode.  Unfortunately that's
> require the parent plan node to tolerate non-unique results.

I like the idea of dynamic fallback in certain situations, but the
details always seem complicated.

-- 
Peter Geoghegan

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

11 September 2018, 00:47:06

> On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> I took Thomas Munro's previous patch [2] on the subject, added a GUC, a
> test case, documentation hooks, minor code cleanups, and made the patch
> pass an --enable-cassert make check-world run. So, the overall design is
> the same.

I've looked through the patch more closely, and have a few questions:

* Is there any reason why only copy function for the IndexOnlyScan node
  includes an implementation for distinctPrefix? Without read/out functionality
  skip doesn't work for parallel scans, so it becomes like that:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
                                          QUERY PLAN

-------------------------------------------------------------------------------
 Gather  (cost=1000.43..1001.60 rows=3 width=4)
(actual time=11.054..17672.010 rows=10000000 loops=1)
   Output: b
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   Buffers: shared hit=91035 read=167369
   ->  Index Skip Scan using idx_t1_b on public.t1
   (cost=0.43..1.30 rows=3 width=4)
   (actual time=1.350..16065.165 rows=10000000 loops=1)
         Output: b
         Heap Fetches: 10000000
         Buffers: shared hit=91035 read=167369
         Worker 0: actual time=1.350..16065.165 rows=10000000 loops=1
           Buffers: shared hit=91035 read=167369
 Planning Time: 0.394 ms
 Execution Time: 6037.800 ms

  and with this functionality it gets better:

=# SET force_parallel_mode TO ON;
=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
                                    QUERY PLAN

-------------------------------------------------------------------------------
 Gather  (cost=1000.43..1001.60 rows=3 width=4)
(actual time=3.564..4.427 rows=3 loops=1)
   Output: b
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   Buffers: shared hit=4 read=10
   ->  Index Skip Scan using idx_t1_b on public.t1
   (cost=0.43..1.30 rows=3 width=4)
   (actual time=0.065..0.133 rows=3 loops=1)
         Output: b
         Heap Fetches: 3
         Buffers: shared hit=4 read=10
         Worker 0: actual time=0.065..0.133 rows=3 loops=1
           Buffers: shared hit=4 read=10
 Planning Time: 1.724 ms
 Execution Time: 4.522 ms

* What is the purpose of HeapFetches? I don't see any usage of this variable
  except assigning 0 to it once.

Re: Index Skip Scan

From

Jesper Pedersen

Date:

11 September 2018, 16:21:57

Hi Dmitry,

On 9/10/18 5:47 PM, Dmitry Dolgov wrote:
>> On Mon, 18 Jun 2018 at 17:26, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> I've looked through the patch more closely, and have a few questions:
>

Thanks for your review !

> * Is there any reason why only copy function for the IndexOnlyScan node
>    includes an implementation for distinctPrefix?

Oversight -- thanks for catching that.

> * What is the purpose of HeapFetches? I don't see any usage of this variable
>    except assigning 0 to it once.
> 

That can be removed.

New version WIP v2 attached.

Best regards,
  Jesper

Attachment

wip_indexskipscan_v2.patch

Re: Index Skip Scan

From

Alexander Kuzmenkov

Date:

13 September 2018, 16:01:13

Hi Jesper,

While testing this patch I noticed that current implementation doesn't perform well when we have lots of small groups of equal values. Here is the execution time of index skip scan vs unique over index scan, in ms, depending on the size of group. The benchmark script is attached.

group size skip unique1 2,293.85 132.555 464.40 106.5910 239.61 102.0250 56.59 98.74100 32.56 103.04500 6.08 97.09
So, the current implementation can lead to performance regression, and the choice of the plan depends on the notoriously unreliable ndistinct statistics. The regression is probably because skip scan always does _bt_search to find the next unique tuple. I think we can improve this, and the skip scan can be strictly faster than index scan regardless of the data. As a first approximation, imagine that we somehow skipped equal tuples inside _bt_next instead of sending them to the parent Unique node. This would already be marginally faster than Unique + Index scan. A more practical implementation would be to remember our position in tree (that is, BTStack returned by _bt_search) and use it to skip pages in bulk. This looks straightforward to implement for a tree that does not change, but I'm not sure how to make it work with concurrent modifications. Still, this looks a worthwhile direction to me, because if we have a strictly faster skip scan, we can just use it always and not worry about our unreliable statistics. What do you think?

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

test-skip.sql

Re: Index Skip Scan

From

Jesper Pedersen

Date:

13 September 2018, 18:39:46

Hi Alexander.

On 9/13/18 9:01 AM, Alexander Kuzmenkov wrote:
> While testing this patch

Thanks for the review !

> I noticed that current implementation doesn't 
> perform well when we have lots of small groups of equal values. Here is 
> the execution time of index skip scan vs unique over index scan, in ms, 
> depending on the size of group. The benchmark script is attached.
> 
> group size    skip        unique
> 1             2,293.85    132.55
> 5             464.40      106.59
> 10            239.61      102.02
> 50            56.59       98.74
> 100           32.56       103.04
> 500           6.08        97.09
> 

Yes, this doesn't look good. Using your test case I'm seeing that unique 
is being chosen when the group size is below 34, and skip above. This is 
with the standard initdb configuration; did you change something else ? 
Or did you force the default plan ?

> So, the current implementation can lead to performance regression, and 
> the choice of the plan depends on the notoriously unreliable ndistinct 
> statistics. 

Yes, Peter mentioned this, which I'm still looking at.

> The regression is probably because skip scan always does 
> _bt_search to find the next unique tuple. 

Very likely.

> I think we can improve this, 
> and the skip scan can be strictly faster than index scan regardless of 
> the data. As a first approximation, imagine that we somehow skipped 
> equal tuples inside _bt_next instead of sending them to the parent 
> Unique node. This would already be marginally faster than Unique + Index 
> scan. A more practical implementation would be to remember our position 
> in tree (that is, BTStack returned by _bt_search) and use it to skip 
> pages in bulk. This looks straightforward to implement for a tree that 
> does not change, but I'm not sure how to make it work with concurrent 
> modifications. Still, this looks a worthwhile direction to me, because 
> if we have a strictly faster skip scan, we can just use it always and 
> not worry about our unreliable statistics. What do you think?
> 

This is something to look at -- maybe there is a way to use 
btpo_next/btpo_prev instead/too in order to speed things up. Atm we just 
have the scan key in BTScanOpaqueData. I'll take a look after my 
upcoming vacation; feel free to contribute those changes in the meantime 
of course.

Thanks again !

Best regards,
  Jesper

Re: Index Skip Scan

From

Alexander Kuzmenkov

Date:

13 September 2018, 22:36:24

El 13/09/18 a las 18:39, Jesper Pedersen escribió:
>
> Yes, this doesn't look good. Using your test case I'm seeing that 
> unique is being chosen when the group size is below 34, and skip 
> above. This is with the standard initdb configuration; did you change 
> something else ? Or did you force the default plan ?

Sorry I didn't mention this, the first column is indeed forced skip 
scan, just to see how it compares to index scan.

> This is something to look at -- maybe there is a way to use 
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we 
> just have the scan key in BTScanOpaqueData. I'll take a look after my 
> upcoming vacation; feel free to contribute those changes in the 
> meantime of course.
>

I probably won't be able to contribute the changes, but I'll try to 
review them.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

15 September 2018, 22:52:39

> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
>
> El 13/09/18 a las 18:39, Jesper Pedersen escribió:
>
>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>>
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

But having this logic inside _bt_next means that it will make a non-skip index
only scan a bit slower, am I right? Probably it would be easier and more
straightforward to go with the idea of dynamic fallback then. The first naive
implementation that I came up with is to wrap an index scan node into a unique,
and remember estimated number of groups into IndexOnlyScanState, so that we can
check if we performed much more skips than expected. With this approach index
skip scan will work a bit slower than in the original patch in case if
ndistinct is correct (because a unique node will recheck rows we returned), and
fallback to unique + index only scan in case if planner has underestimated
ndistinct.

Attachment

index-skip-fallback.patch

Re: Index Skip Scan

From

Jesper Pedersen

Date:

27 September 2018, 16:59:50

Hi Dmitry,

On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
>> El 13/09/18 a las 18:39, Jesper Pedersen escribió:
>>> I think we can improve this,
>>> and the skip scan can be strictly faster than index scan regardless of
>>> the data. As a first approximation, imagine that we somehow skipped
>>> equal tuples inside _bt_next instead of sending them to the parent
>>> Unique node. This would already be marginally faster than Unique + Index
>>> scan. A more practical implementation would be to remember our position
>>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>>> pages in bulk. This looks straightforward to implement for a tree that
>>> does not change, but I'm not sure how to make it work with concurrent
>>> modifications. Still, this looks a worthwhile direction to me, because
>>> if we have a strictly faster skip scan, we can just use it always and
>>> not worry about our unreliable statistics. What do you think?
>>>
>>
>> This is something to look at -- maybe there is a way to use
>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
>> have the scan key in BTScanOpaqueData. I'll take a look after my
>> upcoming vacation; feel free to contribute those changes in the meantime
>> of course.
> 
> But having this logic inside _bt_next means that it will make a non-skip index
> only scan a bit slower, am I right?

Correct.

> Probably it would be easier and more
> straightforward to go with the idea of dynamic fallback then. The first naive
> implementation that I came up with is to wrap an index scan node into a unique,
> and remember estimated number of groups into IndexOnlyScanState, so that we can
> check if we performed much more skips than expected. With this approach index
> skip scan will work a bit slower than in the original patch in case if
> ndistinct is correct (because a unique node will recheck rows we returned), and
> fallback to unique + index only scan in case if planner has underestimated
> ndistinct.
> 

I think we need a comment on this in the patch, as 10 * 
node->ioss_PlanRows looks a bit random.

Thanks for your contribution !

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

27 September 2018, 23:28:34

> On Thu, 27 Sep 2018 at 15:59, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> I think we need a comment on this in the patch, as 10 *
> node->ioss_PlanRows looks a bit random.

Yeah, you're right, it's a bit arbitrary number - we just need to make sure
that this estimation is not too small (to avoid false positives), but also not
too big (to not miss a proper point for fallback). I left it uncommented mostly
because I wanted to get some feedback on it first, and probably some
suggestions about how to make this estimation better.

Re: Index Skip Scan

From

Pavel Stehule

Date:

09 October 2018, 13:42:24

Hi

I tested last patch and I have some notes:

1. 

postgres=# explain select distinct a10000 from foo;
+-------------------------------------------------------------------------------------------+
|                                        QUERY PLAN                                         |
+-------------------------------------------------------------------------------------------+
| Unique  (cost=0.43..4367.56 rows=9983 width=4)                                            |
|   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4) |
+-------------------------------------------------------------------------------------------+
(2 rows)

In this case Unique node is useless and can be removed

2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

3. Once time patched postgres crashed, but I am not able to reproduce it.

Looks like very interesting patch, and important for some BI platforms

Re: Index Skip Scan

From

Jesper Pedersen

Date:

09 October 2018, 13:58:09

Hi Pavel,

On 10/9/18 9:42 AM, Pavel Stehule wrote:
> I tested last patch and I have some notes:
> 
> 1.
> 
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> |                                        QUERY PLAN                                         |
> +-------------------------------------------------------------------------------------------+
> | Unique  (cost=0.43..4367.56 rows=9983 width=4)                                            |
> |   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
> 
> In this case Unique node is useless and can be removed
> 
> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport
> 
> 3. Once time patched postgres crashed, but I am not able to reproduce it.
>

Please, send that query through if you can replicate it. The patch 
currently passes an assert'ed check-world, so your query clearly 
triggered something that isn't covered yet.

> Looks like very interesting patch, and important for some BI platforms
> 

Thanks for your review !

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

09 October 2018, 13:59:28

> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> |                                        QUERY PLAN                                         |
> +-------------------------------------------------------------------------------------------+
> | Unique  (cost=0.43..4367.56 rows=9983 width=4)                                            |
> |   ->  Index Skip Scan using foo_a10000_idx on foo  (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed

Just to clarify which exactly version were you testing? If
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with an idea to fallback to original plan
(and to tolerate that I suggested to use Unique, since we don't know
if fallback will happen or not during the planning).

> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

Yep, as far as I understand MIN/MAX is going to be the next step after this
patch will be accepted.

> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some ideas what could cause that or what's the possible
way to reproduce that doesn't work anymore?

Re: Index Skip Scan

From

Pavel Stehule

Date:

09 October 2018, 14:13:03

út 9. 10. 2018 v 15:59 odesílatel Dmitry Dolgov <9erthalion6@gmail.com> napsal:

> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> | QUERY PLAN |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4) |
> | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed

Just to clarify which exactly version were you testing? If
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with an idea to fallback to original plan
(and to tolerate that I suggested to use Unique, since we don't know
if fallback will happen or not during the planning).

I tested index-skip-fallback.patch.

It looks like good idea, but then the node should be named "index scan" and other info can be displayed in detail parts. It can be similar like "sort".

The combination of unique and index skip scan looks strange :)

> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

Yep, as far as I understand MIN/MAX is going to be the next step after this
patch will be accepted.

Now, the development cycle is starting - maybe it can use same infrastructure like MIN/MAX and this part can be short.

more if you use dynamic index scan

> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some ideas what could cause that or what's the possible
way to reproduce that doesn't work anymore?

I think it was query like

select count(*) from (select distinct x from tab) s

Re: Index Skip Scan

From

Pavel Stehule

Date:

09 October 2018, 16:12:31

út 9. 10. 2018 v 16:13 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:

út 9. 10. 2018 v 15:59 odesílatel Dmitry Dolgov <9erthalion6@gmail.com> napsal:
> On Tue, 9 Oct 2018 at 15:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> I tested last patch and I have some notes:
>
> 1.
>
> postgres=# explain select distinct a10000 from foo;
> +-------------------------------------------------------------------------------------------+
> | QUERY PLAN |
> +-------------------------------------------------------------------------------------------+
> | Unique (cost=0.43..4367.56 rows=9983 width=4) |
> | -> Index Skip Scan using foo_a10000_idx on foo (cost=0.43..4342.60 rows=9983 width=4) |
> +-------------------------------------------------------------------------------------------+
> (2 rows)
>
> In this case Unique node is useless and can be removed

Just to clarify which exactly version were you testing? If
index-skip-fallback.patch,
then the Unique node was added there to address the situation when
ndistinct is underestimated, with an idea to fallback to original plan
(and to tolerate that I suggested to use Unique, since we don't know
if fallback will happen or not during the planning).

I tested index-skip-fallback.patch.

It looks like good idea, but then the node should be named "index scan" and other info can be displayed in detail parts. It can be similar like "sort".

The combination of unique and index skip scan looks strange :)

maybe we don't need special index skip scan node - maybe possibility to return unique values from index scan node can be good enough - some like "distinct index scan" - and the implementation there can be dynamic -skip scan, classic index scan,

"index skip scan" is not good name if the implementaion is dynamic.

> 2. Can be nice COUNT(DISTINCT support) similarly like MIN, MAX suppport

Yep, as far as I understand MIN/MAX is going to be the next step after this
patch will be accepted.

ok

Now, the development cycle is starting - maybe it can use same infrastructure like MIN/MAX and this part can be short.

more if you use dynamic index scan

> 3. Once time patched postgres crashed, but I am not able to reproduce it.

Maybe you have at least some ideas what could cause that or what's the possible
way to reproduce that doesn't work anymore?

I think it was query like

select count(*) from (select distinct x from tab) s

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

10 October 2018, 15:34:27

> On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> It looks like good idea, but then the node should be named "index scan" and
>> other info can be displayed in detail parts. It can be similar like "sort".
>> The combination of unique and index skip scan looks strange :)
>
> maybe we don't need special index skip scan node - maybe possibility to
> return unique values from index scan node can be good enough - some like
> "distinct index scan" - and the implementation there can be dynamic -skip
> scan, classic index scan,
>
> "index skip scan" is not good  name if the implementaion is dynamic.

Yeah, that's a valid point. The good part is that index skip scan is not really
a separate node, but just enhanced index only scan node. So indeed maybe it
would be better to call it Index Only Scan, but show in details that we apply
the skip scan strategy. Any other opinions about this?

>> I think it was query like
>> select count(*) from (select distinct x from tab) s

Thanks, I'll take a look.

Re: Index Skip Scan

From

Robert Haas

Date:

12 October 2018, 17:43:48

On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> > I noticed that current implementation doesn't
> > perform well when we have lots of small groups of equal values. Here is
> > the execution time of index skip scan vs unique over index scan, in ms,
> > depending on the size of group. The benchmark script is attached.
> >
> > group size    skip        unique
> > 1             2,293.85    132.55
> > 5             464.40      106.59
> > 10            239.61      102.02
> > 50            56.59       98.74
> > 100           32.56       103.04
> > 500           6.08        97.09
> >
>
> Yes, this doesn't look good. Using your test case I'm seeing that unique
> is being chosen when the group size is below 34, and skip above.

I'm not sure exactly how the current patch is approaching the problem,
but it seems like it might be reasonable to do something like -- look
for a distinct item within the current page; if not found, then search
from the root of the tree for the next item > the current item.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

16 October 2018, 19:22:18

> On Fri, 12 Oct 2018 at 19:44, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 13, 2018 at 11:40 AM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > > I noticed that current implementation doesn't
> > > perform well when we have lots of small groups of equal values. Here is
> > > the execution time of index skip scan vs unique over index scan, in ms,
> > > depending on the size of group. The benchmark script is attached.
> > >
> > > group size    skip        unique
> > > 1             2,293.85    132.55
> > > 5             464.40      106.59
> > > 10            239.61      102.02
> > > 50            56.59       98.74
> > > 100           32.56       103.04
> > > 500           6.08        97.09
> > >
> >
> > Yes, this doesn't look good. Using your test case I'm seeing that unique
> > is being chosen when the group size is below 34, and skip above.
>
> I'm not sure exactly how the current patch is approaching the problem,
> but it seems like it might be reasonable to do something like -- look
> for a distinct item within the current page; if not found, then search
> from the root of the tree for the next item > the current item.

I'm not sure that I understand it correctly, can you elaborate please? From
what I see it's quite similar to what's already implemented - we look for a
distinct item on the page, and then search the index tree for a next distinct
item.

> On Wed, 10 Oct 2018 at 17:34, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Tue, 9 Oct 2018 at 18:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >
> >> It looks like good idea, but then the node should be named "index scan" and
> >> other info can be displayed in detail parts. It can be similar like "sort".
> >> The combination of unique and index skip scan looks strange :)
> >
> > maybe we don't need special index skip scan node - maybe possibility to
> > return unique values from index scan node can be good enough - some like
> > "distinct index scan" - and the implementation there can be dynamic -skip
> > scan, classic index scan,
> >
> > "index skip scan" is not good  name if the implementaion is dynamic.
>
> Yeah, that's a valid point. The good part is that index skip scan is not really
> a separate node, but just enhanced index only scan node. So indeed maybe it
> would be better to call it Index Only Scan, but show in details that we apply
> the skip scan strategy. Any other opinions about this?

To make it more clean what I mean, see attached version of the patch.

> >> I think it was query like
> >> select count(*) from (select distinct x from tab) s
>
> Thanks, I'll take a look.

I couldn't reproduce it either yet.

Attachment

index-skip-fallback-v2.patch

Re: Index Skip Scan

From

Sergei Kornilov

Date:

12 November 2018, 12:28:58

Hello

Last published patch index-skip-fallback-v2 applied and builds cleanly.

I found reproducible crash due assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> create table tablename (i int primary key);
> select distinct i from tablename where i = 1;
Query is obviously strange, but this is bug.

Also i noticed two TODO in documentation.

regards, Sergei

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

12 November 2018, 12:55:28

> On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
>
> I found reproducible crash due assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > create table tablename (i int primary key);
> > select distinct i from tablename where i = 1;
> Query is obviously strange, but this is bug.

Wow, thanks a lot! I can reproduce it too, will fix it.

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

14 November 2018, 16:48:47

> On Mon, 12 Nov 2018 at 13:55, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov <sk@zsrv.org> wrote:
> >
> > I found reproducible crash due assert failure: FailedAssertion("!(numCols > 0)", File: "pathnode.c", Line: 2795)
> > > create table tablename (i int primary key);
> > > select distinct i from tablename where i = 1;
> > Query is obviously strange, but this is bug.
>
> Wow, thanks a lot! I can reproduce it too, will fix it.

Yep, we had to check number of distinct columns too, here is the fixed patch
(with a bit more verbose commit message).

Attachment

0001-Index-skip-scan-with-fallback-v3.patch

Re: Index Skip Scan

From

Alexander Kuzmenkov

Date:

15 November 2018, 11:41:41

On 9/27/18 16:59, Jesper Pedersen wrote:
> Hi Dmitry,
>
> On 9/15/18 3:52 PM, Dmitry Dolgov wrote:
>>> On Thu, 13 Sep 2018 at 21:36, Alexander Kuzmenkov 
>>> <a.kuzmenkov@postgrespro.ru> wrote:
>>> El 13/09/18 a las 18:39, Jesper Pedersen escribió:
>>>> I think we can improve this,
>>>> and the skip scan can be strictly faster than index scan regardless of
>>>> the data.
>>>> <...>
>>>>
>>>
>>> This is something to look at -- maybe there is a way to use
>>> btpo_next/btpo_prev instead/too in order to speed things up. Atm we 
>>> just
>>> have the scan key in BTScanOpaqueData. I'll take a look after my
>>> upcoming vacation; feel free to contribute those changes in the 
>>> meantime
>>> of course.
>>
>> But having this logic inside _bt_next means that it will make a 
>> non-skip index
>> only scan a bit slower, am I right?
>
> Correct.

Well, it depends on how the skip scan is implemented. We don't have to 
make normal scans slower, because skip scan is just a separate thing.

My main point was that current implementation is good as a proof of 
concept, but it is inefficient for some data and needs some unreliable 
planner logic to work around this inefficiency. And now we also have 
execution-time fallback because planning-time fallback isn't good 
enough. This looks overly complicated. Let's try to design an algorithm 
that works well regardless of the particular data and doesn't need these 
heuristics. It should be possible to do so for btree.

Of course, I understand the reluctance to implement an entire new type 
of btree traversal. Anastasia Lubennikova suggested a tweak for the 
current method that may improve the performance for small groups of 
equal values. When we search for the next unique key, first check if it 
is contained on the current btree page using its 'high key'. If it is, 
find it on the current page. If not, search from the root of the tree 
like we do now. This should improve the performance for small equal 
groups, because there are going to be several such groups on the page. 
And this is exactly where we have the regression compared to unique + 
index scan.

By the way, what is the data for which we intend this feature to work? 
Obviously a non-unique btree index, but how wide are the tuples, and how 
big the equal groups? It would be good to have some real-world examples.

-- 
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Index Skip Scan

From

James Coleman

Date:

15 November 2018, 14:28:03

> Is skip scan only possible for index-only scan?

I haven't see much discussion of this question yet. Is there a
particular reason to lock ourselves into thinking about this only in
an index only scan?

>> I think we can improve this,
>> and the skip scan can be strictly faster than index scan regardless of
>> the data. As a first approximation, imagine that we somehow skipped
>> equal tuples inside _bt_next instead of sending them to the parent
>> Unique node. This would already be marginally faster than Unique + Index
>> scan. A more practical implementation would be to remember our position
>> in tree (that is, BTStack returned by _bt_search) and use it to skip
>> pages in bulk. This looks straightforward to implement for a tree that
>> does not change, but I'm not sure how to make it work with concurrent
>> modifications. Still, this looks a worthwhile direction to me, because
>> if we have a strictly faster skip scan, we can just use it always and
>> not worry about our unreliable statistics. What do you think?
>>
>
> This is something to look at -- maybe there is a way to use
> btpo_next/btpo_prev instead/too in order to speed things up. Atm we just
> have the scan key in BTScanOpaqueData. I'll take a look after my
> upcoming vacation; feel free to contribute those changes in the meantime
> of course.

It seems to me also that the logic necessary for this kind of
traversal has other useful applications. For example, it should be
possible to build on that logic to allow and index like t(owner_fk,
created_at) to be used to execute the following query:

select *
from t
where owner_fk in (1,2,3)
order by created_at
limit 25

without needing to fetch all tuples satisfying "owner_fk in (1,2,3)"
and subsequently sorting them.

Re: Index Skip Scan

From

Jesper Pedersen

Date:

16 November 2018, 15:06:09

Hi,

On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
>>> But having this logic inside _bt_next means that it will make a 
>>> non-skip index
>>> only scan a bit slower, am I right?
>>
>> Correct.
> 
> Well, it depends on how the skip scan is implemented. We don't have to 
> make normal scans slower, because skip scan is just a separate thing.
> 
> My main point was that current implementation is good as a proof of 
> concept, but it is inefficient for some data and needs some unreliable 
> planner logic to work around this inefficiency. And now we also have 
> execution-time fallback because planning-time fallback isn't good 
> enough. This looks overly complicated. Let's try to design an algorithm 
> that works well regardless of the particular data and doesn't need these 
> heuristics. It should be possible to do so for btree.
> 
> Of course, I understand the reluctance to implement an entire new type 
> of btree traversal. Anastasia Lubennikova suggested a tweak for the 
> current method that may improve the performance for small groups of 
> equal values. When we search for the next unique key, first check if it 
> is contained on the current btree page using its 'high key'. If it is, 
> find it on the current page. If not, search from the root of the tree 
> like we do now. This should improve the performance for small equal 
> groups, because there are going to be several such groups on the page. 
> And this is exactly where we have the regression compared to unique + 
> index scan.
>

Robert suggested something similar in [1]. I'll try and look at that 
when I'm back from my holiday.

> By the way, what is the data for which we intend this feature to work? 
> Obviously a non-unique btree index, but how wide are the tuples, and how 
> big the equal groups? It would be good to have some real-world examples.
> 

Although my primary use-case is int I agree that we should test the 
different data types, and tuple widths.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

17 November 2018, 23:27:07

> On Fri, 16 Nov 2018 at 16:06, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
> >>> But having this logic inside _bt_next means that it will make a
> >>> non-skip index
> >>> only scan a bit slower, am I right?
> >>
> >> Correct.
> >
> > Well, it depends on how the skip scan is implemented. We don't have to
> > make normal scans slower, because skip scan is just a separate thing.
> >
> > My main point was that current implementation is good as a proof of
> > concept, but it is inefficient for some data and needs some unreliable
> > planner logic to work around this inefficiency. And now we also have
> > execution-time fallback because planning-time fallback isn't good
> > enough. This looks overly complicated. Let's try to design an algorithm
> > that works well regardless of the particular data and doesn't need these
> > heuristics. It should be possible to do so for btree.
> >
> > Of course, I understand the reluctance to implement an entire new type
> > of btree traversal. Anastasia Lubennikova suggested a tweak for the
> > current method that may improve the performance for small groups of
> > equal values. When we search for the next unique key, first check if it
> > is contained on the current btree page using its 'high key'. If it is,
> > find it on the current page. If not, search from the root of the tree
> > like we do now. This should improve the performance for small equal
> > groups, because there are going to be several such groups on the page.
> > And this is exactly where we have the regression compared to unique +
> > index scan.
>
> Robert suggested something similar in [1]. I'll try and look at that
> when I'm back from my holiday.

Yeah, probably you're right. Unfortunately, I've misunderstood the previous
Robert's message in this thread with the similar approach. Jesper, I hope you
don't mind if I'll post an updated patch? _bt_skip is changed there in the
suggested way, so that it checks the current page before searching from the
root of a tree, and I've removed the fallback logic. After some
initial tests I see
that with this version skip scan over a table with 10^7 rows and 10^6
distinct values is slightly slower than a regular scan, but not that much.

> > By the way, what is the data for which we intend this feature to work?
> > Obviously a non-unique btree index, but how wide are the tuples, and how
> > big the equal groups? It would be good to have some real-world examples.
>
> Although my primary use-case is int I agree that we should test the
> different data types, and tuple widths.

My personal motivation here is exactly that we face use-cases for skip scan
from time to time. Usually it's quite few distinct values (up to a dozen or
so, which means that equal groups are quite big), but with the variety of types
and widths.

> On Thu, 15 Nov 2018 at 15:28, James Coleman <jtc331@gmail.com> wrote:
>
> > Is skip scan only possible for index-only scan?
>
> I haven't see much discussion of this question yet. Is there a
> particular reason to lock ourselves into thinking about this only in
> an index only scan?

I guess, the only reason is to limit the scope of the patch.

Attachment

0001-Index-skip-scan-v4.patch

Re: Index Skip Scan

From

Alexander Kuzmenkov

Date:

21 November 2018, 15:38:55

On 11/18/18 02:27, Dmitry Dolgov wrote:
>
> [0001-Index-skip-scan-v4.patch]

I ran a couple of tests on this, please see the cases below. As before, 
I'm setting total_cost = 1 for index skip scan so that it is chosen. 
Case 1 breaks because we determine the high key incorrectly, it is the 
second tuple on page or something like that, not the last tuple. Case 2 
is backwards scan, I don't understand how it is supposed to work. We 
call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it 
just fetches the previous tuple like the regular scan does.

case 1:
# create table t as select generate_series(1, 1000000) a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t) d;
                                QUERY PLAN
-------------------------------------------------------------------------
  Aggregate  (cost=3.50..3.51 rows=1 width=8)
    ->  Index Only Scan using ta on t  (cost=0.42..1.00 rows=200 width=4)
          Scan mode: Skip scan
(3 rows)

postgres=# select count(*) from (select distinct a from t) d;
  count
--------
  500000 -- should be 1kk
(1 row)


case 2:
# create table t as select generate_series(1, 1000000) / 2 a;
# create index ta on t(a);
# explain select count(*) from (select distinct a from t order by a desc) d;
                                      QUERY PLAN
-------------------------------------------------------------------------------------
  Aggregate  (cost=5980.81..5980.82 rows=1 width=8)
    ->  Index Only Scan Backward using ta on t  (cost=0.42..1.00 
rows=478385 width=4)
          Scan mode: Skip scan
(3 rows)

# select count(*) from (select distinct a from t order by a desc) d;
  count
--------
  502733 -- should be 500k

(1 row)


-- 

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

21 November 2018, 20:56:34

> On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
>
> On 11/18/18 02:27, Dmitry Dolgov wrote:
> >
> > [0001-Index-skip-scan-v4.patch]
>
> I ran a couple of tests on this, please see the cases below. As before,
> I'm setting total_cost = 1 for index skip scan so that it is chosen.
> Case 1 breaks because we determine the high key incorrectly, it is the
> second tuple on page or something like that, not the last tuple.

From what I see it wasn't about the high key, just a regular off by one error.
But anyway, thanks for noticing - for some reason it wasn't always
reproduceable for me, so I missed this issue. Please find fixed patch attached.
Also I think it invalidates my previous performance tests, so I would
appreciate if you can check it out too.

> Case 2
> is backwards scan, I don't understand how it is supposed to work. We
> call _bt_search(nextKey = ScanDirectionIsForward), so it seems that it
> just fetches the previous tuple like the regular scan does.

Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
remember from the previous discussions the entire topic of backward scan is
questionable for this patch, so I'll try to invest some time in it.

Attachment

0001-Index-skip-scan-v5.patch

Re: Index Skip Scan

From

Peter Geoghegan

Date:

04 December 2018, 03:26:31

On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
> remember from the previous discussions the entire topic of backward scan is
> questionable for this patch, so I'll try to invest some time in it.

Another thing that I think is related to skip scans that you should be
aware of is dynamic prefix truncation, which I started a thread on
just now [1]. While I see one big problem with the POC patch I came up
with, I think that that optimization is going to be something that
ends up happening at some point. Repeatedly descending a B-Tree when
the leading column is very low cardinality can be made quite a lot
less expensive by dynamic prefix truncation. Actually, it's almost a
perfect case for it.

I'm not asking anybody to do anything with that information. "Big
picture" thinking seems particularly valuable when working on the
B-Tree code; I don't want anybody to miss a possible future
opportunity.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
-- 
Peter Geoghegan

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

20 December 2018, 13:46:09

> On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> wrote:
> >
> > On 11/18/18 02:27, Dmitry Dolgov wrote:
> > >
> > > [0001-Index-skip-scan-v4.patch]
> >
> > I ran a couple of tests on this, please see the cases below. As before,
> > I'm setting total_cost = 1 for index skip scan so that it is chosen.
> > Case 1 breaks because we determine the high key incorrectly, it is the
> > second tuple on page or something like that, not the last tuple.
>
> From what I see it wasn't about the high key, just a regular off by one error.
> But anyway, thanks for noticing - for some reason it wasn't always
> reproduceable for me, so I missed this issue. Please find fixed patch attached.
> Also I think it invalidates my previous performance tests, so I would
> appreciate if you can check it out too.

I've performed some testing, and on my environment with a dataset of 10^7
records:

* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
  scan.

* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
  records it was about 20% slower than the regular index scan.

For me these numbers sound good, since even in quite extreme case of
approximately 10 records per group the performance of index skip scan is close
to the same for the regular index only scan.

> On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Well, no, it's callled with ScanDirectionIsForward(dir). But as far as I
> > remember from the previous discussions the entire topic of backward scan is
> > questionable for this patch, so I'll try to invest some time in it.
>
> Another thing that I think is related to skip scans that you should be
> aware of is dynamic prefix truncation, which I started a thread on
> just now [1]. While I see one big problem with the POC patch I came up
> with, I think that that optimization is going to be something that
> ends up happening at some point. Repeatedly descending a B-Tree when
> the leading column is very low cardinality can be made quite a lot
> less expensive by dynamic prefix truncation. Actually, it's almost a
> perfect case for it.

Thanks, sounds cool. I'll try it out as soon as I'll have some spare time.

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

26 January 2019, 17:45:54

> On Thu, Dec 20, 2018 at 2:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> I've performed some testing, and on my environment with a dataset of 10^7
> records:
>
> * everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip
>   scan.
>
> * above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique
>   records it was about 20% slower than the regular index scan.
>
> For me these numbers sound good, since even in quite extreme case of
> approximately 10 records per group the performance of index skip scan is close
> to the same for the regular index only scan.

Rebased version after rd_amroutine was renamed.

Attachment

0001-Index-skip-scan-v6.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

27 January 2019, 17:17:46

> On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Rebased version after rd_amroutine was renamed.

And one more to fix the documentation. Also I've noticed few TODOs in the patch
about the missing docs, and replaced them with a required explanation of the
feature.

Attachment

0001-Index-skip-scan-v7.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

30 January 2019, 17:19:05

> On Sun, Jan 27, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Jan 26, 2019 at 6:45 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > Rebased version after rd_amroutine was renamed.
>
> And one more to fix the documentation. Also I've noticed few TODOs in the patch
> about the missing docs, and replaced them with a required explanation of the
> feature.

A bit of adjustment after nodes/relation -> nodes/pathnodes.

Attachment

0001-Index-skip-scan-v8.patch

Re: Index Skip Scan

From

Kyotaro HORIGUCHI

Date:

31 January 2019, 06:31:53

Hello.

At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
> A bit of adjustment after nodes/relation -> nodes/pathnodes.

I had a look on this.

The name "index skip scan" is a different feature from the
feature with the name on other prodcuts, which means "index scan
with postfix key (of mainly of multi column key) that scans
ignoring the prefixing part" As Thomas suggested I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either:p)

Also as mentioned upthread by Peter Geoghegan, this could easly
give worse plan by underestimation. So I also suggest that this
has dynamic fallback function.  In such perspective it is not
suitable for AM API level feature.

If all leaf pages are on the buffer and the average hopping
distance is less than (expectedly) 4 pages (the average height of
the tree), the skip scan will lose. If almost all leaf pages are
staying on disk, we could win only by 2-pages step (skipping over
one page).

=====
As I'm writing the above, I came to think that it's better
implement this as an pure executor optimization.

Specifically, let _bt_steppage count the ratio of skipped pages
so far then if the ratio exceeds some threshold (maybe around
3/4) it gets into hopscotching mode, where it uses index scan to
find the next page (rather than traversing). As mentioned above,
I think using skip scan to go beyond the next page is a good
bet. If the success ration of skip scans goes below some
threshold (I'm not sure for now), we should fall back to
traversing.

Any opinions?

====

Some comments on the patch below.

+     skip scan approach, which is based on the idea of
+     <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
+     Loose index scan</ulink>. Rather than scanning all equal values of a key,
+     as soon as a new value is found, it will search for a larger value on the

I'm not sure it is a good thing to put a pointer to rather
unstable source in the documentation.


This adds a new AM method but it seems avaiable only for ordered
indexes, specifically btree. And it seems that the feature can be
implemented in btgettuple since btskip apparently does the same
thing. (I agree to Robert in the point in [1]).

[1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com


Related to the above, it seems better that the path generation of
skip scan is a part of index scan. Whether skip scan or not is a
matter of index scan itself.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Jesper Pedersen

Date:

01 February 2019, 19:24:38

Hi,

On 1/31/19 1:31 AM, Kyotaro HORIGUCHI wrote:
> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
>> A bit of adjustment after nodes/relation -> nodes/pathnodes.
> 
> I had a look on this.
> 
> The name "index skip scan" is a different feature from the
> feature with the name on other prodcuts, which means "index scan
> with postfix key (of mainly of multi column key) that scans
> ignoring the prefixing part" As Thomas suggested I'd suggest that
> we call it "index hop scan". (I can accept Hopscotch, either:p)
> 
> Also as mentioned upthread by Peter Geoghegan, this could easly
> give worse plan by underestimation. So I also suggest that this
> has dynamic fallback function.  In such perspective it is not
> suitable for AM API level feature.
> 
> If all leaf pages are on the buffer and the average hopping
> distance is less than (expectedly) 4 pages (the average height of
> the tree), the skip scan will lose. If almost all leaf pages are
> staying on disk, we could win only by 2-pages step (skipping over
> one page).
> 
> =====
> As I'm writing the above, I came to think that it's better
> implement this as an pure executor optimization.
> 
> Specifically, let _bt_steppage count the ratio of skipped pages
> so far then if the ratio exceeds some threshold (maybe around
> 3/4) it gets into hopscotching mode, where it uses index scan to
> find the next page (rather than traversing). As mentioned above,
> I think using skip scan to go beyond the next page is a good
> bet. If the success ration of skip scans goes below some
> threshold (I'm not sure for now), we should fall back to
> traversing.
> 
> Any opinions?
> 
> ====
> 
> Some comments on the patch below.
> 
> +     skip scan approach, which is based on the idea of
> +     <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
> +     Loose index scan</ulink>. Rather than scanning all equal values of a key,
> +     as soon as a new value is found, it will search for a larger value on the
> 
> I'm not sure it is a good thing to put a pointer to rather
> unstable source in the documentation.
> 
> 
> This adds a new AM method but it seems avaiable only for ordered
> indexes, specifically btree. And it seems that the feature can be
> implemented in btgettuple since btskip apparently does the same
> thing. (I agree to Robert in the point in [1]).
> 
> [1] https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com
> 
> 
> Related to the above, it seems better that the path generation of
> skip scan is a part of index scan. Whether skip scan or not is a
> matter of index scan itself.
> 

Thanks for your valuable feedback ! And, calling it "Loose index scan" 
or something else is better.

Dmitry and I will look at this and take it into account for the next 
version.

For now, I have switched the CF entry to WoA.

Thanks again !

Best regards,
  Jesper

Re: Index Skip Scan

From

James Coleman

Date:

01 February 2019, 21:04:58

On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Also as mentioned upthread by Peter Geoghegan, this could easly
> give worse plan by underestimation. So I also suggest that this
> has dynamic fallback function.  In such perspective it is not
> suitable for AM API level feature.
>
> If all leaf pages are on the buffer and the average hopping
> distance is less than (expectedly) 4 pages (the average height of
> the tree), the skip scan will lose. If almost all leaf pages are
> staying on disk, we could win only by 2-pages step (skipping over
> one page).
>
> =====
> As I'm writing the above, I came to think that it's better
> implement this as an pure executor optimization.
>
> Specifically, let _bt_steppage count the ratio of skipped pages
> so far then if the ratio exceeds some threshold (maybe around
> 3/4) it gets into hopscotching mode, where it uses index scan to
> find the next page (rather than traversing). As mentioned above,
> I think using skip scan to go beyond the next page is a good
> bet. If the success ration of skip scans goes below some
> threshold (I'm not sure for now), we should fall back to
> traversing.
>
> Any opinions?

Hi!

I'd like to offer a counterpoint: in cases where this a huge win we
definitely do want this to affect cost estimation, because if it's
purely an executor optimization the index scan path may not be chosen
even when skip scanning would be a dramatic improvement.

I suppose that both requirements could be met by incorporating it into
the existing index scanning code and also modifying to costing to
(only when we have high confidence?) account for the optimization. I'm
not sure if that makes things better than the current state of the
patch or not.

James Coleman

Re: Index Skip Scan

From

Andres Freund

Date:

01 February 2019, 22:05:03

On 2019-02-01 16:04:58 -0500, James Coleman wrote:
> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Also as mentioned upthread by Peter Geoghegan, this could easly
> > give worse plan by underestimation. So I also suggest that this
> > has dynamic fallback function.  In such perspective it is not
> > suitable for AM API level feature.
> >
> > If all leaf pages are on the buffer and the average hopping
> > distance is less than (expectedly) 4 pages (the average height of
> > the tree), the skip scan will lose. If almost all leaf pages are
> > staying on disk, we could win only by 2-pages step (skipping over
> > one page).
> >
> > =====
> > As I'm writing the above, I came to think that it's better
> > implement this as an pure executor optimization.
> >
> > Specifically, let _bt_steppage count the ratio of skipped pages
> > so far then if the ratio exceeds some threshold (maybe around
> > 3/4) it gets into hopscotching mode, where it uses index scan to
> > find the next page (rather than traversing). As mentioned above,
> > I think using skip scan to go beyond the next page is a good
> > bet. If the success ration of skip scans goes below some
> > threshold (I'm not sure for now), we should fall back to
> > traversing.
> >
> > Any opinions?
> 
> I'd like to offer a counterpoint: in cases where this a huge win we
> definitely do want this to affect cost estimation, because if it's
> purely an executor optimization the index scan path may not be chosen
> even when skip scanning would be a dramatic improvement.

+many.

Greetings,

Andres Freund

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

20 February 2019, 16:35:08

> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> Dmitry and I will look at this and take it into account for the next
> version.

In the meantime, just to not forget, I'm going to post another version with a
fix for cursor fetch backwards, which was crashing before. And talking about
this topic I wanted to ask to clarify a few points, since looks like I'm
missing something:

One of not yet addressed points in this patch is amcanbackward. From the
historical thread, mentioned in the first email:

> On 2016-11-25 at 01:33 Robert Haas <robertmhaas@gmail.com> wrote:
> +    if (ScanDirectionIsForward(dir))
> +    {
> +        so->currPos.moreLeft = false;
> +        so->currPos.moreRight = true;
> +    }
> +    else
> +    {
> +        so->currPos.moreLeft = true;
> +        so->currPos.moreRight = false;
> +    }
>
>
>
> The lack of comments makes it hard for me to understand what the
> motivation for this is, but I bet it's wrong.  Suppose there's a
> cursor involved here and the user tries to back up. Instead of having
> a separate amskip operation, maybe there should be a flag attached to
> a scan indicating whether it should return only distinct results.
> Otherwise, you're allowing for the possibility that the same scan
> might sometimes skip and other times not skip, but then it seems hard
> for the scan to survive cursor operations.  Anyway, why would that be
> useful?

I assume that "sometimes skip and other times not skip" refers to the
situation, when we did fetch forward and jump something over, and then right
away doing fetch backwards, when we don't actually need to skip anything and
can get the result right away, right? If so, I can't find any comments about
why is it should be a problem for cursor operations?

Attachment

v9-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Jeff Janes

Date:

28 February 2019, 21:45:15

On Wed, Feb 20, 2019 at 11:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> Dmitry and I will look at this and take it into account for the next
> version.

In the meantime, just to not forget, I'm going to post another version with a
fix for cursor fetch backwards, which was crashing before.

This version of the patch can return the wrong answer.

create index on pgbench_accounts (bid, aid);

begin; declare c cursor for select distinct on (bid) bid, aid from pgbench_accounts order by bid, aid;

fetch 2 from c;

bid | aid

-----+---------

1 | 1

2 | 100,001

fetch backward 1 from c;

bid | aid

-----+---------

1 | 100,000

Without the patch, instead of getting a wrong answer, I get an error:

ERROR: cursor can only scan forward

HINT: Declare it with SCROLL option to enable backward scan.

If I add "SCROLL", then I do get the right answer with the patch.

Cheers,

Jeff

Re: Index Skip Scan

From

Jeff Janes

Date:

28 February 2019, 22:23:15

On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello.

At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in <CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
> A bit of adjustment after nodes/relation -> nodes/pathnodes.

I had a look on this.

The name "index skip scan" is a different feature from the
feature with the name on other prodcuts, which means "index scan
with postfix key (of mainly of multi column key) that scans
ignoring the prefixing part" As Thomas suggested I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, either:p)

I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this:

Set up:

pgbench -i -s50

create index on pgbench_accounts (bid, aid);

alter table pgbench_accounts drop constraint pgbench_accounts_pkey ;

Query:
explain analyze with t as (select distinct bid from pgbench_accounts )

select pgbench_accounts.* from pgbench_accounts join t using (bid) where aid=5;

If we accept this patch, I hope it would be expanded in the future to give similar performance as the above query does even when the query is written in its more natural way of:

explain analyze select * from pgbench_accounts where aid=5;

(which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan)

I don't think we should give it a different name, just because our initial implementation is incomplete.

Or do you think our implementation of one feature does not really get us closer to implementing the other?

Cheers,

Jeff

Re: Index Skip Scan

From

Thomas Munro

Date:

28 February 2019, 23:03:06

On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
>> > A bit of adjustment after nodes/relation -> nodes/pathnodes.
>>
>> I had a look on this.
>>
>> The name "index skip scan" is a different feature from the
>> feature with the name on other prodcuts, which means "index scan
>> with postfix key (of mainly of multi column key) that scans
>> ignoring the prefixing part" As Thomas suggested I'd suggest that
>> we call it "index hop scan". (I can accept Hopscotch, either:p)
>
>
> I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan,
nota fundamentally different thing.  They don't ignore the prefix part, they use that part in a way to cancel itself
outto give the same answer, but faster.  I think they would also use this skip method to get distinct values if that is
whatis requested.  But they would go beyond that to also use it to do something similar to the plan we get with this: 

Hi Jeff,

"Hop scan" was just a stupid joke that occurred to me when I saw that
DB2 had gone for "jump scan".  I think "skip scan" is a perfectly good
name and it's pretty widely used by now (for example, by our friends
over at SQLite to blow us away at these kinds of queries).

Yes, simple distinct value scans are indeed just the easiest kind of
thing to do with this kind of scan-with-fast-forward.  As discussed
already in this thread and the earlier one there is a whole family of
tricks you can do, and the thing that most people call an "index skip
scan" is indeed the try-every-prefix case where you can scan an index
on (a, b) given a WHERE clause b = x.  Perhaps the simple distinct
scan could appear as "Distinct Index Skip Scan"?  And perhaps the
try-every-prefix-scan could appear as just "Index Skip Scan"?  Whether
these are the same executor node is a good question; at one point I
proposed a separate nest-loop like node for the try-every-prefix-scan,
but Robert shot that down pretty fast.  I now suspect (as he said)
that all of this belongs in the index scan node, as different modes.
The behaviour is overlapping; for "Distinct Index Skip Scan" you skip
to each distinct prefix and emit one tuple, whereas for "Index Skip
Scan" you skip to each distinct prefix and then perform a regular scan
for the prefix + the suffix emitting matches.

> (which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan)

Nice.

> I don't think we should give it a different name, just because our initial implementation is incomplete.

+1

> Or do you think our implementation of one feature does not really get us closer to implementing the other?

My question when lobbing the earlier sketch patches into the mailing
list a few years back was: is this simple index AM interface and
implementation (once debugged) powerful enough for various kinds of
interesting skip-based plans?  So far I have the impression that it
does indeed work for Distinct Index Skip Scan (demonstrated), Index
Skip Scan (no one has yet tried that), and special cases of extrema
aggregate queries (foo, MIN(bar) can be performed by skip scan of
index on (foo, bar)), but may not work for the semi-related merge join
trickery mentioned in a paper posted some time back (though I don't
recall exactly why).  Another question is whether it should all be
done by the index scan node, and I think the answer is yet.

--
Thomas Munro
https://enterprisedb.com

Re: Index Skip Scan

From

Tomas Vondra

Date:

28 February 2019, 23:10:55

On 3/1/19 12:03 AM, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 11:23 AM Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcVP18wYiO=aa+fz3GuncuTF52q1sufB7ise37TJPSDK1w@mail.gmail.com>
>>>> A bit of adjustment after nodes/relation -> nodes/pathnodes.
>>>
>>> I had a look on this.
>>>
>>> The name "index skip scan" is a different feature from the
>>> feature with the name on other prodcuts, which means "index scan
>>> with postfix key (of mainly of multi column key) that scans
>>> ignoring the prefixing part" As Thomas suggested I'd suggest that
>>> we call it "index hop scan". (I can accept Hopscotch, either:p)
>>
>>
>> I think that what we have proposed here is just an incomplete implementation of what other products call a skip
scan,not a fundamentally different thing.  They don't ignore the prefix part, they use that part in a way to cancel
itselfout to give the same answer, but faster.  I think they would also use this skip method to get distinct values if
thatis what is requested.  But they would go beyond that to also use it to do something similar to the plan we get with
this:
> 
> Hi Jeff,
> 
> "Hop scan" was just a stupid joke that occurred to me when I saw that
> DB2 had gone for "jump scan".  I think "skip scan" is a perfectly good
> name and it's pretty widely used by now (for example, by our friends
> over at SQLite to blow us away at these kinds of queries).
> 

+1 to "hop scan"


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

05 March 2019, 15:05:42

> On Thu, Feb 28, 2019 at 10:45 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
> This version of the patch can return the wrong answer.

Yes, indeed. In fact it answers my previous question related to the backward
cursor scan, when while going back we didn't skip enough. Within the current
approach it can be fixed by proper skipping for backward scan, something like
in the attached patch.

Although there are still some rough edges, e.g. going forth, back and forth
again leads to a sutiation, when `_bt_first` is not applied anymore and the
first element is wrongly skipped. I'll try to fix it with the next version of
patch.

> If we accept this patch, I hope it would be expanded in the future to give
> similar performance as the above query does even when the query is written in
> its more natural way of:

Yeah, I hope the current approach with a new index am routine can be extended
for that.

> Without the patch, instead of getting a wrong answer, I get an error:

Right, as far as I can see without a skip scan and SCROLL, a unique + index
scan is used, where amcanbackward is false by default. So looks like it's not
really patch related.

Attachment

v10-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

14 March 2019, 13:32:49

> On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Although there are still some rough edges, e.g. going forth, back and forth
> again leads to a sutiation, when `_bt_first` is not applied anymore and the
> first element is wrongly skipped. I'll try to fix it with the next version of
> patch.

It turns out that `_bt_skip` was unnecessary applied every time when scan was
restarted from the beginning. Here is the fixed version of patch.

Attachment

v11-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Kyotaro HORIGUCHI

Date:

15 March 2019, 00:51:57

At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com>
> > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > Although there are still some rough edges, e.g. going forth, back and forth
> > again leads to a sutiation, when `_bt_first` is not applied anymore and the
> > first element is wrongly skipped. I'll try to fix it with the next version of
> > patch.
> 
> It turns out that `_bt_skip` was unnecessary applied every time when scan was
> restarted from the beginning. Here is the fixed version of patch.

> nbtsearch.c: In function ‘_bt_skip’:
> nbtsearch.c:1292:11: error: ‘struct IndexScanDescData’ has no member named ‘xs_ctup’; did you mean ‘xs_itup’?
>      scan->xs_ctup.t_self = currItem->heapTid;

Unfortunately a recent commit c2fe139c20 hit this.

Date:   Mon Mar 11 12:46:41 2019 -0700
>       Index scans now store the result of a search in
>       IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
>       target is not generally a HeapTuple anymore that seems cleaner.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Kyotaro HORIGUCHI

Date:

15 March 2019, 03:54:52

Hello.

At Thu, 14 Mar 2019 14:32:49 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcUSuFBhGVFZN_AVSxRbt5wr_4_YEYwv8PcQB=m6J6Zpvg@mail.gmail.com>
> > On Tue, Mar 5, 2019 at 4:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > Although there are still some rough edges, e.g. going forth, back and forth
> > again leads to a sutiation, when `_bt_first` is not applied anymore and the
> > first element is wrongly skipped. I'll try to fix it with the next version of
> > patch.
> 
> It turns out that `_bt_skip` was unnecessary applied every time when scan was
> restarted from the beginning. Here is the fixed version of patch.

I have some comments on the latest v11 patch.

L619:
> +    indexstate->ioss_NumDistinctKeys = node->distinctPrefix;

The number of distinct prefix keys has various names in this
patch.  They should be unified as far as possible.

L:728
> +                    root->distinct_pathkeys > 0)

It is not an integer, but a list.

L730:
> +                    Path *subpath = (Path *)
> +                        create_skipscan_unique_path(root,

The name "subpath" here is not a subpath, but it can be removed
by directly calling create_skipscan_unique_path in add_path.

L:758
> +create_skipscan_unique_path(PlannerInfo *root,
> +                            RelOptInfo *rel,
> +                            Path *subpath,
> +                            int numCols,

The "subpath" is not a subpath. How about basepath, or orgpath?
The "numCols" doesn't makes clear sense. unique_prefix_keys?

L764:
> +    IndexPath *pathnode = makeNode(IndexPath);
> +
> +    Assert(IsA(subpath, IndexPath));
> +
> +    /* We don't want to modify subpath, so make a copy. */
> +    memcpy(pathnode, subpath, sizeof(IndexPath));

Why don't you just use copyObject()?

L773:
> +    Assert(numCols > 0);

Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?

L586:
> +     * Check if we need to skip to the next key prefix, because we've been
> +     * asked to implement DISTINCT.
> +     */
> +    if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
> +    {
> +        if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> +        {
> +            /* Reached end of index. At this point currPos is invalidated,

I thought a while on this bit. It seems that the lower layer must
know whether it has emitted the first tuple. So I think that this
code can be reduced as the follows.

>   if (node->ioss_NumDistinctKeys && 
>       !index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
>      return ExecClearTupler(slot);

Then, the index_skip returns true with doing nothing if the
scandesc is in the initial state. (Of course other index AMs can
do something in the first call.)  ioss_FirstTupleEmitted and the
comment can be removed.

By the way this patch seems to still be forgetting about the
explicit rescan case but just doing this makes such consideration
not required.

L1032:
> + Index Only Scan using tenk1_four on public.tenk1
> +   Output: four
> +   Scan mode: Skip scan

The "Scan mode" has only one value and it is shown only for
"Index Only Scan" case. It seems to me that "Index Skip Scan"
implies Index Only Scan. How about just "Index Skip Scan"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

16 March 2019, 16:14:20

> On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I have some comments on the latest v11 patch.

Thank you!

> L619:
> > +    indexstate->ioss_NumDistinctKeys = node->distinctPrefix;
>
> The number of distinct prefix keys has various names in this
> patch.  They should be unified as far as possible.

Good point, I've renamed everything to skipPrefixSize, it seems for me that
this name should be self explanatory enough.

> L:728
> > +                    root->distinct_pathkeys > 0)
>
> It is not an integer, but a list.

Thanks for noticing, fixed (via compare with NIL, since we just need to know if
this list is empty or not).

> L730:
> > +                    Path *subpath = (Path *)
> > +                        create_skipscan_unique_path(root,
>
> The name "subpath" here is not a subpath, but it can be removed
> by directly calling create_skipscan_unique_path in add_path.
>
>
> L:758
> > +create_skipscan_unique_path(PlannerInfo *root,
> > +                            RelOptInfo *rel,
> > +                            Path *subpath,
> > +                            int numCols,
>
> The "subpath" is not a subpath. How about basepath, or orgpath?
> The "numCols" doesn't makes clear sense. unique_prefix_keys?

I agree, suggested names sound good.

> L764:
> > +    IndexPath *pathnode = makeNode(IndexPath);
> > +
> > +    Assert(IsA(subpath, IndexPath));
> > +
> > +    /* We don't want to modify subpath, so make a copy. */
> > +    memcpy(pathnode, subpath, sizeof(IndexPath));
>
> Why don't you just use copyObject()?

Maybe I'm missing something, but I don't see that copyObject works with path
nodes, does it? I've tried it with subpath directly and got `unrecognized node
type`.

> L773:
> > +    Assert(numCols > 0);
>
> Maybe Assert(numCols > 0 && numCols <= list_length(path->pathkeys)); ?

Yeah, makes sense.

> L586:
> > +     * Check if we need to skip to the next key prefix, because we've been
> > +     * asked to implement DISTINCT.
> > +     */
> > +    if (node->ioss_NumDistinctKeys > 0 && node->ioss_FirstTupleEmitted)
> > +    {
> > +        if (!index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> > +        {
> > +            /* Reached end of index. At this point currPos is invalidated,
>
> I thought a while on this bit. It seems that the lower layer must
> know whether it has emitted the first tuple. So I think that this
> code can be reduced as the follows.
>
> >   if (node->ioss_NumDistinctKeys &&
> >       !index_skip(scandesc, direction, node->ioss_NumDistinctKeys))
> >      return ExecClearTupler(slot);
>
> Then, the index_skip returns true with doing nothing if the
> scandesc is in the initial state. (Of course other index AMs can
> do something in the first call.)  ioss_FirstTupleEmitted and the
> comment can be removed.

I'm not sure then, how to figure out when scandesc is in the initial state from
the inside index_skip without passing the node as an argument? E.g. in the
case, describe in the commentary, when we do fetch forward/fetch backward/fetch
forward again.

> L1032:
> > + Index Only Scan using tenk1_four on public.tenk1
> > +   Output: four
> > +   Scan mode: Skip scan
>
> The "Scan mode" has only one value and it is shown only for
> "Index Only Scan" case. It seems to me that "Index Skip Scan"
> implies Index Only Scan. How about just "Index Skip Scan"?

Do you mean, show "Index Only Scan", and then "Index Skip Scan" in details,
instead of "Scan mode", right?

Attachment

v12-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

19 March 2019, 13:07:32

> On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I have some comments on the latest v11 patch.
>
> Thank you!

In the meantime here is a new version, rebased after tableam changes.

Attachment

v13-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

28 March 2019, 10:01:24

> On Tue, Mar 19, 2019 at 2:07 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I have some comments on the latest v11 patch.
> >
> > Thank you!
>
> In the meantime here is a new version, rebased after tableam changes.

Rebase after refactoring of nbtree insertion scankeys. But so far it's purely
mechanical, just to make it work - I guess I'll need to try to rewrite some
parts of the patch, that don't look natural now, accordingly. And maybe to
leverage dynamic prefix truncation per Peter suggestion.

Attachment

v14-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

11 May 2019, 16:35:41

> On Thu, Mar 28, 2019 at 11:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Rebase after refactoring of nbtree insertion scankeys. But so far it's purely
> mechanical, just to make it work - I guess I'll need to try to rewrite some
> parts of the patch, that don't look natural now, accordingly.

Here is the updated version with the changes I was talking about (mostly about
readability and code cleanup). I've also added few tests for a cursor behaviour.

Attachment

v15-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

29 May 2019, 15:50:51

> On Sat, May 11, 2019 at 6:35 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Here is the updated version with the changes I was talking about (mostly about
> readability and code cleanup). I've also added few tests for a cursor behaviour.

And one more cosmetic rebase after pg_indent.

Attachment

v16-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Floris Van Nee

Date:

01 June 2019, 04:01:38

After some talks with Jesper at PGCon about the Index Skip Scan, I started testing this patch, because it seems to have great potential in speeding up many of our queries (great conference by the way, really enjoyed my first time being there!). I haven't looked in depth to the code itself, but I focused on some testing with real data that we have.

Let me start by sketching our primary use case for this, as it is similar, but slightly different than what was discussed earlier in this thread. I think this use case is something a lot of people who handle timeseries data have. Our database has many tables with timeseries data. We don't update rows, but just insert new rows each time. One example of this would be a table with prices for instruments. Instruments are identified by a column called feedcode. Prices of instrument update with a certain frequency. Each time it updates we insert a new row with the new value and the timestamp at that time. So in the simplest form, you could see it as a table like this:

create table prices (feedcode text, updated_at timestamptz, value float8); -- there'll be some other columns as well, this is just an example

create unique index on prices (feedcode, updated_at desc);

This table perfectly fits the criteria for the index skip scan as there are relatively few distinct feedcodes, but each of them has quite a lot of price inserts (imagine 1000 distinct feedcodes, each of them having one price per second). We normally partition our data by a certain time interval, so let's say we're just looking at one day of prices here. We have other cases with higher update frequencies and/or more distinct values though.

Typical queries on this table involve querying the price at a certain point in time, or simply querying the latest update. If we know the feedcode, this is easy:

select * from prices where feedcode='A' and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1

Unfortunately, this gets hard if you want to know the price of everything at a certain point in time. The query then becomes:

select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc

Up until now (even with this patch) this uses a regular index scan + a unique node which scans the full index, which is terribly slow and is also not constant - as the table grows it becomes slower and slower.

Obviously there are currently already ways to speed this up using the recursive loose index scan, but I think everybody agrees that those are pretty unreadable. However, since they're also several orders of magnitude faster, we actually use them everywhere. Eg.

-- certain point in time

-- first query *all* distinct feedcodes (disregarding time), then look do an index scan for every feedcode found to see if it has an update in the time window we're interested in

-- this essentially means we're doing 2*N index scans

with recursive t as (

select feedcode from prices order by feedcode, updated_at desc limit 1

union all

select n.feedcode from t

cross join lateral (select feedcode from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) n

) select n.* from t

cross join lateral (select * from prices where feedcode=t.feedcode and updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc limit 1) n

-- just latest value

-- if we're interested in just the latest value, it can actually be optimized to just N index scans like this.

-- to make it even more confusing - there's a tradeoff here.. if you're querying a timestamp close to the latest available timestamp, it is often faster to use this method anyway and just put the filter for updated_at inside this query. this avoids the overhead of 2*N index scans, at the expense that the LIMIT 1 may have to scan several tuples before finding one that matches the timestamp criteria. With the 2*N method above we're guaranteed that the first tuple it sees is the correct tuple, but we're doing many more scans...

with recursive t as (

select * from prices order by feedcode, updated_at desc limit 1

union all

select n.* from t

cross join lateral (select * from prices where feedcode > t.feedcode order by feedcode, updated_at desc limit 1) _

) select * from t

I hope this makes our current situation clear. Please do ask if I need to elaborate on something here.

So what changes with this patch? The great thing is that the recursive CTE is not required anymore! This is a big win for readability and it helps performance as well. It makes everything much better. I am really happy with these results. If the index skip scan triggers, it is easily over 100x faster than the naive 'distinct on' query in earlier versions of Postgres. It is also quite a bit faster than the recursive CTE version of the query.

I have a few remarks though. I tested some of our queries with the patch and found that the following query would (with patch) work best now for arbitrary timestamps:

-- first query all distinct values using the index skip scan, then do an index scan for each of these

select r.* from (

select distinct feedcode from prices

) k

cross join lateral (

select *

from prices

where feedcode=k.feedcode and updated_at <= '2019-06-01 12:00'

order by feedcode, updated_at desc

limit 1

) r

While certainly a big improvement over the recursive CTE, it would be nice if the even simpler form with the 'distinct on' would work out of the box using an index skip scan.

select distinct on (feedcode) * from prices where updated_at <= '2019-06-01 12:00' order by feedcode, updated_at desc

As far as I can see, there are two main problems with that at the moment.

1) Only support for Index-Only scan at the moment, not for regular index scans. This was already mentioned upthread and I can understand that it was left out until now to constrain the scope of this. However, if we were to support 'distinct on' + selecting columns that are not part of the index we need a regular index scan instead of the index only scan.

2) The complicating part that we're interested in the value 'at a specific point in time'. This comparison with updated_at messes up all efficiency, as the index scan first looks at the *latest* updated_at for a certain feedcode and then walks the tree until it finds a tuple that matches the updated_at criteria (which possibly never happens, in which case it will happily walk over the full index). I'm actually unsure if there is anything we can do about this (without adding a lot of complexity), aside from just rewriting the query itself in the way I did where we're doing the skip scan for all distinct items followed by a second round of index scans for the specific point in time. I'm struggling a bit to explain this part clearly - I hope it's clear though. Please let me know if I should elaborate. Perhaps it's easiest to see by the difference in speed between the following two queries:

select distinct feedcode from prices -- approx 10ms

select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms

Both use the index skip scan, but the first one is very fast, because it can skip large parts of the index. The second one scans the full index, because it never finds any row that matches the where condition so it can never skip anything.

Thanks for this great patch. It is already very useful and fills a gap that has existed for a long time. It is going to make our queries so much more readable and performant if we won't have to resort to recursive CTEs anymore!

-Floris

Re: Index Skip Scan

From

Floris Van Nee

Date:

01 June 2019, 04:10:23

Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be able to check?

Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on (market, feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail) and filling it with data.

The following query uses an index skip scan and returns just 1 row (incorrect!)

select distinct on (market, feedcode) market, feedcode

from streams.base_price

where market='TEST'

The following query still uses the regular index scan and returns many more rows (correct)

select distinct on (market, feedcode) *

from streams.base_price

where market='TEST'

It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in the index are skipped.

-Floris

Re: Index Skip Scan

From

Rafia Sabih

Date:

01 June 2019, 10:03:30

On Sat, 1 Jun 2019 at 06:10, Floris Van Nee <florisvannee@optiver.com> wrote:
>
> Actually I'd like to add something to this. I think I've found a bug in the current implementation. Would someone be
ableto check? 
>
I am willing to give it a try.
> Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on
(market,feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail)
andfilling it with data. 
>
>
> The following query uses an index skip scan and returns just 1 row (incorrect!)
>
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
>
> The following query still uses the regular index scan and returns many more rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'
>
Aren't those two queries different?
select distinct on (market, feedcode) market, feedcode vs select
distinct on (market, feedcode)*
Anyhow, it's just the difference in projection so doesn't matter much.
I verified this scenario at my end and you are right, there is a bug.
Here is my repeatable test case,

create table t (market text, feedcode text, updated_at timestamptz,
value float8) ;
create index on t (market, feedcode, updated_at desc);
insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
'2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
insert into t values('TEST', 'jsgfhdfjd', (select timestamp
'2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);

Now, without the patch,
select distinct on (market, feedcode) market, feedcode from t  where
market='TEST';
 market | feedcode
--------+-----------
 TEST   | abcdef
 TEST   | jsgfhdfjd
(2 rows)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
                           QUERY PLAN
----------------------------------------------------------------
 Unique  (cost=12.20..13.21 rows=2 width=13)
   ->  Sort  (cost=12.20..12.70 rows=201 width=13)
         Sort Key: feedcode
         ->  Seq Scan on t  (cost=0.00..4.51 rows=201 width=13)
               Filter: (market = 'TEST'::text)
(5 rows)

And with the patch,
select distinct on (market, feedcode) market, feedcode from t   where
market='TEST';
 market | feedcode
--------+----------
 TEST   | abcdef
(1 row)

explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Index Only Scan using t_market_feedcode_updated_at_idx on t
(cost=0.14..0.29 rows=2 width=13)
   Scan mode: Skip scan
   Index Cond: (market = 'TEST'::text)
(3 rows)

Notice that in the explain statement it shows correct number of rows
to be skipped.

--
Regards,
Rafia Sabih

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

01 June 2019, 10:28:28

> On Sat, Jun 1, 2019 at 6:10 AM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> After some talks with Jesper at PGCon about the Index Skip Scan, I started
> testing this patch, because it seems to have great potential in speeding up
> many of our queries (great conference by the way, really enjoyed my first
> time being there!). I haven't looked in depth to the code itself, but I
> focused on some testing with real data that we have.

Thanks!

> Actually I'd like to add something to this. I think I've found a bug in the
> current implementation. Would someone be able to check?
>
> The following query uses an index skip scan and returns just 1 row (incorrect!)
>
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
>
> The following query still uses the regular index scan and returns many more
> rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'

Yes, good catch, I'll investigate. Looks like in the current implementation
something is not quite right, when we have this order of columns in an index
and where clause (e.g. in the examples above everything seems fine if we create
index over (feedcode, market) and not the other way around).

> As far as I can see, there are two main problems with that at the moment.
>
> 1) Only support for Index-Only scan at the moment, not for regular index
>    scans. This was already mentioned upthread and I can understand that it
>    was left out until now to constrain the scope of this. However, if we were
>    to support 'distinct on' + selecting columns that are not part of the
>    index we need a regular index scan instead of the index only scan.

Sure, it's something I hope we can tackle as the next step.

> select distinct feedcode from prices -- approx 10ms
>
> select distinct feedcode from prices where updated_at <= '1999-01-01 00:00' -- approx 200ms
>
> Both use the index skip scan, but the first one is very fast, because it can
> skip large parts of the index. The second one scans the full index, because
> it never finds any row that matches the where condition so it can never skip
> anything.

Interesting, I'll take a closer look.

Re: Index Skip Scan

From

Floris Van Nee

Date:

01 June 2019, 15:33:57

Hi,

Thanks for the helpful replies.

> Yes, good catch, I'll investigate. Looks like in the current implementation
> something is not quite right, when we have this order of columns in an index
> and where clause (e.g. in the examples above everything seems fine if we create
> index over (feedcode, market) and not the other way around).

I did a little bit of investigation and it seems to occur because in pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if there is an equality condition with a constant in the corresponding WHERE clause.

* 1. If the new pathkey's equivalence class contains a constant, and isn't

* below an outer join, then we can disregard it as a sort key. An example:

* SELECT ... WHERE x = 42 ORDER BY x, y;

In planner.c it builds the list of distinct_pathkeys, which is then used for the index skip scan to skip over the first length(distinct_pathkeys) columns when it does a skip. In my query, the distinct_pathkeys list only contains 'feedcode' and not 'market', because 'market' was considered redundant due to the WHERE clause. However, the index skip scan interprets this as that it has to skip over just the first column.

We need to get this list of number of prefix columns to skip differently while building the plan. We need the 'real' number of distinct keys without throwing away the redundant ones. However, I'm not sure if this information can still be obtained while calling create_skipscan_unique_path? But I'm sure people here will have much better ideas than me about this :-)

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

01 June 2019, 16:57:31

> On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> I did a little bit of investigation and it seems to occur because in
> pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
> there is an equality condition with a constant in the corresponding WHERE
> clause.
> ...
> However, the index skip scan interprets this as that it has to skip over just
> the first column.

Right, passing correct number of columns fixes this particular problem. But
while debugging I've also discovered another related issue, when the current
implementation seems to have a few assumptions, that are not correct if we have
an index condition and a distinct column is not the first in the index. I'll
try to address these in a next version of the patch in the nearest future.

Re: Index Skip Scan

From

Jesper Pedersen

Date:

03 June 2019, 18:13:56

Hi Floris,

On 6/1/19 12:10 AM, Floris Van Nee wrote:
> Given a table definition of (market text, feedcode text, updated_at timestamptz, value float8) and an index on
(market,feedcode, updated_at desc) (note that this table slightly deviates from what I described in my previous mail)
andfilling it with data.
 
> 
> 
> The following query uses an index skip scan and returns just 1 row (incorrect!)
> 
> select distinct on (market, feedcode) market, feedcode
> from streams.base_price
> where market='TEST'
> 
> The following query still uses the regular index scan and returns many more rows (correct)
> select distinct on (market, feedcode) *
> from streams.base_price
> where market='TEST'
> 
> 
> It seems that partially filtering on one of the distinct columns triggers incorrect behavior where too many rows in
theindex are skipped.
 
> 
> 

Thanks for taking a look at the patch, and your feedback on it.

I'll def look into this once I'm back from my travels.

Best regards,
  Jesper

Re: Index Skip Scan

From

Jesper Pedersen

Date:

03 June 2019, 18:16:45

Hi Rafia,

On 6/1/19 6:03 AM, Rafia Sabih wrote:
> Here is my repeatable test case,
> 
> create table t (market text, feedcode text, updated_at timestamptz,
> value float8) ;
> create index on t (market, feedcode, updated_at desc);
> insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
> 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
> '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
> insert into t values('TEST', 'jsgfhdfjd', (select timestamp
> '2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
> timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
> 
> Now, without the patch,
> select distinct on (market, feedcode) market, feedcode from t  where
> market='TEST';
>   market | feedcode
> --------+-----------
>   TEST   | abcdef
>   TEST   | jsgfhdfjd
> (2 rows)
> explain select distinct on (market, feedcode) market, feedcode from t
> where market='TEST';
>                             QUERY PLAN
> ----------------------------------------------------------------
>   Unique  (cost=12.20..13.21 rows=2 width=13)
>     ->  Sort  (cost=12.20..12.70 rows=201 width=13)
>           Sort Key: feedcode
>           ->  Seq Scan on t  (cost=0.00..4.51 rows=201 width=13)
>                 Filter: (market = 'TEST'::text)
> (5 rows)
> 
> And with the patch,
> select distinct on (market, feedcode) market, feedcode from t   where
> market='TEST';
>   market | feedcode
> --------+----------
>   TEST   | abcdef
> (1 row)
> 
> explain select distinct on (market, feedcode) market, feedcode from t
> where market='TEST';
>                                             QUERY PLAN
> ------------------------------------------------------------------------------------------------
>   Index Only Scan using t_market_feedcode_updated_at_idx on t
> (cost=0.14..0.29 rows=2 width=13)
>     Scan mode: Skip scan
>     Index Cond: (market = 'TEST'::text)
> (3 rows)
> 
> Notice that in the explain statement it shows correct number of rows
> to be skipped.
> 

Thanks for your test case; this is very helpful.

For now, I would like to highlight that

  SET enable_indexskipscan = OFF

can be used for testing with the patch applied.

Dmitry and I will look at the feedback provided.

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

03 June 2019, 20:31:33

> On Sat, Jun 1, 2019 at 6:57 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Sat, Jun 1, 2019 at 5:34 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> >
> > I did a little bit of investigation and it seems to occur because in
> > pathkeys.c the function pathkey_is_redundant considers pathkeys redundant if
> > there is an equality condition with a constant in the corresponding WHERE
> > clause.
> > ...
> > However, the index skip scan interprets this as that it has to skip over just
> > the first column.
>
> Right, passing correct number of columns fixes this particular problem. But
> while debugging I've also discovered another related issue, when the current
> implementation seems to have a few assumptions, that are not correct if we have
> an index condition and a distinct column is not the first in the index. I'll
> try to address these in a next version of the patch in the nearest future.

So, as mentioned above, there were a few problems, namely the number of
distinct_pathkeys with and without redundancy, and using _bt_search when the
order of distinct columns doesn't match the index. As far as I can see the
problem in the latter case (when we have an index condition) is that it's still
possible to find a value, but lastItem value after the search is always zero
(due to _bt_checkkeys filtering) and _bt_next stops right away.

To address this, probably we can do something like in the attached patch.
Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is
the same, but without the constants elimination. It's being used then for
getting the real number of distinct keys, and to check the order of the columns
to not consider index skip scan if it's different. Hope it doesn't
look too hacky.

Also I've noticed, that the current implementation wouldn't work e.g. for:

    select distinct a, a from table;

because in this case an IndexPath is hidden behind a ProjectionPath. For now I
guess it's fine, but probably it's possible here to apply skip scan too.

Attachment

v17-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Floris Van Nee

Date:

05 June 2019, 19:39:29

> To address this, probably we can do something like in the attached patch.
> Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is
> the same, but without the constants elimination. It's being used then for
> getting the real number of distinct keys, and to check the order of the columns
> to not consider index skip scan if it's different. Hope it doesn't
> look too hacky.
>

Thanks! I've verified that it works now.

I was wondering if we're not too strict in some cases now though. Consider the following queries:

postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2';

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------

Index Only Scan using t_m_f_t_idx on t (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1)

Scan mode: Skip scan

Index Cond: (m = 'M2'::text)

Heap Fetches: 10

Planning Time: 0.095 ms

Execution Time: 0.490 ms

(6 rows)

postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2';

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------

Unique (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1)

-> Index Only Scan using t_m_f_t_idx on t (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=10000 loops=1)

Index Cond: (m = 'M2'::text)

Heap Fetches: 10000

Planning Time: 0.078 ms

Execution Time: 10.944 ms

(6 rows)

This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency.

-Floris

Re: Index Skip Scan

From

Jesper Pedersen

Date:

13 June 2019, 16:31:54

Hi,

On 6/5/19 3:39 PM, Floris Van Nee wrote:
> Thanks! I've verified that it works now.

Here is a rebased version.

> I was wondering if we're not too strict in some cases now though. Consider the following queries:

[snip]

> This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip
scancould be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the
WHERE-clausethat 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement
forconsistency.
 
> 

I think it would be good to get more feedback on the patch in general 
before looking at further optimizations. We should of course fix any 
bugs that shows up.

Thanks for your testing and feedback !

Best regards,
  Jesper

Attachment

v18-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

James Coleman

Date:

14 June 2019, 03:40:54

I've previously noted upthread (along with several others), that I
don't see a good reason to limit this new capability to index only
scans. In addition to the reasons upthread, this also prevents using
the new feature on physical replicas since index only scans require
visibility map (IIRC) information that isn't safe to make assumptions
about on a replica.

That being said, it strikes me that this likely indicates an existing
architecture issue. I was discussing the problem at PGCon with Andres
and Heiki with respect to an index scan variation I've been working on
myself. In short, it's not clear to me why we want index only scans
and index scans to be entirely separate nodes, rather than optional
variations within a broader index scan node. The problem becomes even
more clear as we continue to add additional variants that lie on
different axis, since we end up with an ever multiplying number of
combinations.

In that discussion no one could remember why it'd been done that way,
but I'm planning to try to find the relevant threads in the archives
to see if there's anything in particular blocking combining them.

I generally dislike gating improvements like this on seemingly
tangentially related refactors, but I will make the observation that
adding the skip scan on top of such a refactored index scan node would
make this a much more obvious and complete win.

As I noted to Jesper at PGCon I'm happy to review the code in detail
also, but likely won't get to it until later this week or next week at
the earliest.

Jesper: Is there anything still on your list of things to change about
the patch? Or would now be a good time to look hard at the code?

James Coleman

Re: Index Skip Scan

From

David Rowley

Date:

14 June 2019, 07:19:49

On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Here is a rebased version.

Hi Jesper,

I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.

The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for.  This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for().  The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join.  The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.

I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys().  Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg.  I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for not since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ )  However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical which
allows us just to compare their memory address to know if two are
equal, however, if you're storing eclasses we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.

I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 1000000
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using abc_a_b_idx on abc  (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
   Scan mode: Skip scan
   Heap Fetches: 1000
 Planning Time: 5.616 ms
 Execution Time: 21.791 ms
(5 rows)


# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
                                                                QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.42..50104.43 rows=200 width=12) (actual
time=1.201..555.280 rows=1000 loops=1)
   ->  Index Scan using abc_a_b_idx on abc  (cost=0.42..47604.43
rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000
loops=1)
 Planning Time: 0.102 ms
 Execution Time: 555.407 ms
(4 rows)

[1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

Jesper Pedersen

Date:

14 June 2019, 17:19:31

Hi James,

On 6/13/19 11:40 PM, James Coleman wrote:
> I've previously noted upthread (along with several others), that I
> don't see a good reason to limit this new capability to index only
> scans. In addition to the reasons upthread, this also prevents using
> the new feature on physical replicas since index only scans require
> visibility map (IIRC) information that isn't safe to make assumptions
> about on a replica.
> 
> That being said, it strikes me that this likely indicates an existing
> architecture issue. I was discussing the problem at PGCon with Andres
> and Heiki with respect to an index scan variation I've been working on
> myself. In short, it's not clear to me why we want index only scans
> and index scans to be entirely separate nodes, rather than optional
> variations within a broader index scan node. The problem becomes even
> more clear as we continue to add additional variants that lie on
> different axis, since we end up with an ever multiplying number of
> combinations.
> 
> In that discussion no one could remember why it'd been done that way,
> but I'm planning to try to find the relevant threads in the archives
> to see if there's anything in particular blocking combining them.
> 
> I generally dislike gating improvements like this on seemingly
> tangentially related refactors, but I will make the observation that
> adding the skip scan on top of such a refactored index scan node would
> make this a much more obvious and complete win.
>

Thanks for your feedback !

> As I noted to Jesper at PGCon I'm happy to review the code in detail
> also, but likely won't get to it until later this week or next week at
> the earliest.
> 
> Jesper: Is there anything still on your list of things to change about
> the patch? Or would now be a good time to look hard at the code?
> 

It would be valuable to have test cases for your use-cases which works 
now, or should work.

I revived Thomas' patch because it covered our use-cases and saw it as a 
much needed feature.

Thanks again !

Best regards,
  Jesper

Re: Index Skip Scan

From

Jesper Pedersen

Date:

14 June 2019, 17:24:20

Hi David,

On 6/14/19 3:19 AM, David Rowley wrote:
> I read over this thread a few weeks ago while travelling back from
> PGCon. (I wish I'd read it on the outward trip instead since it would
> have been good to talk about it in person.)
> 
> First off. I think this is a pretty great feature. It certainly seems
> worthwhile working on it.
> 
> I've looked over the patch just to get a feel for how the planner part
> works and I have a few ideas to share.
> 
> The code in create_distinct_paths() I think should work a different
> way. I think it would be much better to add a new field to Path and
> allow a path to know what keys it is distinct for.  This sort of goes
> back to an idea I thought about when developing unique joins
> (9c7f5229ad) about an easier way to detect fields that a relation is
> unique for. I've been calling these "UniqueKeys" in a few emails [1].
> The idea was to tag these onto RelOptInfo to mention which columns or
> exprs a relation is unique by so that we didn't continuously need to
> look at unique indexes in all the places that call
> relation_has_unique_index_for().  The idea there was that unique joins
> would know when a join was unable to duplicate rows. If the outer side
> of a join didn't duplicate the inner side, then the join RelOptInfo
> could keep the UniqueKeys from the inner rel, and vice-versa. If both
> didn't duplicate then the join rel would obtain the UniqueKeys from
> both sides of the join.  The idea here is that this would be a better
> way to detect unique joins, and also when it came to the grouping
> planner we'd also know if the distinct or group by should be a no-op.
> DISTINCT could be skipped, and GROUP BY could do a group aggregate
> without any sort.
> 
> I think these UniqueKeys ties into this work, perhaps not adding
> UniqueKeys to RelOptInfo, but just to Path so that we create paths
> that have UniqueKeys during create_index_paths() based on some
> uniquekeys that are stored in PlannerInfo, similar to how we create
> index paths in build_index_paths() by checking if the index
> has_useful_pathkeys().  Doing it this way would open up more
> opportunities to use skip scans. For example, semi-joins and
> anti-joins could make use of them if the uniquekeys covered the entire
> join condition. With this idea, the code you've added in
> create_distinct_paths() can just search for the cheapest path that has
> the correct uniquekeys, or if none exist then just do the normal
> sort->unique or hash agg.  I'm not entirely certain how we'd instruct
> a semi/anti joined relation to build such paths, but that seems like a
> problem that could be dealt with when someone does the work to allow
> skip scans to be used for those.
> 
> Also, I'm not entirely sure that these UniqueKeys should make use of
> PathKey since there's no real need to know about pk_opfamily,
> pk_strategy, pk_nulls_first as those all just describe how the keys
> are ordered. We just need to know if they're distinct or not. All
> that's left after removing those fields is pk_eclass, so could
> UniqueKeys just be a list of EquivalenceClass? or perhaps even a
> Bitmapset with indexes into PlannerInfo->ec_classes (that might be
> premature for not since we've not yet got
> https://commitfest.postgresql.org/23/1984/ or
> https://commitfest.postgresql.org/23/2019/ )  However, if we did use
> PathKey, that does allow us to quickly check if the UniqueKeys are
> contained within the PathKeys, since pathkeys are canonical which
> allows us just to compare their memory address to know if two are
> equal, however, if you're storing eclasses we could probably get the
> same just by comparing the address of the eclass to the pathkey's
> pk_eclass.
> 
> Otherwise, I think how you're making use of paths in
> create_distinct_paths() and create_skipscan_unique_path() kind of
> contradicts how they're meant to be used.
> 

Thank you very much for this feedback ! Will need to revise the patch 
based on this.

> I also agree with James that this should not be limited to Index Only
> Scans. From testing the patch, the following seems pretty strange to
> me:
> 
> # create table abc (a int, b int, c int);
> CREATE TABLE
> # insert into abc select a,b,1 from generate_Series(1,1000) a,
> generate_Series(1,1000) b;
> INSERT 0 1000000
> # create index on abc(a,b);
> CREATE INDEX
> # explain analyze select distinct on (a) a,b from abc order by a,b; --
> this is fast.
>                                                           QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
>   Index Only Scan using abc_a_b_idx on abc  (cost=0.42..85.00 rows=200
> width=8) (actual time=0.260..20.518 rows=1000 loops=1)
>     Scan mode: Skip scan
>     Heap Fetches: 1000
>   Planning Time: 5.616 ms
>   Execution Time: 21.791 ms
> (5 rows)
> 
> 
> # explain analyze select distinct on (a) a,b,c from abc order by a,b;
> -- Add one more column and it's slow.
>                                                                  QUERY PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------
>   Unique  (cost=0.42..50104.43 rows=200 width=12) (actual
> time=1.201..555.280 rows=1000 loops=1)
>     ->  Index Scan using abc_a_b_idx on abc  (cost=0.42..47604.43
> rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000
> loops=1)
>   Planning Time: 0.102 ms
>   Execution Time: 555.407 ms
> (4 rows)
> 
> [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r
> 

Ok, understood.

I have put the CF entry into "Waiting on Author".

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

16 June 2019, 15:03:36

> On Fri, Jun 14, 2019 at 9:20 AM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> The code in create_distinct_paths() I think should work a different
> way. I think it would be much better to add a new field to Path and
> allow a path to know what keys it is distinct for.  This sort of goes
> back to an idea I thought about when developing unique joins
> (9c7f5229ad) about an easier way to detect fields that a relation is
> unique for. I've been calling these "UniqueKeys" in a few emails [1].
> The idea was to tag these onto RelOptInfo to mention which columns or
> exprs a relation is unique by so that we didn't continuously need to
> look at unique indexes in all the places that call
> relation_has_unique_index_for().  The idea there was that unique joins
> would know when a join was unable to duplicate rows. If the outer side
> of a join didn't duplicate the inner side, then the join RelOptInfo
> could keep the UniqueKeys from the inner rel, and vice-versa. If both
> didn't duplicate then the join rel would obtain the UniqueKeys from
> both sides of the join.  The idea here is that this would be a better
> way to detect unique joins, and also when it came to the grouping
> planner we'd also know if the distinct or group by should be a no-op.
> DISTINCT could be skipped, and GROUP BY could do a group aggregate
> without any sort.
>
> I think these UniqueKeys ties into this work, perhaps not adding
> UniqueKeys to RelOptInfo, but just to Path so that we create paths
> that have UniqueKeys during create_index_paths() based on some
> uniquekeys that are stored in PlannerInfo, similar to how we create
> index paths in build_index_paths() by checking if the index
> has_useful_pathkeys().  Doing it this way would open up more
> opportunities to use skip scans. For example, semi-joins and
> anti-joins could make use of them if the uniquekeys covered the entire
> join condition. With this idea, the code you've added in
> create_distinct_paths() can just search for the cheapest path that has
> the correct uniquekeys, or if none exist then just do the normal
> sort->unique or hash agg.  I'm not entirely certain how we'd instruct
> a semi/anti joined relation to build such paths, but that seems like a
> problem that could be dealt with when someone does the work to allow
> skip scans to be used for those.
>
> Also, I'm not entirely sure that these UniqueKeys should make use of
> PathKey since there's no real need to know about pk_opfamily,
> pk_strategy, pk_nulls_first as those all just describe how the keys
> are ordered. We just need to know if they're distinct or not. All
> that's left after removing those fields is pk_eclass, so could
> UniqueKeys just be a list of EquivalenceClass? or perhaps even a
> Bitmapset with indexes into PlannerInfo->ec_classes (that might be
> premature for not since we've not yet got
> https://commitfest.postgresql.org/23/1984/ or
> https://commitfest.postgresql.org/23/2019/ )  However, if we did use
> PathKey, that does allow us to quickly check if the UniqueKeys are
> contained within the PathKeys, since pathkeys are canonical which
> allows us just to compare their memory address to know if two are
> equal, however, if you're storing eclasses we could probably get the
> same just by comparing the address of the eclass to the pathkey's
> pk_eclass.

Interesting, thanks for sharing this.

> I also agree with James that this should not be limited to Index Only
> Scans. From testing the patch, the following seems pretty strange to
> me:
> ...
> explain analyze select distinct on (a) a,b from abc order by a,b;
> explain analyze select distinct on (a) a,b,c from abc order by a,b;
> ...

Yes, but I believe this limitation is not intrinsic to the idea of the patch,
and the very same approach can be used for IndexScan in the second example.
I've already prepared a new version to enable it for IndexScan with minimal
modifications, just need to rebase it on top of the latest changes and then
can post it. Although still there would be some limitations I guess (e.g. the
first thing I've stumbled upon is that an index scan with a filter wouldn't
work well, because checking qual causes with a filter happens after
ExecScanFetch)

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

19 June 2019, 13:57:29

> On Sun, Jun 16, 2019 at 5:03 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > I also agree with James that this should not be limited to Index Only
> > Scans. From testing the patch, the following seems pretty strange to
> > me:
> > ...
> > explain analyze select distinct on (a) a,b from abc order by a,b;
> > explain analyze select distinct on (a) a,b,c from abc order by a,b;
> > ...
>
> Yes, but I believe this limitation is not intrinsic to the idea of the patch,
> and the very same approach can be used for IndexScan in the second example.
> I've already prepared a new version to enable it for IndexScan with minimal
> modifications, just need to rebase it on top of the latest changes and then
> can post it. Although still there would be some limitations I guess (e.g. the
> first thing I've stumbled upon is that an index scan with a filter wouldn't
> work well, because checking qual causes with a filter happens after
> ExecScanFetch)

Here is what I was talking about, POC for an integration with index scan. About
using of create_skipscan_unique_path and suggested planner improvements, I hope
together with Jesper we can come up with something soon.

Attachment

v18-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Jesper Pedersen

Date:

20 June 2019, 13:20:39

Hi,

On 6/19/19 9:57 AM, Dmitry Dolgov wrote:
> Here is what I was talking about, POC for an integration with index scan. About
> using of create_skipscan_unique_path and suggested planner improvements, I hope
> together with Jesper we can come up with something soon.
> 

I made some minor changes, but I did move all the code in 
create_distinct_paths() under enable_indexskipscan to limit the overhead 
if skip scan isn't enabled.

Attached is v20, since the last patch should have been v19.

Best regards,
  Jesper

Attachment

v20-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Floris Van Nee

Date:

22 June 2019, 10:17:04

The following sql statement seems to have incorrect results - some logic in the backwards scan is currently not
entirelyright. 

-Floris


drop table if exists a;
create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a desc, b desc;
explain select distinct on (a) a,b from a order by a desc, b desc;

DROP TABLE
CREATE TABLE
INSERT 0 50000
CREATE INDEX
ANALYZE
 a |   b
---+-------
 5 | 10000
 5 |     1
 4 |     1
 3 |     1
 2 |     1
 1 |     1
(6 rows)

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan Backward using a_a_b_idx on a  (cost=0.29..1.45 rows=5 width=8)
   Scan mode: Skip scan
(2 rows)

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

22 June 2019, 15:44:19

> On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> The following sql statement seems to have incorrect results - some logic in
> the backwards scan is currently not entirely right.

Thanks for testing! You're right, looks like in the current implementation in
case of backwards scan there is one unnecessary extra step forward. It seems
this mistake was made, since I was concentrating only on the backward scans
with a cursor, and used not exactly correct approach to wrap up after a scan
was finished. Give me a moment, I'll tighten it up.

Re: Index Skip Scan

From

Floris Van Nee

Date:

22 June 2019, 22:15:17

> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Thanks. Looking forward to it. I think I found some other strange behavior. Given the same table as in my previous e-mail, the following queries also return inconsistent results. I spent some time trying to debug it, but can't easily pinpoint the cause. It looks like it also skips over one value too much, my guess is during _bt_skippage call in _bt_skip?

Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents on an IndexTuple? I saw there's some tools out there that can pretty print plans, but viewing tuples is more complicated I guess.

-- this one is OK

postgres=# select distinct on (a) a,b from a where b>1;

a | b

---+---

1 | 2

2 | 2

3 | 2

4 | 2

5 | 2

(5 rows)

-- this one is not OK, it seems to skip too much

postgres=# select distinct on (a) a,b from a where b=2;

a | b

---+---

1 | 2

3 | 2

5 | 2

(3 rows)

Re: Index Skip Scan

From

Peter Geoghegan

Date:

23 June 2019, 01:01:57

On Sat, Jun 22, 2019 at 3:15 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> Perhaps a question: when stepping through code in GDB, is there an easy way to pretty print for example the contents
onan IndexTuple? I saw there's some tools out there that can pretty print plans, but viewing tuples is more complicated
Iguess.
 

Try the attached patch -- it has DEBUG1 traces with the contents of
index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adopted to the patch at all,
so you'll probably need to do that.

The patch should be considered a very rough hack, for now. It leaks
memory like crazy. But I think that you'll find it helpful.

-- 
Peter Geoghegan

Attachment

0012-Index-scan-positioning-DEBUG1-instrumentation.patch

Re: Index Skip Scan

From

Floris Van Nee

Date:

23 June 2019, 11:04:30

> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely.

I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,10000), (2, 1), (2,2), ... (2,10000) until (5,10000).

Running the query

SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2).

After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:

if (ScanDirectionIsForward(dir))

/* Move back for _bt_next */

offnum = OffsetNumberPrev(offnum);

....

/* Now read the data */

if (!_bt_readpage(scan, dir, offnum))

{

* There's no actually-matching data on this page. Try to advance to

* the next page. Return false if there's no matching data at all.

LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);

if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so tuple before (2,1), which is (1,10000)). This is done, because if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2). This will be the tuple that is set, however, on the call to _bt_next, the tuple is first incremented, so we'll find (2,3) there which doesn't match our keys. This leads it to skip (2,2) in our result set.

I was wondering about something else: don't we also have another problem with updating this current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems to be possible that btgettuple returns a tuple, caller processes it and sets kill_prior_tuple to true in order to have it killed. However, then the skip scan kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here or am I missing something?

-Floris

Re: Index Skip Scan

From

Tomas Vondra

Date:

23 June 2019, 13:10:31

Hi,

I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:


1) config.sgml

I'm not sure why the enable_indexskipscan section says

    This parameter requires that <varname>enable_indexonlyscan</varname>
    is <literal>on</literal>.

I suppose it's the same thing as for enable_indexscan, and we don't say
anything like that for that GUC.


2) indices.sgml

The new section is somewhat unclear and difficult to understand, I think
it'd deserve a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic while this seems as a development page
that might disapper. But more importantly, that wiki page does not say
anything about "Loose Index scans" so is it even the right wiki page?


3) nbtsearch.c

  _bt_skip - comments are formatted incorrectly
  _bt_update_skip_scankeys - missing comment
  _bt_scankey_within_page - missing comment

4) explain.c

There are duplicate blocks of code for IndexScan and IndexOnlyScan:

    if (indexscan->skipPrefixSize > 0)
    {
        if (es->format != EXPLAIN_FORMAT_TEXT)
            ExplainPropertyInteger("Distinct Prefix", NULL,
                                   indexscan->skipPrefixSize,
                                   es);
    }

I suggest we wrap this into a function ExplainIndexSkipScanKeys() or
something like that.

Also, there's this:

    if (((IndexScan *) plan)->skipPrefixSize > 0)
    {
        ExplainPropertyText("Scan mode", "Skip scan", es);
    }

That does not make much sense - there's just a single 'scan mode' value.
So I suggest we do the same thing as for unique joins, i.e. 

    ExplainPropertyBool("Skip Scan",
                        (((IndexScan *) plan)->skipPrefixSize > 0),
                        es);


5) nodeIndexOnlyScan.c

In ExecInitIndexOnlyScan, we should initialize the ioss_ fields a bit
later, with the existing ones. This is just cosmetic issue, though.


6) nodeIndexScan.c

I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?


7) pathnode.c

I wonder how much was the costing discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.


8) execnodes.h

The comment before IndexScanState mentions new field NumDistinctKeys,
but there's no such field added to the struct.


9) pathnodes.h

I don't understand what the uniq_distinct_pathkeys comment says :-(


10) plannodes.h

The naming of the new field (skipPrefixSize) added to IndexScan and
IndexOnlyScan is clearly inconsistent with the naming convention of the
existing fields.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

24 June 2019, 11:44:14

> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> However, _bt_readpage messes things up, because it only reads tuples that
> match all the provided keys (so where b=2)

Right, the problem you've reported first had a similar origins. I'm starting to
think that probably using _bt_readpage just like that is not exactly right
thing to do, since the correct element is already found and there is no need to
check if tuples are matching after one step back. I'll try to avoid it in the
next version of patch.

> I was wondering about something else: don't we also have another problem with
> updating this current index tuple by skipping before calling
> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
> killing dead tuples while doing index scans, but by looking at the code it
> seems to be possible that btgettuple returns a tuple, caller processes it and
> sets kill_prior_tuple to true in order to have it killed. However, then the
> skip scan kicks in, which sets the current tuple to a completely different
> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
> reasoning correct here or am I missing something?

Need to check, but probably we can avoid that by setting kill_prior_tuple to
false in case of skip scan as in index_rescan.

> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I've done some initial review on v20 - just reading through the code, no
> tests at this point. Here are my comments:

Thank you!

> 2) indices.sgml
>
> The new section is somewhat unclear and difficult to understand, I think
> it'd deserve a rewording. Also, I wonder if we really should link to the
> wiki page about FSM problems. We have a couple of wiki links in the sgml
> docs, but those seem more generic while this seems as a development page
> that might disapper. But more importantly, that wiki page does not say
> anything about "Loose Index scans" so is it even the right wiki page?

Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
mentioned it too, so probably I'm going to remove it (and instead describe the
idea in a few words in the documentation itself).

> 6) nodeIndexScan.c
>
> I wonder why we even add and initialize the ioss_ fields for IndexScan
> nodes, when the skip scans require index-only scans?

Skip scans required index-only scans until recently, when the patch was updated
to incorporate the same approach for index scans too. My apologies, looks like
documentation and some commentaries are still inconsistent about this topic.

> 7) pathnode.c
>
> I wonder how much was the costing discussed. It seems to me the logic is
> fairly similar to ideas discussed in the incremental sort patch, and
> we've been discussing some weak points there. I'm not sure how much we
> need to consider those issues here.

Can you please elaborate in a few words, which issues do you mean? Is it about
non uniform distribution of distinct values? If so, I believe it's partially
addressed when we have to skip too often, by searching a next index page.
Although yeah, there is still an assumption about uniform distribution of
distinct groups at the planning time.

> 9) pathnodes.h
>
> I don't understand what the uniq_distinct_pathkeys comment says :-(

Yeah, sorry, I'll try to improve the commentaries in the next version, where
I'm going to address all the feedback.

Re: Index Skip Scan

From

Tomas Vondra

Date:

24 June 2019, 13:04:10

On Mon, Jun 24, 2019 at 01:44:14PM +0200, Dmitry Dolgov wrote:
>> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>>
>> However, _bt_readpage messes things up, because it only reads tuples that
>> match all the provided keys (so where b=2)
>
>Right, the problem you've reported first had a similar origins. I'm starting to
>think that probably using _bt_readpage just like that is not exactly right
>thing to do, since the correct element is already found and there is no need to
>check if tuples are matching after one step back. I'll try to avoid it in the
>next version of patch.
>
>> I was wondering about something else: don't we also have another problem with
>> updating this current index tuple by skipping before calling
>> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
>> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
>> killing dead tuples while doing index scans, but by looking at the code it
>> seems to be possible that btgettuple returns a tuple, caller processes it and
>> sets kill_prior_tuple to true in order to have it killed. However, then the
>> skip scan kicks in, which sets the current tuple to a completely different
>> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
>> reasoning correct here or am I missing something?
>
>Need to check, but probably we can avoid that by setting kill_prior_tuple to
>false in case of skip scan as in index_rescan.
>
>> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> I've done some initial review on v20 - just reading through the code, no
>> tests at this point. Here are my comments:
>
>Thank you!
>
>> 2) indices.sgml
>>
>> The new section is somewhat unclear and difficult to understand, I think
>> it'd deserve a rewording. Also, I wonder if we really should link to the
>> wiki page about FSM problems. We have a couple of wiki links in the sgml
>> docs, but those seem more generic while this seems as a development page
>> that might disapper. But more importantly, that wiki page does not say
>> anything about "Loose Index scans" so is it even the right wiki page?
>
>Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
>mentioned it too, so probably I'm going to remove it (and instead describe the
>idea in a few words in the documentation itself).
>
>> 6) nodeIndexScan.c
>>
>> I wonder why we even add and initialize the ioss_ fields for IndexScan
>> nodes, when the skip scans require index-only scans?
>
>Skip scans required index-only scans until recently, when the patch was updated
>to incorporate the same approach for index scans too. My apologies, looks like
>documentation and some commentaries are still inconsistent about this topic.
>

Yes, if that's the case then various bits of docs and comments are rather
misleading, ant fields in IndexScanState should be named 'iss_'.


>> 7) pathnode.c
>>
>> I wonder how much was the costing discussed. It seems to me the logic is
>> fairly similar to ideas discussed in the incremental sort patch, and
>> we've been discussing some weak points there. I'm not sure how much we
>> need to consider those issues here.
>
>Can you please elaborate in a few words, which issues do you mean? Is it about
>non uniform distribution of distinct values? If so, I believe it's partially
>addressed when we have to skip too often, by searching a next index page.
>Although yeah, there is still an assumption about uniform distribution of
>distinct groups at the planning time.
>

Right, it's mostly about what happens when the group sizes are not close
to average size. The question is what happens in such cases - how much
slower will the plan be, compared to "current" plan without a skip scan?

I don't have a very good idea of the additional overhead associated with
skip-scans - presumably it's a bit more expensive, right?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Thomas Munro

Date:

02 July 2019, 09:00:06

On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Attached is v20, since the last patch should have been v19.

I took this for a quick spin today.  The DISTINCT ON support is nice
and I think it will be very useful.  I've signed up to review it and
will have more to say later.  But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.

1.  SELECT COUNT(DISTINCT i) FROM t could benefit from this.  (Or
AVG(DISTINCT ...) or any other aggregate).  Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node.  If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases.  I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does.  I think it's probably better to consider
that in the grouping planner proper instead.  I'm not sure.

2.  SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards.  Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards.  Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20).  That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().

-- 
Thomas Munro
https://enterprisedb.com

Re: Index Skip Scan

From

David Rowley

Date:

02 July 2019, 12:27:09

On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> I took this for a quick spin today.  The DISTINCT ON support is nice
> and I think it will be very useful.  I've signed up to review it and
> will have more to say later.  But today I had a couple of thoughts
> after looking into how src/backend/optimizer/plan/planagg.c works and
> wondering how to do some more skipping tricks with the existing
> machinery.
>
> 1.  SELECT COUNT(DISTINCT i) FROM t could benefit from this.  (Or
> AVG(DISTINCT ...) or any other aggregate).  Right now you get a seq
> scan, with the sort/unique logic inside the Aggregate node.  If you
> write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
> a skip scan that is much faster in good cases.  I suppose you could
> have a process_distinct_aggregates() in planagg.c that recognises
> queries of the right form and generates extra paths a bit like
> build_minmax_path() does.  I think it's probably better to consider
> that in the grouping planner proper instead.  I'm not sure.

I think to make that happen we'd need to do a bit of an overhaul in
nodeAgg.c to allow it to make use of presorted results instead of
having the code blindly sort rows for each aggregate that has a
DISTINCT or ORDER BY.  The planner would also then need to start
requesting paths with pathkeys that suit the aggregate and also
probably dictate the order the AggRefs should be evaluated to allow
all AggRefs to be evaluated that can be for each sort order.  Once
that part is done then the aggregates could then also request paths
with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
need to be pretty careful with that one since there may be other parts
of the query that require that all rows are fed in, not just 1 row per
value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
just feed through 1 row for each "i" value, since we need only the
ones that have "z > 0".  Getting the first part of this solved is much
more important than making skip scans work here, I'd say. I think we
need to be able to walk before we can run with DISTINCT  / ORDER BY
aggs.

> 2.  SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
> you're allowed to go forwards.  Same for SELECT i, MAX(j) FROM t GROUP
> BY i if you're allowed to go backwards.  Those queries are equivalent
> to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
> (though as Floris noted, the backwards version gives the wrong answers
> with v20).  That does seem like a much more specific thing applicable
> only to MIN and MAX, and I think preprocess_minmax_aggregates() could
> be taught to handle that sort of query, building an index only scan
> path with skip scan in build_minmax_path().

For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.

The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.

[1] https://www.postgresql.org/message-id/CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

Jesper Pedersen

Date:

03 July 2019, 17:31:21

Hi Thomas and David,

Thanks for the feedback !

On 7/2/19 8:27 AM, David Rowley wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
>> I took this for a quick spin today.  The DISTINCT ON support is nice
>> and I think it will be very useful.  I've signed up to review it and
>> will have more to say later.  But today I had a couple of thoughts
>> after looking into how src/backend/optimizer/plan/planagg.c works and
>> wondering how to do some more skipping tricks with the existing
>> machinery.
>>
>> 1.  SELECT COUNT(DISTINCT i) FROM t could benefit from this.  (Or
>> AVG(DISTINCT ...) or any other aggregate).  Right now you get a seq
>> scan, with the sort/unique logic inside the Aggregate node.  If you
>> write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
>> a skip scan that is much faster in good cases.  I suppose you could
>> have a process_distinct_aggregates() in planagg.c that recognises
>> queries of the right form and generates extra paths a bit like
>> build_minmax_path() does.  I think it's probably better to consider
>> that in the grouping planner proper instead.  I'm not sure.
> 
> I think to make that happen we'd need to do a bit of an overhaul in
> nodeAgg.c to allow it to make use of presorted results instead of
> having the code blindly sort rows for each aggregate that has a
> DISTINCT or ORDER BY.  The planner would also then need to start
> requesting paths with pathkeys that suit the aggregate and also
> probably dictate the order the AggRefs should be evaluated to allow
> all AggRefs to be evaluated that can be for each sort order.  Once
> that part is done then the aggregates could then also request paths
> with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
> need to be pretty careful with that one since there may be other parts
> of the query that require that all rows are fed in, not just 1 row per
> value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
> just feed through 1 row for each "i" value, since we need only the
> ones that have "z > 0".  Getting the first part of this solved is much
> more important than making skip scans work here, I'd say. I think we
> need to be able to walk before we can run with DISTINCT  / ORDER BY
> aggs.
> 

I agree that the above is outside of scope for the first patch -- I 
think the goal should be the simple use-cases for IndexScan and 
IndexOnlyScan.

Maybe we should expand [1] with possible cases, so we don't lose track 
of the ideas.

>> 2.  SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
>> you're allowed to go forwards.  Same for SELECT i, MAX(j) FROM t GROUP
>> BY i if you're allowed to go backwards.  Those queries are equivalent
>> to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
>> (though as Floris noted, the backwards version gives the wrong answers
>> with v20).  That does seem like a much more specific thing applicable
>> only to MIN and MAX, and I think preprocess_minmax_aggregates() could
>> be taught to handle that sort of query, building an index only scan
>> path with skip scan in build_minmax_path().
> 
> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
> 

Ok.

> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.
> 

I'm looking at this, and will keep this in mind.

Thanks !

[1] https://wiki.postgresql.org/wiki/Loose_indexscan

Best regards,
  Jesper

Re: Index Skip Scan

From

David Fetter

Date:

03 July 2019, 19:46:28

On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.

Is that actually true, though? I can see unique {a, b, c} => unique
{a, b}, but for example:

a | b | c
--|---|--
1 | 2 | 3
1 | 2 | 4
1 | 2 | 5

is unique on {a, b, c} but not on {a, b}, at least as I understand the
way "unique" is used here, which is 3 distinct {a, b, c}, but only one
{a, b}.

Or I could be missing something obvious, and in that case, please
ignore.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Index Skip Scan

From

James Coleman

Date:

03 July 2019, 21:02:44

On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote:
>
> On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > The more I think about these UniqueKeys, the more I think they need to
> > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > should be equivalent to { y, x }, but with PathKeys, that's not the
> > case, since the order of each key matters. UniqueKeys equivalent
> > version of pathkeys_contained_in() would not care about the order of
> > individual keys, it would say things like, { a, b, c } is contained in
> > { b, a }, since if the path is unique on columns { b, a } then it must
> > also be unique on { a, b, c }.
>
> Is that actually true, though? I can see unique {a, b, c} => unique
> {a, b}, but for example:
>
> a | b | c
> --|---|--
> 1 | 2 | 3
> 1 | 2 | 4
> 1 | 2 | 5
>
> is unique on {a, b, c} but not on {a, b}, at least as I understand the
> way "unique" is used here, which is 3 distinct {a, b, c}, but only one
> {a, b}.
>
> Or I could be missing something obvious, and in that case, please
> ignore.

I think that example is the opposite direction of what David (Rowley)
is saying. Unique on {a, b} implies unique on {a, b, c} while you're
correct that the inverse doesn't hold.

Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
different from pathkeys.

James Coleman

Re: Index Skip Scan

From

David Rowley

Date:

03 July 2019, 22:06:11

On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote:
>
> On Wed, Jul 3, 2019 at 3:46 PM David Fetter <david@fetter.org> wrote:
> >
> > On Wed, Jul 03, 2019 at 12:27:09AM +1200, David Rowley wrote:
> > > On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >
> > > The more I think about these UniqueKeys, the more I think they need to
> > > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > > should be equivalent to { y, x }, but with PathKeys, that's not the
> > > case, since the order of each key matters. UniqueKeys equivalent
> > > version of pathkeys_contained_in() would not care about the order of
> > > individual keys, it would say things like, { a, b, c } is contained in
> > > { b, a }, since if the path is unique on columns { b, a } then it must
> > > also be unique on { a, b, c }.
> >
> > Is that actually true, though? I can see unique {a, b, c} => unique
> > {a, b}, but for example:
> >
> > a | b | c
> > --|---|--
> > 1 | 2 | 3
> > 1 | 2 | 4
> > 1 | 2 | 5
> >
> > is unique on {a, b, c} but not on {a, b}, at least as I understand the
> > way "unique" is used here, which is 3 distinct {a, b, c}, but only one
> > {a, b}.
> >
> > Or I could be missing something obvious, and in that case, please
> > ignore.
>
> I think that example is the opposite direction of what David (Rowley)
> is saying. Unique on {a, b} implies unique on {a, b, c} while you're
> correct that the inverse doesn't hold.
>
> Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
> and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
> different from pathkeys.

Yeah, exactly. A superset of the unique columns is still unique.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

David Fetter

Date:

04 July 2019, 00:41:18

On Thu, Jul 04, 2019 at 10:06:11AM +1200, David Rowley wrote:
> On Thu, 4 Jul 2019 at 09:02, James Coleman <jtc331@gmail.com> wrote:
> > I think that example is the opposite direction of what David (Rowley)
> > is saying. Unique on {a, b} implies unique on {a, b, c} while you're
> > correct that the inverse doesn't hold.
> >
> > Unique on {a, b} also implies unique on {b, a} as well as on {b, a, c}
> > and {c, a, b} and {c, b, a} and {a, c, b}, which is what makes this
> > different from pathkeys.
> 
> Yeah, exactly. A superset of the unique columns is still unique.

Thanks for clarifying!

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Index Skip Scan

From

Thomas Munro

Date:

04 July 2019, 10:59:47

On Wed, Jul 3, 2019 at 12:27 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On Tue, 2 Jul 2019 at 21:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> > 2.  SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
> > you're allowed to go forwards.  Same for SELECT i, MAX(j) FROM t GROUP
> > BY i if you're allowed to go backwards.  Those queries are equivalent
> > to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
> > (though as Floris noted, the backwards version gives the wrong answers
> > with v20).  That does seem like a much more specific thing applicable
> > only to MIN and MAX, and I think preprocess_minmax_aggregates() could
> > be taught to handle that sort of query, building an index only scan
> > path with skip scan in build_minmax_path().
>
> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.

While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query.  Hmm.  That seems quite
desirable.  I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row.  Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.

-- 
Thomas Munro
https://enterprisedb.com

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

05 July 2019, 10:54:40

> > On Sat, Jun 22, 2019 at 12:17 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> > The following sql statement seems to have incorrect results - some logic in
> > the backwards scan is currently not entirely right.
>
> Thanks for testing! You're right, looks like in the current implementation in
> case of backwards scan there is one unnecessary extra step forward. It seems
> this mistake was made, since I was concentrating only on the backward scans
> with a cursor, and used not exactly correct approach to wrap up after a scan
> was finished. Give me a moment, I'll tighten it up.

Here is finally a new version of the patch, where all the mentioned issues
seems to be fixed and the corresponding new tests should keep it like that
(I've skipped all the pubs at PostgresLondon for that). Also I've addressed the
most of feedback from Tomas, except the points about planning improvements
(which is still in our todo list). By no means it's a final result (e.g. I
guess `_bt_read_closest` must be improved), but I hope making progress
step-by-step will help anyway. Also I've fixed some, how it is popular to say
here, brain fade, where I mixed up scan directions.

> On Tue, Jul 2, 2019 at 11:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jun 21, 2019 at 1:20 AM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > Attached is v20, since the last patch should have been v19.
>
> I took this for a quick spin today.  The DISTINCT ON support is nice
> and I think it will be very useful.  I've signed up to review it and
> will have more to say later.  But today I had a couple of thoughts
> after looking into how src/backend/optimizer/plan/planagg.c works and
> wondering how to do some more skipping tricks with the existing
> machinery.
>
> On Thu, Jul 4, 2019 at 1:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I mention all this stuff not because I want us to run before we can
> walk, but because to be ready to commit the basic distinct skip scan
> feature, I think we should know approximately how it'll handle the
> future stuff we'll need.

Great, thank you! I agree with Jesper that probably some parts of this are
outside of scope for the first patch, but we definitely can take a look at what
needs to be done to make the current implementation more flexible, so the
follow up would be just natural.

Attachment

v21-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Jesper Pedersen

Date:

09 July 2019, 13:32:16

Hi,

On 7/4/19 6:59 AM, Thomas Munro wrote:
>> For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
>> }, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.
> 

David, are you thinking about something like the attached ?

Some questions.

* Do you see UniqueKey as a "complete" planner node ?
   - I didn't update the nodes/*.c files for this yet

* Is a UniqueKey with a list of EquivalenceClass best, or a list of 
UniqueKey with a single EquivalenceClass

Likely more questions around this coming -- should this be a separate 
thread ?

Based on this I'll start to update the v21 patch to use UniqueKey, and 
post a new version.

> While updating the Loose Index Scan wiki page with links to other
> products' terminology on this subject, I noticed that MySQL can
> skip-scan MIN() and MAX() in the same query.  Hmm.  That seems quite
> desirable.  I think it requires a new kind of skipping: I think you
> have to be able to skip to the first AND last key that has each
> distinct prefix, and then stick a regular agg on top to collapse them
> into one row.  Such a path would not be so neatly describable by
> UniqueKeys, or indeed by the amskip() interface in the current patch.
> I mention all this stuff not because I want us to run before we can
> walk, but because to be ready to commit the basic distinct skip scan
> feature, I think we should know approximately how it'll handle the
> future stuff we'll need.
> 

Thomas, do you have any ideas for this ? I can see that MySQL did the 
functionality in two change sets (base and function support), but like 
you said we shouldn't paint ourselves into a corner.

Feedback greatly appreciated.

Best regards,
  Jesper

Attachment

uniquekey.txt

Re: Index Skip Scan

From

Thomas Munro

Date:

10 July 2019, 02:14:06

On Wed, Jul 10, 2019 at 1:32 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> > While updating the Loose Index Scan wiki page with links to other
> > products' terminology on this subject, I noticed that MySQL can
> > skip-scan MIN() and MAX() in the same query.  Hmm.  That seems quite
> > desirable.  I think it requires a new kind of skipping: I think you
> > have to be able to skip to the first AND last key that has each
> > distinct prefix, and then stick a regular agg on top to collapse them
> > into one row.  Such a path would not be so neatly describable by
> > UniqueKeys, or indeed by the amskip() interface in the current patch.
> > I mention all this stuff not because I want us to run before we can
> > walk, but because to be ready to commit the basic distinct skip scan
> > feature, I think we should know approximately how it'll handle the
> > future stuff we'll need.
>
> Thomas, do you have any ideas for this ? I can see that MySQL did the
> functionality in two change sets (base and function support), but like
> you said we shouldn't paint ourselves into a corner.

I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both).  What we have in
the current patch is SKIP_FIRST behaviour.  Being able to choose
SKIP_FIRST or SKIP_LAST allows you do handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i.  Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top.  I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().

-- 
Thomas Munro
https://enterprisedb.com

Re: Index Skip Scan

From

Jesper Pedersen

Date:

10 July 2019, 13:10:36

Hi,

On 7/9/19 10:14 PM, Thomas Munro wrote:
>> Thomas, do you have any ideas for this ? I can see that MySQL did the
>> functionality in two change sets (base and function support), but like
>> you said we shouldn't paint ourselves into a corner.
> 
> I think amskip() could be augmented by later patches to take a
> parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
> SKIP_FIRST | SKIP_LAST (meaning please give me both).  What we have in
> the current patch is SKIP_FIRST behaviour.  Being able to choose
> SKIP_FIRST or SKIP_LAST allows you do handle i, MAX(j) GROUP BY (i)
> ORDER BY i (ie forward scan for the order, but wanting the highest key
> for each distinct prefix), and being able to choose both allows for i,
> MIN(j), MAX(j) GROUP BY i.  Earlier I thought that this scheme that
> allows you to ask for both might be incompatible with David's
> suggestion of tracking UniqueKeys in paths, but now I see that it
> isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
> and therefore require a regular agg on top.  I suspect that's OK: this
> min/max transformation stuff is highly specialised and can have
> whatever magic path-building logic is required in
> preprocess_minmax_aggregates().
> 

Ok, great.

Thanks for your feedback !

Best regards,
  Jesper

Re: Index Skip Scan

From

Floris Van Nee

Date:

10 July 2019, 14:39:55

> Here is finally a new version of the patch, where all the mentioned issues

> seems to be fixed and the corresponding new tests should keep it like that
> (I've skipped all the pubs at PostgresLondon for that).

Thanks for the new patch! Really appreciate the work you're putting into it.

I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I think the cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get this working properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There are basically four cases:

- Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to the next prefix

- Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to the previous prefix

- Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to the next prefix

These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this thread about being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll encounter the same issues.

Also, I think in planner.c, line 4831, we should actually be looking at whether uniq_distinct_pathkeys is NIL, rather than the current check for distinct_pathkeys. That'll make the planner pick the skip scan even with queries like "select distinct on (a) a,b where a=2". Currently, it doesn't pick the skip scan here, beacuse distinct_pathkeys does not contain "a" anymore. This leads to it scanning every item for a=2 even though it can stop after the first one.

I'll do some more tests with the patch.

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

10 July 2019, 14:48:43

> On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> The more I think about these UniqueKeys, the more I think they need to
> be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> should be equivalent to { y, x }, but with PathKeys, that's not the
> case, since the order of each key matters. UniqueKeys equivalent
> version of pathkeys_contained_in() would not care about the order of
> individual keys, it would say things like, { a, b, c } is contained in
> { b, a }, since if the path is unique on columns { b, a } then it must
> also be unique on { a, b, c }.

> On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> David, are you thinking about something like the attached ?
>
> Some questions.
>
> * Do you see UniqueKey as a "complete" planner node ?
>    - I didn't update the nodes/*.c files for this yet
>
> * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> UniqueKey with a single EquivalenceClass

Just for me to clarify, the idea is to replace PathKeys with a new concept of
"UniqueKey" for skip scans, right? If I see it correctly, of course

    UniqueKeys { x, y } == UniqueKeys { y, x }

from the result point of view, but the execution costs could be different due
to different values distribution. In fact there are efforts to utilize this to
produce more optimal order [1], but with UniqueKeys concept this information is
lost. Obviously it's not something that could be immediately (or even never) a
problem for skip scan functionality, but I guess it's still worth to point it
out.

> On Wed, Jul 10, 2019 at 4:40 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now.

Thanks for testing! Could you provide a test case to show what exactly is the
problem?

[1]: https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru

Re: Index Skip Scan

From

Floris Van Nee

Date:

10 July 2019, 14:52:21

> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;

set enable_indexskipscan=1; // setting this to 0 yields different results
set random_page_cost=1;
explain SELECT DISTINCT ON (a) a,b FROM a;

BEGIN;
DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;

FETCH FROM c;
FETCH BACKWARD FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

FETCH 6 FROM c;
FETCH BACKWARD 6 FROM c;

END;

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

10 July 2019, 15:00:10

> On Wed, Jul 10, 2019 at 4:52 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> > Thanks for testing! Could you provide a test case to show what exactly is the
> > problem?
>
> create table a (a int, b int, c int);
> insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
> create index on a (a,b);
> analyze a;
>
> set enable_indexskipscan=1; // setting this to 0 yields different results
> set random_page_cost=1;
> explain SELECT DISTINCT ON (a) a,b FROM a;
>
> BEGIN;
> DECLARE c SCROLL CURSOR FOR SELECT DISTINCT ON (a) a,b FROM a;
>
> FETCH FROM c;
> FETCH BACKWARD FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> FETCH 6 FROM c;
> FETCH BACKWARD 6 FROM c;
>
> END;

Ok, give me a moment, I'll check.

Re: Index Skip Scan

From

Floris Van Nee

Date:

10 July 2019, 15:00:23

> Thanks for testing! Could you provide a test case to show what exactly is the
> problem?

Note that in the case of a regular non-skip scan, this cursor backwards works because the Unique node on top does not
supportbackwards scanning at all. Therefore, when creating the cursor, the actual plan actually contains a Materialize
nodeon top of the Unique+Index Scan nodes. The 'fetch backwards' never reaches the the index scan therefore, as it just
fetchesstuff from the materialize node. 

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

10 July 2019, 15:05:49

> On Wed, Jul 10, 2019 at 5:00 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
>
> > Thanks for testing! Could you provide a test case to show what exactly is the
> > problem?
>
> Note that in the case of a regular non-skip scan, this cursor backwards works
> because the Unique node on top does not support backwards scanning at all.
> Therefore, when creating the cursor, the actual plan actually contains a
> Materialize node on top of the Unique+Index Scan nodes. The 'fetch backwards'
> never reaches the the index scan therefore, as it just fetches stuff from the
> materialize node.

Yeah, I'm aware. The last time when I was busy with cursors I've managed to
make it work as I wanted, so at that time I decided to keep it like that, even
though without skip scan it wasn't doing backwards.

Re: Index Skip Scan

From

Thomas Munro

Date:

11 July 2019, 02:49:21

On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I
thinkthe cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get
thisworking properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There
arebasically four cases: 
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to
thenext prefix 
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward' to
theprevious prefix 
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward' to
theprevious prefix 
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward' to
thenext prefix 
>
> These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this
threadabout being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll
encounterthe same issues. 

Oh, right!  So actually we already need the extra SKIP_FIRST/SKIP_LAST
argument to amskip() that I theorised about, to support DISTINCT ON.
Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
SKIP_MIN/SKIP_MAX.  If we don't add support for that, we'll have to
drop DISTINCT ON support, or use Materialize for some cases.  My vote
is: let's move forwards and add that parameter and make DISTINCT ON
work.


--
Thomas Munro
https://enterprisedb.com

Re: Index Skip Scan

From

David Rowley

Date:

11 July 2019, 05:16:35

On Thu, 11 Jul 2019 at 14:50, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> > I verified that the backwards index scan is indeed functioning now. However, I'm afraid it's not that simple, as I
thinkthe cursor case is broken now. I think having just the 'scan direction' in the btree code is not enough to get
thisworking properly, because we need to know whether we want the minimum or maximum element of a certain prefix. There
arebasically four cases: 
> >
> > - Forward Index Scan + Forward cursor: we want the minimum element within a prefix and we want to skip 'forward' to
thenext prefix 
> >
> > - Forward Index Scan + Backward cursor: we want the minimum element within a prefix and we want to skip 'backward'
tothe previous prefix 
> >
> > - Backward Index Scan + Forward cursor: we want the maximum element within a prefix and we want to skip 'backward'
tothe previous prefix 
> >
> > - Backward Index Scan + Backward cursor: we want the maximum element within a prefix and we want to skip 'forward'
tothe next prefix 
> >
> > These cases make it rather complicated unfortunately. They do somewhat tie in with the previous discussion on this
threadabout being able to skip to the min or max value. If we ever want to support a sort of minmax scan, we'll
encounterthe same issues. 
>
> Oh, right!  So actually we already need the extra SKIP_FIRST/SKIP_LAST
> argument to amskip() that I theorised about, to support DISTINCT ON.
> Or I guess it could be modelled as SKIP_LOW/SKIP_HIGH or
> SKIP_MIN/SKIP_MAX.  If we don't add support for that, we'll have to
> drop DISTINCT ON support, or use Materialize for some cases.  My vote
> is: let's move forwards and add that parameter and make DISTINCT ON
> work.

Does it not just need to know the current direction of the cursor's
scroll, then also the intended scan direction?

For the general forward direction but for a backwards cursor scroll,
we'd return the lowest value for each distinct prefix, but for the
general backwards direction (DESC case) we'd return the highest value
for each distinct prefix. Looking at IndexNext() the cursor direction
seems to be estate->es_direction and the general scan direction is
indicated by the plan's indexorderdir. Can't we just pass both of
those to index_skip() to have it decide what to do? If we also pass in
indexorderdir then index_skip() should know if it's to return the
highest or lowest value, right?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

Floris Van Nee

Date:

11 July 2019, 07:41:31

> For the general forward direction but for a backwards cursor scroll,

> we'd return the lowest value for each distinct prefix, but for the
> general backwards direction (DESC case) we'd return the highest value
> for each distinct prefix. Looking at IndexNext() the cursor direction
> seems to be estate->es_direction and the general scan direction is
> indicated by the plan's indexorderdir. Can't we just pass both of
> those to index_skip() to have it decide what to do? If we also pass in
> indexorderdir then index_skip() should know if it's to return the
> highest or lowest value, right?

Correct, with these two values correct behavior can be deduced. The implementation of this is a bit cumbersome though. Consider a case like:

SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)

Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):

1,1,1

1,1,2

1,2,2

1,2,3

2,2,1

2,2,3

3,1,1

3,1,2

3,2,2

3,2,3

Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward one-by-one until it finds a match. This match only comes at (3,1,2) however. Then it skips to the end.

If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible implementation would start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward one-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop as soon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan forward to find (1,1,2).

Perhaps anyone can think of an easier way to implement it?

I do think being able to use DISTINCT ON is very useful and it's worth the extra complications. In the future we can add even more useful skipping features to it, for example:
SELECT DISTINCT ON (a) * FROM a WHERE b =2

After skipping to the next prefix of column a, we can start a new search for (a,b)=(prefix,2) to avoid having to move one-by-one from the start of the prefix to the first matching element. There are many other useful optimizations possible. That won't have to be for this patch though :-)

-Floris

Re: Index Skip Scan

From

David Rowley

Date:

11 July 2019, 10:12:37

On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote:
> SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> Data (imagine every tuple here actually occurs 10.000 times in the index to see the benefit of skipping):
> 1,1,1
> 1,1,2
> 1,2,2
> 1,2,3
> 2,2,1
> 2,2,3
> 3,1,1
> 3,1,2
> 3,2,2
> 3,2,3
>
> Creating a cursor on this query and then moving forward, you should get (1,1,2), (3,1,2). In the current
implementationof the patch, after bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves forward
one-by-oneuntil it finds a match. This match only comes at (3,1,2) however. Then it skips to the end. 
>
> If you move the cursor backwards from the end of the cursor, you should still get (3,1,2) (1,1,2). A possible
implementationwould start at the end and do a skip to the beginning of the prefix: (3,1,1). Then it needs to move
forwardone-by-one in order to find the first matching (minimum) item (3,1,2). When it finds it, it needs to skip
backwardsto the beginning of prefix 2 (2,2,1). It needs to move forwards to find the minimum element, but should stop
assoon as it detects that the prefix doesn't match anymore (because there is no match for prefix 2, it will move all
theway from (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan
forwardto find (1,1,2). 
> Perhaps anyone can think of an easier way to implement it?

One option is just don't implement it and just change
ExecSupportsBackwardScan() so that it returns false for skip index
scans, or perhaps at least implement an index am method to allow the
planner to be able to determine if the index implementation supports
it... amcanskipbackward

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

David Rowley

Date:

11 July 2019, 11:38:19

On Thu, 11 Jul 2019 at 02:48, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Jul 2, 2019 at 2:27 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> >
> > The more I think about these UniqueKeys, the more I think they need to
> > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > should be equivalent to { y, x }, but with PathKeys, that's not the
> > case, since the order of each key matters. UniqueKeys equivalent
> > version of pathkeys_contained_in() would not care about the order of
> > individual keys, it would say things like, { a, b, c } is contained in
> > { b, a }, since if the path is unique on columns { b, a } then it must
> > also be unique on { a, b, c }.
>
> > On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> >
> > David, are you thinking about something like the attached ?
> >
> > Some questions.
> >
> > * Do you see UniqueKey as a "complete" planner node ?
> >    - I didn't update the nodes/*.c files for this yet
> >
> > * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> > UniqueKey with a single EquivalenceClass
>
> Just for me to clarify, the idea is to replace PathKeys with a new concept of
> "UniqueKey" for skip scans, right? If I see it correctly, of course
>
>     UniqueKeys { x, y } == UniqueKeys { y, x }
>
> from the result point of view, but the execution costs could be different due
> to different values distribution. In fact there are efforts to utilize this to
> produce more optimal order [1], but with UniqueKeys concept this information is
> lost. Obviously it's not something that could be immediately (or even never) a
> problem for skip scan functionality, but I guess it's still worth to point it
> out.

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.

As of now, if you look at standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

Jesper Pedersen

Date:

11 July 2019, 16:53:54

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys.  Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> The idea with UniqueKeys is that a Path can also have a non-empty List
> of UniqueKeys to define that there will be no more than 1 row with the
> same value for the Paths UniqueKey column/exprs.
> 
> As of now, if you look at standard_qp_callback() sets
> root->query_pathkeys, the idea here would be that the callback would
> also set a new List field named "query_uniquekeys" based on the
> group_pathkeys when non-empty and !root->query->hasAggs, or by using
> the distinct clause if it's non-empty. Then in build_index_paths()
> around the call to match_pathkeys_to_index() we'll probably also want
> to check if the index can support UniqueKeys that would suit the
> query_uniquekeys that were set during standard_qp_callback().
> 
> As for setting query_uniquekeys in standard_qp_callback(), this should
> be simply a matter of looping over either group_pathkeys or
> distinct_pathkeys and grabbing the pk_eclass from each key and making
> a canonical UniqueKey from that. To have these canonical you'll need
> to have a new field in PlannerInfo named canon_uniquekeys which will
> do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
> need an equivalent of make_canonical_pathkey() in uniquekey.c
> 
> With this implementation, the code that the patch adds in
> create_distinct_paths() can mostly disappear. You'd only need to look
> for a path that uniquekeys_contained_in() matches the
> root->query_uniquekeys and then just leave it to
> set_cheapest(distinct_rel); to pick the cheapest path.
> 
> It would be wasted effort to create paths with UniqueKeys if there's
> multiple non-dead base rels in the query as the final rel in
> create_distinct_paths would be a join rel, so it might be worth
> checking that before creating such index paths in build_index_paths().
> However, down the line, we could carry the uniquekeys forward into the
> join paths if the join does not duplicate rows from the other side of
> the join... That's future stuff though, not for this patch, I don't
> think.
> 
> I think a UniqueKey can just be a struct similar to PathKey, e.g be
> located in pathnodes.h around where PathKey is defined.  Likely we'll
> need a uniquekeys.c file that has the equivalent to
> pathkeys_contained_in() ... uniquekeys_contained_in(), which return
> true if arg1 is a superset of arg2 rather than check for one set being
> a prefix of another. As you mention above: UniqueKeys { x, y } ==
> UniqueKeys { y, x }.  That superset check could perhaps be optimized
> by sorting UniqueKey lists in memory address order, that'll save
> having a nested loop, but likely that's not going to be required for a
> first cut version.  This would work since you'd want UniqueKeys to be
> canonical the same as PathKeys are (Notice that compare_pathkeys()
> only checks memory addresses of pathkeys and not equals()).
> 
> I think the UniqueKey struct would only need to contain an
> EquivalenceClass field. I think all the other stuff that's in PathKey
> is irrelevant to UniqueKey.
> 

Thanks for the feedback ! I'll work on these changes for the next 
uniquekey patch.

Best regards,
  Jesper

Re: Index Skip Scan

From

Jesper Pedersen

Date:

16 July 2019, 16:53:53

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:
> The UniqueKeys idea is quite separate from pathkeys.  Currently, a
> Path can have a List of PathKeys which define the order that the
> tuples will be read from the Plan node that's created from that Path.
> The idea with UniqueKeys is that a Path can also have a non-empty List
> of UniqueKeys to define that there will be no more than 1 row with the
> same value for the Paths UniqueKey column/exprs.
> 
> As of now, if you look at standard_qp_callback() sets
> root->query_pathkeys, the idea here would be that the callback would
> also set a new List field named "query_uniquekeys" based on the
> group_pathkeys when non-empty and !root->query->hasAggs, or by using
> the distinct clause if it's non-empty. Then in build_index_paths()
> around the call to match_pathkeys_to_index() we'll probably also want
> to check if the index can support UniqueKeys that would suit the
> query_uniquekeys that were set during standard_qp_callback().
> 
> As for setting query_uniquekeys in standard_qp_callback(), this should
> be simply a matter of looping over either group_pathkeys or
> distinct_pathkeys and grabbing the pk_eclass from each key and making
> a canonical UniqueKey from that. To have these canonical you'll need
> to have a new field in PlannerInfo named canon_uniquekeys which will
> do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
> need an equivalent of make_canonical_pathkey() in uniquekey.c
> 
> With this implementation, the code that the patch adds in
> create_distinct_paths() can mostly disappear. You'd only need to look
> for a path that uniquekeys_contained_in() matches the
> root->query_uniquekeys and then just leave it to
> set_cheapest(distinct_rel); to pick the cheapest path.
> 
> It would be wasted effort to create paths with UniqueKeys if there's
> multiple non-dead base rels in the query as the final rel in
> create_distinct_paths would be a join rel, so it might be worth
> checking that before creating such index paths in build_index_paths().
> However, down the line, we could carry the uniquekeys forward into the
> join paths if the join does not duplicate rows from the other side of
> the join... That's future stuff though, not for this patch, I don't
> think.
> 
> I think a UniqueKey can just be a struct similar to PathKey, e.g be
> located in pathnodes.h around where PathKey is defined.  Likely we'll
> need a uniquekeys.c file that has the equivalent to
> pathkeys_contained_in() ... uniquekeys_contained_in(), which return
> true if arg1 is a superset of arg2 rather than check for one set being
> a prefix of another. As you mention above: UniqueKeys { x, y } ==
> UniqueKeys { y, x }.  That superset check could perhaps be optimized
> by sorting UniqueKey lists in memory address order, that'll save
> having a nested loop, but likely that's not going to be required for a
> first cut version.  This would work since you'd want UniqueKeys to be
> canonical the same as PathKeys are (Notice that compare_pathkeys()
> only checks memory addresses of pathkeys and not equals()).
> 
> I think the UniqueKey struct would only need to contain an
> EquivalenceClass field. I think all the other stuff that's in PathKey
> is irrelevant to UniqueKey.
> 

Here is a patch more in that direction.

Some questions:

1) Do we really need the UniqueKey struct ?  If it only contains the 
EquivalenceClass field then we could just have a list of those instead. 
That would make the patch simpler.

2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently 
the patch only uses canon_uniquekeys because the we attach the list 
directly on the Path node.

I'll clean the patch up based on your feedback, and then start to rebase 
v21 on it.

Thanks in advance !

Best regards,
  Jesper

Attachment

v2_uniquekey.txt

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

20 July 2019, 18:30:42

>  On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now. I
> think having just the 'scan direction' in the btree code is not enough to get
> this working properly, because we need to know whether we want the minimum or
> maximum element of a certain prefix. There are basically four cases:
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> These cases make it rather complicated unfortunately. They do somewhat tie in
> with the previous discussion on this thread about being able to skip to the
> min or max value. If we ever want to support a sort of minmax scan, we'll
> encounter the same issues.

Yes, these four cases are indeed a very good point. I've prepared a new version
of the patch, where they + an index condition and handling of situations when
it eliminated one or more unique elements are addressed. It seems fixes issues
and works also for those hypothetical examples you've mentioned above, but of
course it looks pretty complicated and I need to polish it a bit before
posting.

> On Thu, Jul 11, 2019 at 12:13 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Thu, 11 Jul 2019 at 19:41, Floris Van Nee <florisvannee@optiver.com> wrote:
> > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> > Data (imagine every tuple here actually occurs 10.000 times in the index to
> > see the benefit of skipping):
> > 1,1,1
> > 1,1,2
> > 1,2,2
> > 1,2,3
> > 2,2,1
> > 2,2,3
> > 3,1,1
> > 3,1,2
> > 3,2,2
> > 3,2,3
> >
> > Creating a cursor on this query and then moving forward, you should get
> > (1,1,2), (3,1,2). In the current implementation of the patch, after
> > bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves
> > forward one-by-one until it finds a match. This match only comes at (3,1,2)
> > however. Then it skips to the end.
> >
> > If you move the cursor backwards from the end of the cursor, you should
> > still get (3,1,2) (1,1,2). A possible implementation would start at the end
> > and do a skip to the beginning of the prefix: (3,1,1). Then it needs to
> > move forward one-by-one in order to find the first matching (minimum) item
> > (3,1,2). When it finds it, it needs to skip backwards to the beginning of
> > prefix 2 (2,2,1). It needs to move forwards to find the minimum element,
> > but should stop as soon as it detects that the prefix doesn't match anymore
> > (because there is no match for prefix 2, it will move all the way from
> > (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of
> > prefix 1: (1,1,1) and scan forward to find (1,1,2).
> > Perhaps anyone can think of an easier way to implement it?
>
> One option is just don't implement it and just change
> ExecSupportsBackwardScan() so that it returns false for skip index
> scans, or perhaps at least implement an index am method to allow the
> planner to be able to determine if the index implementation supports
> it... amcanskipbackward

Yep, it was discussed few times in this thread, and after we've discovered
(thanks to Floris) so many issues I was also one step away from implementing
this idea. But at the time time as Thomas correctly noticed, our implementation
needs to be extensible to handle future use cases, and this particular cursor
juggling seems already like a pretty good example of such "future use case". So
I hope by dealing with it we can also figure out what needs to be extensible.

> On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> Here is a patch more in that direction.
>
> Some questions:
>
> 1) Do we really need the UniqueKey struct ?  If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.
>
> 2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.
>
> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Thanks! I'll also take a look as soon an I'm finished with the last updates.

Re: Index Skip Scan

From

David Rowley

Date:

22 July 2019, 05:44:54

On Wed, 17 Jul 2019 at 04:53, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Here is a patch more in that direction.

Thanks. I've just had a look over this and it roughly what I have in mind.

Here are the comments I noted down during the review:

cost_index:

I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:

SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;

you only need to perform estimate_num_groups() on "y".

I'm really not quite sure on what exactly will be required from
amcostestimate() here.  The cost of the skip scan is not the same as
the normal scan. So other that API needs adjusted to allow the caller
to mention that we want skip scans estimated, or there needs to be
another callback.

build_index_paths:

I don't quite see where you're checking if the query's unique_keys
match what unique keys can be produced by the index.  This is done for
pathkeys with:

pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
!found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
  ForwardScanDirection);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
orderbyclauses = NIL;
orderbyclausecols = NIL;
}

Here has_useful_pathkeys() checks if the query requires some ordering.
For unique keys you'll want to do the same. You'll have set the
query's unique key requirements in standard_qp_callback().

I think basically build_index_paths() should be building index paths
with unique keys, for all indexes that can support the query's unique
keys. I'm just a bit uncertain if we need to create both a normal
index path and another path for the same index with unique keys.
Perhaps we can figure that out down the line somewhere. Maybe it's
best to build path types for now, when possible, and we can consider
later if we can skip the non-uniquekey paths.  Likely that would
require a big XXX comment to explain we need to review that before the
code makes it into core.

get_uniquekeys_for_index:

I think this needs to follow more the lead from build_index_pathkeys.
Basically, ask the index what it's pathkeys are.

standard_qp_callback:

build_group_uniquekeys & build_distinct_uniquekeys could likely be one
function that takes a list of SortGroupClause. You just either pass
the groupClause or distinctClause in.  Pretty much the UniqueKey
version of make_pathkeys_for_sortclauses().

> Some questions:
>
> 1) Do we really need the UniqueKey struct ?  If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.

I dunno about that. I understand it looks a bit pointless due to just
having one field, but perhaps we can worry about that later. If we
choose to ditch it and replace it with just an EquivalenceClass then
we can do that later.

> 2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.

canon_uniquekeys should store at most one UniqueKey per
EquivalenceClass. The reason for this is for fast comparison. We can
compare memory addresses rather than checking individual fields are
equal. Now... yeah it's true that there is only one field so far and
we could just check the pointers are equal on the EquivalenceClasses,
but I think maybe this is in the same boat as #1. Let's do it for now
so we're sticking as close to the guidelines laid out by PathKeys and
once it's all working and plugged into skip scans then we can decide
if it needs a simplification pass over the code.

> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Cool.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Index Skip Scan

From

Jesper Pedersen

Date:

22 July 2019, 17:10:47

Hi,

On 7/22/19 1:44 AM, David Rowley wrote:
> Here are the comments I noted down during the review:
> 
> cost_index:
> 
> I know you've not finished here, but I think it'll need to adjust
> tuples_fetched somehow to account for estimate_num_groups() on the
> Path's unique keys. Any Eclass with an ec_has_const = true does not
> need to be part of the estimate there as there can only be at most one
> value for these.
> 
> For example, in a query such as:
> 
> SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> 
> you only need to perform estimate_num_groups() on "y".
> 
> I'm really not quite sure on what exactly will be required from
> amcostestimate() here.  The cost of the skip scan is not the same as
> the normal scan. So other that API needs adjusted to allow the caller
> to mention that we want skip scans estimated, or there needs to be
> another callback.
> 

I think this part will become more clear once the index skip scan patch 
is rebased, as we got the uniquekeys field on the Path, and the 
indexskipprefixy info on the IndexPath node.


> build_index_paths:
> 
> I don't quite see where you're checking if the query's unique_keys
> match what unique keys can be produced by the index.  This is done for
> pathkeys with:
> 
> pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
> !found_lower_saop_clause &&
> has_useful_pathkeys(root, rel));
> index_is_ordered = (index->sortopfamily != NULL);
> if (index_is_ordered && pathkeys_possibly_useful)
> {
> index_pathkeys = build_index_pathkeys(root, index,
>    ForwardScanDirection);
> useful_pathkeys = truncate_useless_pathkeys(root, rel,
> index_pathkeys);
> orderbyclauses = NIL;
> orderbyclausecols = NIL;
> }
> 
> Here has_useful_pathkeys() checks if the query requires some ordering.
> For unique keys you'll want to do the same. You'll have set the
> query's unique key requirements in standard_qp_callback().
> 
> I think basically build_index_paths() should be building index paths
> with unique keys, for all indexes that can support the query's unique
> keys. I'm just a bit uncertain if we need to create both a normal
> index path and another path for the same index with unique keys.
> Perhaps we can figure that out down the line somewhere. Maybe it's
> best to build path types for now, when possible, and we can consider
> later if we can skip the non-uniquekey paths.  Likely that would
> require a big XXX comment to explain we need to review that before the
> code makes it into core.
> 
> get_uniquekeys_for_index:
> 
> I think this needs to follow more the lead from build_index_pathkeys.
> Basically, ask the index what it's pathkeys are.
> 
> standard_qp_callback:
> 
> build_group_uniquekeys & build_distinct_uniquekeys could likely be one
> function that takes a list of SortGroupClause. You just either pass
> the groupClause or distinctClause in.  Pretty much the UniqueKey
> version of make_pathkeys_for_sortclauses().
> 
Yeah, I'll move this part of the index skip scan patch to the unique key 
patch.

Thanks for your feedback !

Best regards,
  Jesper

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

24 July 2019, 20:49:32

> On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>
> On 7/22/19 1:44 AM, David Rowley wrote:
> > Here are the comments I noted down during the review:
> >
> > cost_index:
> >
> > I know you've not finished here, but I think it'll need to adjust
> > tuples_fetched somehow to account for estimate_num_groups() on the
> > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > need to be part of the estimate there as there can only be at most one
> > value for these.
> >
> > For example, in a query such as:
> >
> > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> >
> > you only need to perform estimate_num_groups() on "y".
> >
> > I'm really not quite sure on what exactly will be required from
> > amcostestimate() here.  The cost of the skip scan is not the same as
> > the normal scan. So other that API needs adjusted to allow the caller
> > to mention that we want skip scans estimated, or there needs to be
> > another callback.
> >
>
> I think this part will become more clear once the index skip scan patch
> is rebased, as we got the uniquekeys field on the Path, and the
> indexskipprefixy info on the IndexPath node.

Here is what I came up with to address the problems, mentioned above in this
thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
that passes an index condition is few pages away - I'll try to tackle it soon).
Just another small step forward, but I hope it's enough to rebase on top of it
planner changes.

Also I've added few tags, mostly to mention reviewers contribution.

Attachment

v22-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Kyotaro Horiguchi

Date:

25 July 2019, 11:17:37

Hello.

At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com>
> > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> >
> > On 7/22/19 1:44 AM, David Rowley wrote:
> > > Here are the comments I noted down during the review:
> > >
> > > cost_index:
> > >
> > > I know you've not finished here, but I think it'll need to adjust
> > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > need to be part of the estimate there as there can only be at most one
> > > value for these.
> > >
> > > For example, in a query such as:
> > >
> > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > >
> > > you only need to perform estimate_num_groups() on "y".
> > >
> > > I'm really not quite sure on what exactly will be required from
> > > amcostestimate() here.  The cost of the skip scan is not the same as
> > > the normal scan. So other that API needs adjusted to allow the caller
> > > to mention that we want skip scans estimated, or there needs to be
> > > another callback.
> > >
> >
> > I think this part will become more clear once the index skip scan patch
> > is rebased, as we got the uniquekeys field on the Path, and the
> > indexskipprefixy info on the IndexPath node.
> 
> Here is what I came up with to address the problems, mentioned above in this
> thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> that passes an index condition is few pages away - I'll try to tackle it soon).
> Just another small step forward, but I hope it's enough to rebase on top of it
> planner changes.
> 
> Also I've added few tags, mostly to mention reviewers contribution.

I have some comments.

+           * The order of columns in the index should be the same, as for
+           * unique distincs pathkeys, otherwise we cannot use _bt_search
+           * in the skip implementation - this can lead to a missing
+           * records.

It seems that it is enough that distinct pathkeys is contained in
index pathkeys. If it's right, that is almost checked in existing
code:

> if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))

It is perfect when needed_pathkeys is distinct_pathkeys.  So
additional check is required if and only if it is not the case.

>        if (enable_indexskipscan &&
>          IsA(path, IndexPath) &&
>          ((IndexPath *) path)->indexinfo->amcanskip &&
>          (path->pathtype == T_IndexOnlyScan ||
>           path->pathtype == T_IndexScan) &&
>          (needed_pathkeys == root->distinct_pathkeys ||
>           pathkeys_contained_in(root->distinct_pathkeys,
>                       path->pathkeys)))

path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
the check against them isn't needed. If you have concern on that,
we can add that as Assert().

I feel uncomfortable to look into indexinfo there. Couldnd't we
use indexskipprefix == -1 to signal !amcanskip from
create_index_path?


+          /*
+           * XXX: In case of index scan quals evaluation happens after
+           * ExecScanFetch, which means skip results could be fitered out
+           */

Why can't we use skipscan path if having filter condition?  If
something bad happens, the reason must be written here instead of
what we do.


+          if (path->pathtype == T_IndexScan &&
+            parse->jointree != NULL &&
+            parse->jointree->quals != NULL &&
+            ((List *)parse->jointree->quals)->length != 0)

It's better to use list_length instead of peeping inside. It
handles the NULL case as well. (The structure has recently
changed but .length is not, though.)


+     * If advancing direction is different from index direction, we must
+     * skip right away, but _bt_skip requires a starting point.

It doesn't seem needed to me. Could you elaborate on the reason
for that?


+     * If advancing direction is different from index direction, we must
+     * skip right away, but _bt_skip requires a starting point.
+     */
+    if (direction * indexonlyscan->indexorderdir < 0 &&
+      !node->ioss_FirstTupleEmitted)

I'm confused by this. "direction" there is the physical scan
direction (fwd/bwd) of index scan, which is already compensated
by indexorderdir. Thus the condition means we do that when
logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
"index direction" exactly means...)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Kyotaro Horiguchi

Date:

25 July 2019, 11:29:50

Sorry, there's a too-hard-to-read part.

At Thu, 25 Jul 2019 20:17:37 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190725.201737.192223037.horikyota.ntt@gmail.com>
> Hello.
> 
> At Wed, 24 Jul 2019 22:49:32 +0200, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcXgwDMiowOGbr7gimTY3NV-LbcwP=rbma_L56pc+9p1Xw@mail.gmail.com>
> > > On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> > >
> > > On 7/22/19 1:44 AM, David Rowley wrote:
> > > > Here are the comments I noted down during the review:
> > > >
> > > > cost_index:
> > > >
> > > > I know you've not finished here, but I think it'll need to adjust
> > > > tuples_fetched somehow to account for estimate_num_groups() on the
> > > > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > > > need to be part of the estimate there as there can only be at most one
> > > > value for these.
> > > >
> > > > For example, in a query such as:
> > > >
> > > > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> > > >
> > > > you only need to perform estimate_num_groups() on "y".
> > > >
> > > > I'm really not quite sure on what exactly will be required from
> > > > amcostestimate() here.  The cost of the skip scan is not the same as
> > > > the normal scan. So other that API needs adjusted to allow the caller
> > > > to mention that we want skip scans estimated, or there needs to be
> > > > another callback.
> > > >
> > >
> > > I think this part will become more clear once the index skip scan patch
> > > is rebased, as we got the uniquekeys field on the Path, and the
> > > indexskipprefixy info on the IndexPath node.
> > 
> > Here is what I came up with to address the problems, mentioned above in this
> > thread. It passes tests, but I haven't tested it yet more thoughtful (e.g. it
> > occurred to me, that `_bt_read_closest` probably wouldn't work, if a next key,
> > that passes an index condition is few pages away - I'll try to tackle it soon).
> > Just another small step forward, but I hope it's enough to rebase on top of it
> > planner changes.
> > 
> > Also I've added few tags, mostly to mention reviewers contribution.
> 
> I have some comments.
> 
> +           * The order of columns in the index should be the same, as for
> +           * unique distincs pathkeys, otherwise we cannot use _bt_search
> +           * in the skip implementation - this can lead to a missing
> +           * records.
> 
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:
> 
> > if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
> 
> It is perfect when needed_pathkeys is distinct_pathkeys.  So
> additional check is required if and only if it is not the case.

So I think the following change will work.

- +        /* Consider index skip scan as well */
- +        if (enable_indexskipscan &&
- +          IsA(path, IndexPath) &&
- +          ((IndexPath *) path)->indexinfo->amcanskip &&
- +          (path->pathtype == T_IndexOnlyScan ||
- +           path->pathtype == T_IndexScan) &&
- +          root->distinct_pathkeys != NIL)

+ +        if (enable_indexskipscan &&
+ +          IsA(path, IndexPath) &&
+ +          ((IndexPath *) path)->indexskipprefix >= 0 &&
+ +          (needed_pathkeys == root->distinct_pathkeys ||
+ +           pathkeys_contained_in(root->distinct_pathkeys,
+ +                       path->pathkeys)))

Additional comments on the condition above are:

> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
> 
> I feel uncomfortable to look into indexinfo there. Couldnd't we
> use indexskipprefix == -1 to signal !amcanskip from
> create_index_path?
> 
> 
> +          /*
> +           * XXX: In case of index scan quals evaluation happens after
> +           * ExecScanFetch, which means skip results could be fitered out
> +           */
> 
> Why can't we use skipscan path if having filter condition?  If
> something bad happens, the reason must be written here instead of
> what we do.
> 
> 
> +          if (path->pathtype == T_IndexScan &&
> +            parse->jointree != NULL &&
> +            parse->jointree->quals != NULL &&
> +            ((List *)parse->jointree->quals)->length != 0)
> 
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The structure has recently
> changed but .length is not, though.)
> 
> 
> +     * If advancing direction is different from index direction, we must
> +     * skip right away, but _bt_skip requires a starting point.
> 
> It doesn't seem needed to me. Could you elaborate on the reason
> for that?
> 
> 
> +     * If advancing direction is different from index direction, we must
> +     * skip right away, but _bt_skip requires a starting point.
> +     */
> +    if (direction * indexonlyscan->indexorderdir < 0 &&
> +      !node->ioss_FirstTupleEmitted)
> 
> I'm confused by this. "direction" there is the physical scan
> direction (fwd/bwd) of index scan, which is already compensated
> by indexorderdir. Thus the condition means we do that when
> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
> "index direction" exactly means...)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

28 July 2019, 19:17:14

> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I have some comments.

Thank you for the review!

> +           * The order of columns in the index should be the same, as for
> +           * unique distincs pathkeys, otherwise we cannot use _bt_search
> +           * in the skip implementation - this can lead to a missing
> +           * records.
>
> It seems that it is enough that distinct pathkeys is contained in
> index pathkeys. If it's right, that is almost checked in existing
> code:

Looks like you're right. After looking closely there seems to be an issue in
the original implementation, when we use wrong prefix_size in such cases.
Without this problem this condition is indeed enough.

> >        if (enable_indexskipscan &&
> >          IsA(path, IndexPath) &&
> >          ((IndexPath *) path)->indexinfo->amcanskip &&
> >          (path->pathtype == T_IndexOnlyScan ||
> >           path->pathtype == T_IndexScan) &&
> >          (needed_pathkeys == root->distinct_pathkeys ||
> >           pathkeys_contained_in(root->distinct_pathkeys,
> >                       path->pathkeys)))
>
> path->pathtype is always one of T_IndexOnlyScan or T_IndexScan so
> the check against them isn't needed. If you have concern on that,
> we can add that as Assert().
>
> +          if (path->pathtype == T_IndexScan &&
> +            parse->jointree != NULL &&
> +            parse->jointree->quals != NULL &&
> +            ((List *)parse->jointree->quals)->length != 0)
>
> It's better to use list_length instead of peeping inside. It
> handles the NULL case as well. (The structure has recently
> changed but .length is not, though.)

Yeah, will change both (hopefully soon)

> +          /*
> +           * XXX: In case of index scan quals evaluation happens after
> +           * ExecScanFetch, which means skip results could be fitered out
> +           */
>
> Why can't we use skipscan path if having filter condition?  If
> something bad happens, the reason must be written here instead of
> what we do.

Sorry, looks like I've failed to express this more clear in the commentary. The
point is that when an index scan (not for index only scan) has some conditions,
their evaluation happens after skipping, and I don't see any not too much
invasive way to apply skip correctly.

>
> +     * If advancing direction is different from index direction, we must
> +     * skip right away, but _bt_skip requires a starting point.
>
> It doesn't seem needed to me. Could you elaborate on the reason
> for that?

This is needed for e.g. scan with a cursor backward without an index condition.
E.g. if we have:

    1 1 2 2 3 3
    1 2 3 4 5 6

and do

    DECLARE c SCROLL CURSOR FOR
    SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;

    FETCH ALL FROM c;

we should get

    1 2 3
    1 3 5

When afterwards we do

    FETCH BACKWARD ALL FROM c;

we should get

    3 2 1
    5 2 1

If we will use _bt_next first time without _bt_skip, the first pair would be
3 6 (the first from the end of the tuples, not from the end of the cursor).

> +     * If advancing direction is different from index direction, we must
> +     * skip right away, but _bt_skip requires a starting point.
> +     */
> +    if (direction * indexonlyscan->indexorderdir < 0 &&
> +      !node->ioss_FirstTupleEmitted)
>
> I'm confused by this. "direction" there is the physical scan
> direction (fwd/bwd) of index scan, which is already compensated
> by indexorderdir. Thus the condition means we do that when
> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
> "index direction" exactly means...)

I'm not sure I follow, what do you mean by compensated? In general you're
right, as David Rowley mentioned above, indexorderdir is a general scan
direction, and direction is flipped estate->es_direction, which is a cursor
direction. The goal of this condition is catch when those two are different,
and we need to advance and read in different directions.

...

From

Kyotaro Horiguchi

Date:

29 July 2019, 08:31:20

Hello.

On 2019/07/29 4:17, Dmitry Dolgov wrote:>> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
> Yeah, will change both (hopefully soon)

Thanks. 

>> +          /*
>> +           * XXX: In case of index scan quals evaluation happens after
>> +           * ExecScanFetch, which means skip results could be fitered out
>> +           */
>>
>> Why can't we use skipscan path if having filter condition?  If
>> something bad happens, the reason must be written here instead of
>> what we do.
> 
> Sorry, looks like I've failed to express this more clear in the
> commentary. The point is that when an index scan (not for index
> only scan) has some conditions, their evaluation happens after
> skipping, and I don't see any not too much invasive way to
> apply skip correctly.

Yeah, your explanation was perfect for me. What I failed to
understand was what is expected to be done in the case. I
reconsidered and understood that:

For example, the following query:

select distinct (a, b) a, b, c from t where  c < 100;

skip scan returns one tuple for one distinct set of (a, b) with
arbitrary one of c, If the choosed c doesn't match the qual and
there is any c that matches the qual, we miss that tuple.

If this is correct, an explanation like the above might help.


>> +     * If advancing direction is different from index direction, we must
>> +     * skip right away, but _bt_skip requires a starting point.
>>
>> It doesn't seem needed to me. Could you elaborate on the reason
>> for that?
> 
> This is needed for e.g. scan with a cursor backward without an index condition.
> E.g. if we have:
> 
>      1 1 2 2 3 3
>      1 2 3 4 5 6
> 
> and do
> 
>      DECLARE c SCROLL CURSOR FOR
>      SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;
> 
>      FETCH ALL FROM c;
> 
> we should get
> 
>      1 2 3
>      1 3 5
> 
> When afterwards we do
> 
>      FETCH BACKWARD ALL FROM c;
> 
> we should get
> 
>      3 2 1
>      5 2 1
> 
> 
> If we will use _bt_next first time without _bt_skip, the first pair would be
> 3 6 (the first from the end of the tuples, not from the end of the cursor).

Thanks for the explanation. Sorry, I somehow thought that that is
right. You're right.

>> +     * If advancing direction is different from index direction, we must
>> +     * skip right away, but _bt_skip requires a starting point.
>> +     */
>> +    if (direction * indexonlyscan->indexorderdir < 0 &&
>> +      !node->ioss_FirstTupleEmitted)
>>
>> I'm confused by this. "direction" there is the physical scan
>> direction (fwd/bwd) of index scan, which is already compensated
>> by indexorderdir. Thus the condition means we do that when
>> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
>> "index direction" exactly means...)
> 
> I'm not sure I follow, what do you mean by compensated? In general you're

I meant that the "direction" is already changed to physical order
at the point.

> right, as David Rowley mentioned above, indexorderdir is a general scan
> direction, and direction is flipped estate->es_direction, which is a cursor
> direction. The goal of this condition is catch when those two are different,
> and we need to advance and read in different directions.

Mmm. Sorry and thank you for the explanation. I was
stupid. You're right. I perhaps mistook indexorderdir's
meaning. Maybe something like the following will work *for me*:p

| When we are fetching a cursor in backward direction, return the
| tuples that forward fetching should have returned. In other
| words, we return the last scanned tuple in a DISTINCT set. Skip
| to that tuple before returning the first tuple.

# Of course, I need someone to correct this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

02 August 2019, 12:14:06

> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I feel uncomfortable to look into indexinfo there. Couldnd't we
> use indexskipprefix == -1 to signal !amcanskip from
> create_index_path?

Looks like it's not that straightforward to do this only in create_index_path,
since to make this decision we need to have both parts, indexinfo and distinct
keys.

> Yeah, your explanation was perfect for me. What I failed to
> understand was what is expected to be done in the case. I
> reconsidered and understood that:
>
> For example, the following query:
>
> select distinct (a, b) a, b, c from t where  c < 100;
>
> skip scan returns one tuple for one distinct set of (a, b) with
> arbitrary one of c, If the choosed c doesn't match the qual and
> there is any c that matches the qual, we miss that tuple.
>
> If this is correct, an explanation like the above might help.

Yes, that's correct, I've added this into commentaries.

> Maybe something like the following will work *for me*:p
>
> | When we are fetching a cursor in backward direction, return the
> | tuples that forward fetching should have returned. In other
> | words, we return the last scanned tuple in a DISTINCT set. Skip
> | to that tuple before returning the first tuple.

And this too (slightly rewritten:). We will soon post the new version of patch
with updates about UniqueKey from Jesper.

Re: Index Skip Scan

From

Jesper Pedersen

Date:

02 August 2019, 16:03:09

Hi,

On 8/2/19 8:14 AM, Dmitry Dolgov wrote:
> And this too (slightly rewritten:). We will soon post the new version of patch
> with updates about UniqueKey from Jesper.
> 

Yes.

We decided to send this now, although there is still feedback from David 
that needs to be considered/acted on.

The patches can be reviewed independently, but we will send them as a 
set from now on. Development of UniqueKey will be kept separate though [1].

Note, that while UniqueKey can form the foundation of optimizations for 
GROUP BY queries it isn't the focus for this patch series. Contributions 
are very welcomed of course.

[1] https://github.com/jesperpedersen/postgres/tree/uniquekey

Best regards,
  Jesper

Attachment

Re: Index Skip Scan

From

Floris Van Nee

Date:

05 August 2019, 10:05:39

Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c

/* Consider index skip scan as well */

if (enable_indexskipscan &&

IsA(path, IndexPath) &&

((IndexPath *) path)->indexinfo->amcanskip &&

root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.

Eg. it looks like there's some mixups between index offsets and heap tid's in _bt_read_closest:

* Save the current item and the previous, even if the

* latter does not pass scan key conditions

ItemPointerData tid = prevItup->t_tid;

OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);

itemIndex++;

_bt_saveitem(so, itemIndex, offnum, itup);

itemIndex++;

The 'prevOffnum' is the offset number for the heap tid, and not the offset number for the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently, the patch just reads the closest and then doesn't consider this page at all anymore, if the first tuple skipped to turns out to be not visible. Consider the following sql to illustrate:

create table a (a int, b int, c int);

insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);

create index on a (a,b);

analyze a;

select distinct on (a) a,b from a order by a,b;

a | b

---+---

1 | 1

2 | 1

3 | 1

4 | 1

5 | 1

(5 rows)

delete from a where a=2 and b=1;

DELETE 1

select distinct on (a) a,b from a order by a,b;

a | b

---+-----

1 | 1

2 | 249 ->> this should be b=2, because we deleted a=2 && b=1. however, it doesn't consider any tuples from that page anymore and gives us the first tuple from the next page.

3 | 1

4 | 1

5 | 1

(5 rows)

-Floris

[1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

05 August 2019, 20:38:52

> On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee <florisvannee@optiver.com>
> wrote:
>
> The root->distinct_pathkeys is already filtered for redundant keys, so column
> 'a' is not in there anymore. Still, it'd be much faster to use the skip scan
> here, because a regular scan will scan all entries with a=1, even though
> we're really only interested in the first one. In previous versions, this
> would be fixed by changing the check in planner.c to use
> root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things
> change a bit now that the patch is rebased on the unique-keys patch. Would it
> be valid to change this check to root->query_uniquekeys != NIL to consider
> skip scans also for this query?

[including a commentary from Jesper]
On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Yes, the check should be for that. However, the query in question
doesn't have any query_pathkeys, and hence query_uniquekeys in
standard_qp_callback(), so therefore it isn't supported.

Your scenario is covered by one of the test cases in case the
functionality is supported. But, I think that is outside the scope of
the current patch.

> However, currently, the patch just reads the closest and then doesn't
> consider this page at all anymore, if the first tuple skipped to turns out to
> be not visible. Consider the following sql to illustrate:

For the records, the purpose of `_bt_read_closest` is not so much to reduce
amount of data we read from a page, but more to correctly handle those
situations we were discussing before with reading forward/backward in cursors,
since for that in some cases we need to remember previous values for stepping
to the next. I've limited number of items, fetched in this function just
because I was misled by having a check for dead tuples in `_bt_skip`. Of course
we can modify it to read a whole page and leave visibility check for the code
after `index_getnext_tid` (although in case if we know that all tuples on this
page are visilbe I guess it's not strictly necessary, but we still get
improvement from skipping itself).

Re: Index Skip Scan

From

Floris Van Nee

Date:

06 August 2019, 08:37:55

> Yes, the check should be for that. However, the query in question
> doesn't have any query_pathkeys, and hence query_uniquekeys in
> standard_qp_callback(), so therefore it isn't supported

> Your scenario is covered by one of the test cases in case the
> functionality is supported. But, I think that is outside the scope of
> the current patch.

Ah alright, thanks. That makes it clear why it doesn't work.
From a user point of view I think it's rather strange that
SELECT DISTINCT ON (a) a,b FROM a WHERE a BETWEEN 2 AND 2
would give a fast skip scan, even though the more likely query that someone would write
SELECT DISTINCT ON (a) a,b FROM a WHERE a=2
would not.
It is something we could be leave up to the next patch though.

Something else I just noticed which I'm just writing here for awareness; I don't think it's that pressing at the moment and can be left to another patch. When there are multiple indices on a table the planner gets confused and doesn't select an index-only skip scan even though it could. I'm guessing it just takes the first available index based on the DISTINCT clause and then doesn't look further, eg.
With an index on (a,b) and (a,c,b):
postgres=# explain select distinct on (a) a,c,b FROM a;
QUERY PLAN
--------------------------------------------------------------------
Index Scan using a_a_b_idx on a (cost=0.29..1.45 rows=5 width=12)
Skip scan mode: true
(2 rows)
-> This could be an index only scan with the (a,b,c) index.

> For the records, the purpose of `_bt_read_closest` is not so much to reduce
> amount of data we read from a page, but more to correctly handle those
> situations we were discussing before with reading forward/backward in cursors,
> since for that in some cases we need to remember previous values for stepping
> to the next. I've limited number of items, fetched in this function just
> because I was misled by having a check for dead tuples in `_bt_skip`. Of course
> we can modify it to read a whole page and leave visibility check for the code
> after `index_getnext_tid` (although in case if we know that all tuples on this
> page are visilbe I guess it's not strictly necessary, but we still get
> improvement from skipping itself).

I understand and I agree - primary purpose why we chose this function was to make it work correctly. I don't think it would be something for this patch to use the optimization of partially reading a page. My point was however, if this optimization was allowed in a future patch, it would have great performance benefits.
To fix the current patch, we'd indeed need to read the full page. It'd be good to take a close look at the implementation of this function then, because messing around with the previous/next is also not trivial. I think the current implementation also has a problem when the item that is skipped to, is the first item on the page. Eg. (this depends on page size)

postgres=# drop table if exists b; create table b as select a,b from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b); analyze b;

DROP TABLE

SELECT 1830

CREATE INDEX

ANALYZE

postgres=# select distinct on(a) a,b from b;

a | b

---+---

1 | 1

2 | 2 <-- (2,1) is the first item on the page and doesn't get selected by read_closest function. it returns the second item on page which is (2,2)

3 | 2

4 | 2

5 | 2

(5 rows)

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

28 August 2019, 15:53:59

> On Mon, Aug 5, 2019 at 10:38 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Of course we can modify it to read a whole page and leave visibility check
> for the code after `index_getnext_tid` (although in case if we know that all
> tuples on this page are visilbe I guess it's not strictly necessary, but we
> still get improvement from skipping itself).

Sorry for long delay. Here is more or less what I had in mind. After changing
read_closest to read the whole page I couldn't resist to just merge it into
readpage itself, since it's basically the same. It could raise questions about
performance and how intrusive this patch is, but I hope it's not that much of a
problem (in the worst case we can split it back). I've also added few tests for
the issue you've mentioned. Thanks again, I'm appreciate how much efforts you
put into reviewing!

Attachment

v24-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Floris Van Nee

Date:

28 August 2019, 19:31:51

> Sorry for long delay. Here is more or less what I had in mind. After changing
> read_closest to read the whole page I couldn't resist to just merge it into
> readpage itself, since it's basically the same. It could raise questions about>
> performance and how intrusive this patch is, but I hope it's not that much of a
> problem (in the worst case we can split it back). I've also added few tests for
> the issue you've mentioned. Thanks again, I'm appreciate how much efforts you
> put into reviewing!

Putting it into one function makes sense I think. Looking at the patch, I think in general there are some good improvements in there.

I'm afraid I did manage to find another incorrect query result though, having to do with the keepPrev part and skipping to the first tuple on an index page:

postgres=# drop table if exists b; create table b as select a,b::int2 b,(b%2)::int2 c from generate_series(1,5) a, generate_series(1,366) b; create index on b (a,b,c); analyze b;

DROP TABLE

SELECT 1830

CREATE INDEX

ANALYZE

postgres=# set enable_indexskipscan=1;

SET

postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;

a | b | c

---+---+---

1 | 2 | 0

2 | 4 | 0

3 | 4 | 0

4 | 4 | 0

5 | 4 | 0

(5 rows)

postgres=# set enable_indexskipscan=0;

SET

postgres=# select distinct on (a) a,b,c from b where b>=1 and c=0 order by a,b;

a | b | c

---+---+---

1 | 2 | 0

2 | 2 | 0

3 | 2 | 0

4 | 2 | 0

5 | 2 | 0

(5 rows)

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

02 September 2019, 13:28:08

> On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee@optiver.com> wrote:
>
> I'm afraid I did manage to find another incorrect query result though

Yes, it's an example of what I was mentioning before, that the current modified
implementation of `_bt_readpage`  wouldn't work well in case of going between
pages. So far it seems that the only problem we can have is when previous and
next items located on a different pages. I've checked how this issue can be
avoided, I hope I will post a new version relatively soon.

Re: Index Skip Scan

From

Alvaro Herrera

Date:

04 September 2019, 20:45:38

Surely it isn't right to add members prefixed with "ioss_" to
struct IndexScanState.

I'm surprised about this "FirstTupleEmitted" business.  Wouldn't it make
more sense to implement index_skip() to return the first tuple if the
scan is just starting?  (I know little about executor, apologies if this
is a stupid question.)

It would be good to get more knowledgeable people to review this patch.
It's clearly something we want, yet it's been there for a very long
time.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

05 September 2019, 19:20:06

> On Mon, Sep 2, 2019 at 3:28 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Aug 28, 2019 at 9:32 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> >
> > I'm afraid I did manage to find another incorrect query result though
>
> Yes, it's an example of what I was mentioning before, that the current modified
> implementation of `_bt_readpage`  wouldn't work well in case of going between
> pages. So far it seems that the only problem we can have is when previous and
> next items located on a different pages. I've checked how this issue can be
> avoided, I hope I will post a new version relatively soon.

Here is the version in which stepping between the pages works better. It seems
sufficient to fix the case you've mentioned before, but for that we need to
propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
can't say I like this solution. I have an idea that maybe it would be simpler
to teach the code after index_skip to not do `_bt_next` right after one skip
happened before. It should immediately elliminate several hacks from index skip
itself, so I'll try to pursue this idea.

> On Wed, Sep 4, 2019 at 10:45 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Thank you for checking it out!

> Surely it isn't right to add members prefixed with "ioss_" to
> struct IndexScanState.

Yeah, sorry. I've incorporated IndexScan support originally only to show that
it's possible (with some limitations), but after that forgot to clean up. Now
those fields are renamed.

> I'm surprised about this "FirstTupleEmitted" business.  Wouldn't it make
> more sense to implement index_skip() to return the first tuple if the
> scan is just starting?  (I know little about executor, apologies if this
> is a stupid question.)

I'm not entirely sure, which exactly part do you mean? Now the first tuple is
returned by `_bt_first`, how would it help if index_skip will return it?

> It would be good to get more knowledgeable people to review this patch.
> It's clearly something we want, yet it's been there for a very long
> time.

Sure, that would be nice.

Attachment

v25-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Alvaro Herrera from 2ndQuadrant

Date:

05 September 2019, 19:41:28

On 2019-Sep-05, Dmitry Dolgov wrote:

> Here is the version in which stepping between the pages works better. It seems
> sufficient to fix the case you've mentioned before, but for that we need to
> propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
> can't say I like this solution. I have an idea that maybe it would be simpler
> to teach the code after index_skip to not do `_bt_next` right after one skip
> happened before. It should immediately elliminate several hacks from index skip
> itself, so I'll try to pursue this idea.

Cool.

I think multiplying two ScanDirections to watch for a negative result is
pretty ugly:

        /*
         * If advancing direction is different from index direction, we must
         * skip right away, but _bt_skip requires a starting point.
         */
        if (direction * indexscan->indexorderdir < 0 &&
            !node->iss_FirstTupleEmitted)

Surely there's a better way to code that?

I think "scanstart" needs more documentation, both in the SGML docs as
well as the code comments surrounding it.

Please disregard my earlier comment about FirstTupleEmitted.  I was
thinking that index_skip would itself emit a tuple (ie. call some
"getnext" internally) rather than just repositioning.  There might still
be some more convenient way to represent this, but I have no immediate
advice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

22 September 2019, 13:45:01

> On Thu, Sep 5, 2019 at 9:41 PM Alvaro Herrera from 2ndQuadrant <alvherre@alvh.no-ip.org> wrote:
>
> On 2019-Sep-05, Dmitry Dolgov wrote:
>
> > Here is the version in which stepping between the pages works better. It seems
> > sufficient to fix the case you've mentioned before, but for that we need to
> > propagate keepPrev logic through `_bt_steppage` & `_bt_readnextpage`, and I
> > can't say I like this solution. I have an idea that maybe it would be simpler
> > to teach the code after index_skip to not do `_bt_next` right after one skip
> > happened before. It should immediately elliminate several hacks from index skip
> > itself, so I'll try to pursue this idea.
>
> Cool.

Here it is. Since now the code after index_skip knows whether to do
index_getnext or not, it's possible to use unmodified `_bt_readpage` /
`_bt_steppage`. To achieve that there is a flag that indicated whether or not
we were skipping to the current item (I guess it's possible to implement it
without such a flag, but the at the end result looked more ugly as for me). On
the way I've simplified few things, and all the tests we accumulated before are
still passing. I'm almost sure it's possible to implement some parts of the
code more elegant, but don't see yet how.

> I think multiplying two ScanDirections to watch for a negative result is
> pretty ugly:

Probably, but the only alternative I see to check if directions are opposite is
to check that directions come in pairs (back, forth), (forth, back). Is there
an easier way?

> I think "scanstart" needs more documentation, both in the SGML docs as
> well as the code comments surrounding it.

I was able to remove it after another round of simplification.

Attachment

v26-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Alvaro Herrera

Date:

23 September 2019, 02:02:04

On 2019-Sep-22, Dmitry Dolgov wrote:

> > I think multiplying two ScanDirections to watch for a negative result is
> > pretty ugly:
> 
> Probably, but the only alternative I see to check if directions are opposite is
> to check that directions come in pairs (back, forth), (forth, back). Is there
> an easier way?

Maybe use the ^ operator?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Kyotaro Horiguchi

Date:

24 September 2019, 08:35:47

At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190923020204.GA2781@alvherre.pgsql>
> On 2019-Sep-22, Dmitry Dolgov wrote:
> 
> > > I think multiplying two ScanDirections to watch for a negative result is
> > > pretty ugly:
> > 
> > Probably, but the only alternative I see to check if directions are opposite is
> > to check that directions come in pairs (back, forth), (forth, back). Is there
> > an easier way?
> 
> Maybe use the ^ operator?

It's not a logical operator but a bitwise arithmetic operator,
which cannot be used if the operands is guaranteed to be 0 or 1
(in integer).  In a-kind-of-standard, but hacky way, "(!a != !b)"
works as desired since ! is a logical operator.

Wouldn't we use (a && !b) || (!a && b)? Compiler will optimize it
some good way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Kyotaro Horiguchi

Date:

24 September 2019, 08:41:46

At Tue, 24 Sep 2019 17:35:47 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190924.173547.226622711.horikyota.ntt@gmail.com>
> At Sun, 22 Sep 2019 23:02:04 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190923020204.GA2781@alvherre.pgsql>
> > On 2019-Sep-22, Dmitry Dolgov wrote:
> > 
> > > > I think multiplying two ScanDirections to watch for a negative result is
> > > > pretty ugly:
> > > 
> > > Probably, but the only alternative I see to check if directions are opposite is
> > > to check that directions come in pairs (back, forth), (forth, back). Is there
> > > an easier way?
> > 
> > Maybe use the ^ operator?
> 
> It's not a logical operator but a bitwise arithmetic operator,
> which cannot be used if the operands is guaranteed to be 0 or 1
> (in integer).  In a-kind-of-standard, but hacky way, "(!a != !b)"
> works as desired since ! is a logical operator.
> 
> Wouldn't we use (a && !b) || (!a && b)? Compiler will optimize it
> some good way.

Sorry, it's not a boolean. A tristate value. From the definition
(Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
NoMovement did not exist. If it is not guranteed,

(dir1 != 0 && dir1 == -dir2) ?


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Alvaro Herrera

Date:

24 September 2019, 12:06:27

On 2019-Sep-24, Kyotaro Horiguchi wrote:

> Sorry, it's not a boolean. A tristate value. From the definition
> (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> NoMovement did not exist. If it is not guranteed,
> 
> (dir1 != 0 && dir1 == -dir2) ?

Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
definition? :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

From

Kyotaro Horiguchi

Date:

25 September 2019, 01:03:01

At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190924120627.GA12454@alvherre.pgsql>
> On 2019-Sep-24, Kyotaro Horiguchi wrote:
> 
> > Sorry, it's not a boolean. A tristate value. From the definition
> > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> > NoMovement did not exist. If it is not guranteed,
> > 
> > (dir1 != 0 && dir1 == -dir2) ?
> 
> Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> definition? :-)

Yeah, sounds good to establish it as a part of ScanDirection's
definition.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

25 September 2019, 09:34:22

> On Wed, Sep 25, 2019 at 3:03 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Tue, 24 Sep 2019 09:06:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190924120627.GA12454@alvherre.pgsql>
> > On 2019-Sep-24, Kyotaro Horiguchi wrote:
> >
> > > Sorry, it's not a boolean. A tristate value. From the definition
> > > (Back, NoMove, Forward) = (-1, 0, 1), (dir1 == -dir2) if
> > > NoMovement did not exist. If it is not guranteed,
> > >
> > > (dir1 != 0 && dir1 == -dir2) ?
> >
> > Maybe just add ScanDirectionIsOpposite(dir1, dir2) with that
> > definition? :-)
>
> Yeah, sounds good to establish it as a part of ScanDirection's
> definition.

Yep, this way looks better.

Attachment

v27-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Peter Geoghegan

Date:

02 November 2019, 18:56:08

On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> v27-0001-Index-skip-scan.patch

Some random thoughts on this:

* Is _bt_scankey_within_page() really doing the right thing within empty pages?

It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.

Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect. You can have a few hundred index entries for
each distinct value, and the page split logic added to Postgres 12 (by
commit fab25024) will still manage to "trap" each set of duplicates on
their own distinct leaf page. Leaf pages will have a high key that
looks like the values that appear on the page to the right. The goal
is for forward index scans to access the minimum number of leaf pages,
especially with low cardinality data and with multi-column indexes.
(See also: commit 29b64d1d)

A good way to see this for yourself is to get the Wisconsin Benchmark
tables (the tenk1 table and related tables from the regression tests)
populated using retail insertions. "CREATE TABLE __tenk1(like tenk1
including indexes); INSERT INTO __tenk1 SELECT * FROM tenk1;" is how I
like to set this up. Then you can see that we only access one leaf
page easily by forcing bitmap scans (i.e. "set enable* ..."), and
using "EXPLAIN (analyze, buffers) SELECT ... FROM __tenk1 WHERE ...",
where the SELECT query is a simple point lookup query (bitmap scans
happen to instrument the index buffer accesses in a way that makes it
absolutely clear how many index page buffers were touched). IIRC the
existing tenk1 indexes have no more than a few hundred duplicates for
each distinct value in all cases, so only one leaf page needs to be
accessed by simple "key = val" queries in all cases.

(I imagine that the "four" index you add in the regression test would
generally need to visit more than one leaf page for simple point
lookup queries, but in any case the high key is a useful way of
detecting a "break" in the values when indexing low cardinality data
-- these breaks are generally "aligned" to leaf page boundaries.)

I also like to visualize the keyspace of indexes when poking around at
that stuff, generally by using some of the queries that you can find
on the Wiki [1].

* The extra scankeys that you are using in most of the new nbtsearch.c
code is an insertion scankey -- not a search style scankey. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankeys per scan (for the initial positioning).

* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.

* Suggest running pgindent on the patch.

[1] https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
-- 
Peter Geoghegan

Re: Index Skip Scan

From

Peter Geoghegan

Date:

02 November 2019, 20:32:08

On Sat, Nov 2, 2019 at 11:56 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > v27-0001-Index-skip-scan.patch
>
> Some random thoughts on this:

And now some more:

* I'm confused about this code in _bt_skip():

>     /*
>      * Andvance backward but read forward. At this moment we are at the next
>      * distinct key at the beginning of the series. In case if scan just
>      * started, we can read forward without doing anything else. Otherwise find
>      * previous distinct key and the beginning of it's series and read forward
>      * from there. To do so, go back one step, perform binary search to find
>      * the first item in the series and let _bt_readpage do everything else.
>      */
>     else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
>     {
>         if (!scanstart)
>         {
>             _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
>             offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
>
>             /* One step back to find a previous value */
>             _bt_readpage(scan, dir, offnum);

Why is it okay to call _bt_drop_lock_and_maybe_pin() like this? It
looks like that will drop the lock (but not the pin) on the same
buffer that you binary search with _bt_binsrch() (since the local
variable "buf" is also the buf in "so->currPos").

* It also seems a bit odd that you assume that the scan is
"scan->xs_want_itup", but then check that condition many times. Why
bother?

* Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
rather than just unlocking the buffer directly? We'll only drop the
pin for a scan that is "!scan->xs_want_itup", which is never the case
within _bt_skip().

I think that the macros and stuff that manage pins and buffer locks in
nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
value in trying to be consistent with existing nbtsearch.c code in
ways that aren't strictly necessary.

* Not sure why you need this code after throwing an error:

>                 else
>                 {
>                     elog(ERROR, "Could not read closest index tuples: %d", offnum);
>                     pfree(so->skipScanKey);
>                     so->skipScanKey = NULL;
>                     return false;
>                 }

[1] https://www.postgresql.org/message-id/flat/CAH2-Wz=m674-RKQdCG+jCD9QGzN1Kcg-FOdYw4-j+5_PfcHbpQ@mail.gmail.com
--
Peter Geoghegan

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

03 November 2019, 16:45:47

> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> v27-0001-Index-skip-scan.patch
>
> Some random thoughts on this:

Thanks a lot for the commentaries!

> * Is _bt_scankey_within_page() really doing the right thing within empty pages?
>
> It looks like you're accidentally using the high key when the leaf
> page is empty with forward scans (assuming that the leaf page isn't
> rightmost). You'll need to think about empty pages for both forward
> and backward direction scans there.

Yes, you're right, that's an issue I need to fix.

> Actually, using the high key in some cases may be desirable, once the
> details are worked out -- the high key is actually very helpful with
> low cardinality indexes. If you populate an index with retail
> insertions (i.e. don't just do a CREATE INDEX after the table is
> populated), and use low cardinality data in the indexed columns then
> you'll see this effect.

Can you please elaborate a bit more? I see that using high key will help
a forward index scans to access the minimum number of leaf pages, but
I'm not following how is it connected to the _bt_scankey_within_page? Or
is this commentary related in general to the whole implementation?

> * The extra scankeys that you are using in most of the new nbtsearch.c
> code is an insertion scankey -- not a search style scankey. I think
> that you should try to be a bit clearer on that distinction in
> comments. It is already confusing now, but at least there is only zero
> or one insertion scankeys per scan (for the initial positioning).
>
> * There are two _bt_skip() prototypes in nbtree.h (actually, there is
> a btskip() and a _bt_skip()). I understand that the former is a public
> wrapper of the latter, but I find the naming a little bit confusing.
> Maybe rename _bt_skip() to something that is a little bit more
> suggestive of its purpose.
>
> * Suggest running pgindent on the patch.

Sure, I'll incorporate mentioned improvements into the next patch
version (hopefully soon).

> And now some more:
>
> * I'm confused about this code in _bt_skip():
>
Yeah, it shouldn't be there, but rather before _bt_next, that expects
unlocked buffer. Will fix.

> * It also seems a bit odd that you assume that the scan is
> "scan->xs_want_itup", but then check that condition many times. Why
> bother?
>
> * Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
> rather than just unlocking the buffer directly? We'll only drop the
> pin for a scan that is "!scan->xs_want_itup", which is never the case
> within _bt_skip().
>
> I think that the macros and stuff that manage pins and buffer locks in
> nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
> value in trying to be consistent with existing nbtsearch.c code in
> ways that aren't strictly necessary.

Yep, I've seen this thread, but tried to be consistent with the
surrounding core style. Probably it indeed doesn't make sense.

> * Not sure why you need this code after throwing an error:
>
> >                 else
> >                 {
> >                     elog(ERROR, "Could not read closest index tuples: %d", offnum);
> >                     pfree(so->skipScanKey);
> >                     so->skipScanKey = NULL;
> >                     return false;
> >                 }

Unfortunately this is just a leftover from a previous version. Sorry for
that, will get rid of it.

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

07 November 2019, 15:43:58

> On Sun, Nov 03, 2019 at 05:45:47PM +0100, Dmitry Dolgov wrote:
> > * The extra scankeys that you are using in most of the new nbtsearch.c
> > code is an insertion scankey -- not a search style scankey. I think
> > that you should try to be a bit clearer on that distinction in
> > comments. It is already confusing now, but at least there is only zero
> > or one insertion scankeys per scan (for the initial positioning).
> >
> > * There are two _bt_skip() prototypes in nbtree.h (actually, there is
> > a btskip() and a _bt_skip()). I understand that the former is a public
> > wrapper of the latter, but I find the naming a little bit confusing.
> > Maybe rename _bt_skip() to something that is a little bit more
> > suggestive of its purpose.
> >
> > * Suggest running pgindent on the patch.
>
> Sure, I'll incorporate mentioned improvements into the next patch
> version (hopefully soon).

Here is the new version, that addresses mentioned issues.

> > * Is _bt_scankey_within_page() really doing the right thing within empty pages?
> >
> > It looks like you're accidentally using the high key when the leaf
> > page is empty with forward scans (assuming that the leaf page isn't
> > rightmost). You'll need to think about empty pages for both forward
> > and backward direction scans there.
>
> Yes, you're right, that's an issue I need to fix.

If I didn't misunderstood something, for the purpose of this function it
makes sense to return false in the case of empty page. That's what I've
added into the patch.

> > Actually, using the high key in some cases may be desirable, once the
> > details are worked out -- the high key is actually very helpful with
> > low cardinality indexes. If you populate an index with retail
> > insertions (i.e. don't just do a CREATE INDEX after the table is
> > populated), and use low cardinality data in the indexed columns then
> > you'll see this effect.
>
> Can you please elaborate a bit more? I see that using high key will help
> a forward index scans to access the minimum number of leaf pages, but
> I'm not following how is it connected to the _bt_scankey_within_page? Or
> is this commentary related in general to the whole implementation?

This question is still open.

Attachment

v28-0001-Index-skip-scan.patch

Re: Index Skip Scan

From

Tomas Vondra

Date:

10 November 2019, 21:18:32

Hi,

I've looked at the patch again - in general it seems in pretty good
shape, all the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.

The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.

A couple more comments:


1) pathkey_is_unique

The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and hot it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that pathkey is "unique" when it's EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).


2) explain

I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

/* try not to be too chatty about this in text mode */
     if (es->format != EXPLAIN_FORMAT_TEXT ||
         (es->verbose && ((Join *) plan)->inner_unique))
         ExplainPropertyBool("Inner Unique",
                             ((Join *) plan)->inner_unique,
                             es);
     break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using non-text output format.


3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using

    DISTINCT a
    DISTINCT a,b
    DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort result of DISTINCT in any particular case, except that we don't
consider the other orderings "interesting" so we don't really consider
using the index (so no chance of using the skip scan).

That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
                                                              QUERY PLAN
             
 

-------------------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387
rows=1331loops=1)
 
    Output: a, b, c
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.106 ms
  Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
                                                         QUERY PLAN
   
 

---------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
    Output: a, c, b
    Group Key: t.a, t.c, t.b
    ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.010..355.361 rows=1000000
loops=1)
          Output: a, b, c
  Planning Time: 0.076 ms
  Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
                                                              QUERY PLAN
             
 

-------------------------------------------------------------------------------------------------------------------------------------
  Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120
rows=1331loops=1)
 
    Output: a, c, b
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.053 ms
  Execution Time: 12.632 ms
(6 rows)


This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult would it be to fix this here (not necessarily in v1).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

skipscan-review.patch

Re: Index Skip Scan

From

Jesper Pedersen

Date:

11 November 2019, 18:24:51

Hi Tomas,

On 11/10/19 4:18 PM, Tomas Vondra wrote:
> I've looked at the patch again - in general it seems in pretty good
> shape, all the issues I found are mostly minor.
>
> Firstly, I'd like to point out that not all of the things I complained
> about in my 2019/06/23 review got addressed. Those were mostly related
> to formatting and code style, and the attached patch fixes some (but
> maybe not all) of them.
>

Sorry about that !

> The patch also tweaks wording of some bits in the docs and comments that
> I found unclear. Would be nice if a native speaker could take a look.
>
> A couple more comments:
>
>
> 1) pathkey_is_unique
>
> The one additional issue I found is pathkey_is_unique - it's not really
> explained what "unique" means and hot it's different from "redundant"
> (which has quite a long explanation before pathkey_is_redundant).
>
> My understanding is that pathkey is "unique" when it's EC does not match
> an EC of another pathkey in the list. But if that's the case, then the
> function name is wrong - it does exactly the opposite (i.e. it returns
> 'true' when the pathkey is *not* unique).
>

Yeah, you are correct - forgot to move that part from the _uniquekey
version of the patch.

>
> 2) explain
>
> I wonder if we should print the "Skip scan" info always, or similarly to
> "Inner Unique" which does this:
>
> /* try not to be too chatty about this in text mode */
>      if (es->format != EXPLAIN_FORMAT_TEXT ||
>          (es->verbose && ((Join *) plan)->inner_unique))
>          ExplainPropertyBool("Inner Unique",
>                              ((Join *) plan)->inner_unique,
>                              es);
>      break;
>
> I'd do the same thing for skip scan - print it only in verbose mode, or
> when using non-text output format.
>

I think it is of benefit to see if skip scan kicks in, but used your
"Skip scan" suggestion.

>
> 3) There's an annoying limitation that for this to kick in, the order of
> expressions in the DISTINCT clause has to match the index, i.e. with
> index on (a,b,c) the skip scan will only kick in for queries using
>
>     DISTINCT a
>     DISTINCT a,b
>     DISTINCT a,b,c
>
> but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
> to sort result of DISTINCT in any particular case, except that we don't
> consider the other orderings "interesting" so we don't really consider
> using the index (so no chance of using the skip scan).
>
> That leads to pretty annoying speedups/slowdowns due to seemingly
> irrelevant changes:
>
> -- everything great, a,b,c matches an index
> test=# explain (analyze, verbose) select distinct a,b,c from t;
>                                                               QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------

>
>   Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
> rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
>     Output: a, b, c
>     Skip scan: true
>     Heap Fetches: 1331
>   Planning Time: 0.106 ms
>   Execution Time: 10.843 ms
> (6 rows)
>
> -- slow, mismatch with index
> test=# explain (analyze, verbose) select distinct a,c,b from t;
>                                                          QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------

>
>   HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual
> time=802.067..802.612 rows=1331 loops=1)
>     Output: a, c, b
>     Group Key: t.a, t.c, t.b
>     ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=1000000
> width=12) (actual time=0.010..355.361 rows=1000000 loops=1)
>           Output: a, b, c
>   Planning Time: 0.076 ms
>   Execution Time: 803.078 ms
> (7 rows)
>
> -- fast again, the extra ordering allows using the index again
> test=# explain (analyze, verbose) select distinct a,c,b from t order by
> a,b,c;
>                                                               QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------

>
>   Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25
> rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
>     Output: a, c, b
>     Skip scan: true
>     Heap Fetches: 1331
>   Planning Time: 0.053 ms
>   Execution Time: 12.632 ms
> (6 rows)
>
>
> This is a more generic issue, not specific to this patch, of course. I
> think we saw it with the incremental sort patch, IIRC. I wonder how
> difficult would it be to fix this here (not necessarily in v1).
>

Yeah, I see it as separate to this patch as well. But definitely
something that should be revisited.

Thanks for your patch ! v29 using UniqueKey attached.

Best regards,
  Jesper

Attachment

Re: Index Skip Scan

From

Jesper Pedersen

Date:

15 November 2019, 15:05:51

Hi,

On 11/11/19 1:24 PM, Jesper Pedersen wrote:
> v29 using UniqueKey attached.
>

Just a small update to the UniqueKey patch to hopefully keep CFbot happy.

Feedback, especially on the planner changes, would be greatly appreciated.

Best regards,
  Jesper

Attachment

RE: Index Skip Scan

From

Floris Van Nee

Date:

15 January 2020, 13:33:41

Hi all,

I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.

- I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:
create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b,
generate_series(1,10000)c); create index on t2 (a,b,c desc); 

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
   Skip scan: true
   Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Unique  (cost=0.43..8361.56 rows=500 width=12)
   ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
         Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index
scaneven though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather
thanwriting b=5. I understand this was a limitation of the unique-keys patch at the moment which could be addressed in
thefuture. I think for the sake of consistency it would make sense if this eventually gets addressed. 

- nodeIndexScan.c, line 126
This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the
side-effectsthis has (page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin) 

- nbsearch.c, _bt_skip, line 1440
_bt_update_skip_scankeys(scan, indexRel); This function is called twice now - once in the else {} and immediately after
thatoutside of the else. The second call can be removed I think. 

- nbtsearch.c _bt_skip line 1597
                LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
                scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?

- nbtsearch.c _bt_skip line 1440
if (BTScanPosIsValid(so->currPos) &&
        _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without have a read lock on it?

- nbtsearch.c line 1634
if (_bt_readpage(scan, indexdir, offnum))  ...
else
 error()

Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index condition,
notpart of the scan key, makes none of the keys on the page match? 

- nbtsearch.c in general
Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never
releasethe buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1]
I'vebeen trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario
-so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where page
splitsand/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page check
suchthat it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or cursor
scans.Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear example
readywhen it would go wrong. It may well be fine, but it was one of the things on my mind. 

[1] https://postgrespro.com/list/id/1566683972147.11682@Optiver.com

-Floris

Re: Index Skip Scan

From

Jesper Pedersen

Date:

20 January 2020, 19:01:20

Hi Floris,

On 1/15/20 8:33 AM, Floris Van Nee wrote:
> I reviewed the latest version of the patch. Overall some good improvements I think. Please find my feedback below.
>

Thanks for your review !

> - I think I mentioned this before - it's not that big of a deal, but it just looks weird and inconsistent to me:
> create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, generate_series(1,100) b,
generate_series(1,10000)c); create index on t2 (a,b,c desc);
 
> 
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and b<=5 order by a,b,c desc;
>                                     QUERY PLAN
> ---------------------------------------------------------------------------------
>   Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 width=12)
>     Skip scan: true
>     Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
> (3 rows)
> 
> postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 order by a,b,c desc;
>                                         QUERY PLAN
> -----------------------------------------------------------------------------------------
>   Unique  (cost=0.43..8361.56 rows=500 width=12)
>     ->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 width=12)
>           Index Cond: ((a = 2) AND (b = 5))
> (3 rows)
> 
> When doing a distinct on (params) and having equality conditions for all params, it falls back to the regular index
scaneven though there's no reason not to use the skip scan here. It's much faster to write b between 5 and 5 now rather
thanwriting b=5. I understand this was a limitation of the unique-keys patch at the moment which could be addressed in
thefuture. I think for the sake of consistency it would make sense if this eventually gets addressed.
 
> 

Agreed, that it is an improvement that should be made. I would like 
David's view on this since it relates to the UniqueKey patch.

> - nodeIndexScan.c, line 126
> This sets xs_want_itup to true in all cases (even for non skip-scans). I don't think this is acceptable given the
side-effectsthis has (page will never be unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin)
 
>

Correct - fixed.

> - nbsearch.c, _bt_skip, line 1440
> _bt_update_skip_scankeys(scan, indexRel); This function is called twice now - once in the else {} and immediately
afterthat outside of the else. The second call can be removed I think.
 
> 

Yes, removed the "else" call site.

> - nbtsearch.c _bt_skip line 1597
>                 LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
>                 scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);
> 
> This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?
> 

Yes, that needed to be changed.

> - nbtsearch.c _bt_skip line 1440
> if (BTScanPosIsValid(so->currPos) &&
>         _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> 
> Is it allowed to look at the high key / low key of the page without have a read lock on it?
> 

In case of a split the page will still contain a high key and a low key, 
so this should be ok.

> - nbtsearch.c line 1634
> if (_bt_readpage(scan, indexdir, offnum))  ...
> else
>   error()
> 
> Is it really guaranteed that a match can be found on the page itself? Isn't it possible that an extra index
condition,not part of the scan key, makes none of the keys on the page match?
 
> 

The logic for this has been changed.

> - nbtsearch.c in general
> Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to
neverrelease the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin?
[1]I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a
scenario- so it may be fine. But it would be good to consider again. One thing I was thinking of was a scenario where
pagesplits and/or compacting would happen in between returning tuples. Could this break the _bt_scankey_within_page
checksuch that it thinks the scan key is within the current page, while it actually isn't? Mainly for backward and/or
cursorscans. Forward scans shouldn't be a problem I think. Apologies for being a bit vague as I don't have a clear
exampleready when it would go wrong. It may well be fine, but it was one of the things on my mind.
 
> 

There is a BT_READ lock in place when finding the correct leaf page, or 
searching within the leaf page itself. _bt_vacuum_one_page deletes only 
LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do 
you have some feedback for this ?


Please, find the updated patches attached that Dmitry and I made.

Thanks again !

Best regards,
  Jesper

Hi Peter,

Thanks for your feedback; Dmitry has followed-up with some additional 
questions.

On 1/20/20 8:05 PM, Peter Geoghegan wrote:
>> This is definitely not okay.
> 
> I suggest that you find a way to add assertions to code like
> _bt_readpage() that verify that we do in fact have the buffer content
> lock. 

If you apply the attached patch on master it will fail the test suite; 
did you mean something else ?

Best regards,
  Jesper

Attachment

buffer_locked.txt

RE: Index Skip Scan

From

Floris Van Nee

Date:

22 January 2020, 07:50:30

>
> Could you please elaborate why does it sound like that? If I understand
> correctly, to stop a scan only "between" pages one need to use only
> _bt_readpage/_bt_steppage? Other than that there is no magic with scan
> position in the patch, so I'm not sure if I'm missing something here.

Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page
ithas locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose
wehave a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] 
We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but
unlocked.Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be
full.So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the
situationswe could be left with is: 
Leaf page 1 = [2,4]
Leaf page 2 = [5,6,8]
However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all
matchingelements in our local buffer and we'll return those. But the skip scan currently: 
a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
b) finds out that this is actually true
c) does a search on the page and returns value=4 (while it should have returned value=6)

Peter, is my understanding about the btree internals correct so far?

Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but
readbackward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and
pinon the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum
suchthat the offset number you have in your variable does not match the actual offset number of the tuple on the page
anymore.Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even
thoughthey relate to the same tuple. 


-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

22 January 2020, 16:04:41

> On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
>
> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page
ithas locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose
wehave a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
 
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but
unlocked.Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be
full.So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the
situationswe could be left with is:
 
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1.

In case if we just returned a tuple, the next action would be either
check the next page for another key or search down to the tree. Maybe
I'm missing something in your scenario, but the latter will land us on a
required page (we do not point to any leaf here), and before the former
there is a check for high/low key. Is there anything else missing?

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but
readbackward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and
pinon the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum
suchthat the offset number you have in your variable does not match the actual offset number of the tuple on the page
anymore.Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even
thoughthey relate to the same tuple.
 

Interesting point. The original idea here was to check that we're not
returned to the same position after jumping, so maybe instead of offsets
we can check a tuple we found.

Re: Index Skip Scan

From

Floris Van Nee

Date:

22 January 2020, 17:24:43

Hi Dmitry,

> > On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.

> In case if we just returned a tuple, the next action would be either
>> check the next page for another key or search down to the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same page or another key', right? In the _bt_scankey_within_page shortcut we're taking, there's no stepping to a next page involved. It just locks the page again that it previously also locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.

Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned.

The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already.

It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.

Re: Index Skip Scan

From

Peter Geoghegan

Date:

22 January 2020, 18:40:33

On Tue, Jan 21, 2020 at 9:06 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> If you apply the attached patch on master it will fail the test suite;
> did you mean something else ?

Yeah, this is exactly what I had in mind for the _bt_readpage() assertion.

As I said, it isn't a great sign that this kind of assertion is even
necessary in index access method code (code like bufmgr.c is another
matter). Usually it's just obvious that a buffer lock is held. I can't
really blame this patch for that, though. You could say the same thing
about the existing "buffer pin held" _bt_readpage() assertion. It's
good that it verifies what is actually a fragile assumption, even
though I'd prefer to not make a fragile assumption.

-- 
Peter Geoghegan

Re: Index Skip Scan

From

Peter Geoghegan

Date:

22 January 2020, 18:55:21

On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
<florisvannee@optiver.com> wrote:
> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page
ithas locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose
wehave a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] 
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but
unlocked.Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be
full.So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the
situationswe could be left with is: 
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all
matchingelements in our local buffer and we'll return those. But the skip scan currently: 
> a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> b) finds out that this is actually true
> c) does a search on the page and returns value=4 (while it should have returned value=6)
>
> Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind
when I expressed a general concern about "stopping between pages".
Processing a whole page at a time is a crucial part of how
_bt_readpage() currently deals with concurrent page splits.

Holding a buffer pin on a leaf page is only effective as an interlock
against VACUUM completely removing a tuple, which could matter with
non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but
readbackward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and
pinon the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum
suchthat the offset number you have in your variable does not match the actual offset number of the tuple on the page
anymore.Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even
thoughthey relate to the same tuple. 

If skip scan is restricted to heapkeyspace indexes (i.e. those created
on Postgres 12+), then it might be reasonable to save an index tuple,
and relocate it within the same page using a fresh binary search that
uses a scankey derived from the same index tuple -- without unsetting
scantid/the heap TID scankey attribute. I suppose that you'll need to
"find your place again" after releasing the buffer lock on a leaf page
for a time. Also, I think that this will only be safe with MVCC scans,
because otherwise the page could be concurrently deleted by VACUUM.

--
Peter Geoghegan

Re: Index Skip Scan

From

Peter Geoghegan

Date:

22 January 2020, 19:09:35

On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

Note in particular that index scans cannot return the same index tuple
twice -- processing a page at a time ensures that that cannot happen.

Can a loose index scan return the same tuple (i.e. a tuple with the
same heap TID) to the executor more than once?

-- 
Peter Geoghegan

RE: Index Skip Scan

From

Floris Van Nee

Date:

22 January 2020, 21:08:59

> Note in particular that index scans cannot return the same index tuple twice -
> - processing a page at a time ensures that that cannot happen.
> 
> Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> TID) to the executor more than once?
> 

The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a
problem.Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that
isn'table to handle this?
 

-Floris

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

22 January 2020, 21:36:03

> On Wed, Jan 22, 2020 at 05:24:43PM +0000, Floris Van Nee wrote:
>
> > > Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the
pageit has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen.
Supposewe have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
 
> > > We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned
butunlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to
befull. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of
thesituations we could be left with is:
 
> > > Leaf page 1 = [2,4]
> > > Leaf page 2 = [5,6,8]
> > > However, our scan is still pointing to leaf page 1.
> 
> > In case if we just returned a tuple, the next action would be either
> >> check the next page for another key or search down to the tree. Maybe
> 
> But it won't look at the 'next page for another key', but rather at the 'same page or another key', right? In the
_bt_scankey_within_pageshortcut we're taking, there's no stepping to a next page involved. It just locks the page again
thatit previously also locked.
 

Yep, it would look only on the same page. Not sure what do you mean by
"another key", if the current key is not found within the current page
at the first stage, we restart from the root.

> > I'm missing something in your scenario, but the latter will land us on a
> > required page (we do not point to any leaf here), and before the former
> > there is a check for high/low key. Is there anything else missing?
> 
> Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the
onlypage after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're
stillpointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so
there'sa page=2 with [5,6,8], however the ongoing index scan is unaware of that.
 
> Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan,
so->skipScanKey,so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key
ofthe page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding
item=4to be returned.
 
> 
> The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the
pagethat it's pointing to already.
 
> It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key
inthis example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer
instead.We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally
inthe buffer already. That way we never need to lock the same page twice.
 

Oh, that's what you mean. Yes, I was somehow tricked by the name of this
function and didn't notice that it checks only one boundary, so in case
of backward scan it returns wrong result. I think in the situation
you've describe it would actually not find any item on the current page
and restart from the root, but nevertheless we need to check for both
keys in _bt_scankey_within_page.

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

22 January 2020, 21:37:16

> On Wed, Jan 22, 2020 at 09:08:59PM +0000, Floris Van Nee wrote:
> > Note in particular that index scans cannot return the same index tuple twice -
> > - processing a page at a time ensures that that cannot happen.
> >
> > Can a loose index scan return the same tuple (i.e. a tuple with the same heap
> > TID) to the executor more than once?
> >
>
> The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a
problem.

Yes, it shouldn't happen.

Re: Index Skip Scan

From

Peter Geoghegan

Date:

22 January 2020, 22:23:36

On Wed, Jan 22, 2020 at 1:09 PM Floris Van Nee <florisvannee@optiver.com> wrote:
> The loose index scan shouldn't return a tuple twice. It should only be able to skip 'further', so that shouldn't be a
problem.Out of curiosity, why can't index scans return the same tuple twice? Is there something in the executor that
isn'table to handle this? 

I have no reason to believe that the executor has a problem with index
scans that return a tuple more than once, aside from the very obvious:
in general, that will often be wrong. It might not be wrong when the
scan happens to be input to a unique node anyway, or something like
that.

I'm not particularly concerned about it. Just wanted to be clear on
our assumptions for loose index scans -- if loose index scans were
allowed to return a tuple more than once, then that would at least
have to at least be considered in the wider context of the executor
(but apparently they're not, so no need to worry about it). This may
have been mentioned somewhere already. If it is then I must have
missed it.

--
Peter Geoghegan

Re: Index Skip Scan

From

Peter Geoghegan

Date:

22 January 2020, 23:13:21

On Wed, Jan 22, 2020 at 1:35 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

I suggest reading the nbtree README file's description of backwards
scans. Read the paragraph that begins with 'We support the notion of
an ordered "scan" of an index...'. I also suggest that you read a bit
of the stuff in the large section on page deletion. Certainly read the
paragraph that begins with 'Moving left in a backward scan is
complicated because...'.

It's important to grok why it's okay that we don't "couple" or "crab"
buffer locks as we descend the tree with Lehman & Yao's design -- we
can get away with having *no* interlock against page splits (e.g.,
pin, buffer lock) when we are "between" levels of the tree. This is
safe, since the page that we land on must still be "substantively the
same page", no matter how much time passes. That is, it must at least
cover the leftmost portion of the keyspace covered by the original
version of the page that we saw that we needed to descend to within
the parent page. The worst that can happen is that we have to recover
from a concurrent page split by moving right one or more times.
(Actually, page deletion can change the contents of a page entirely,
but that's not really an exception to the general rule -- page
deletion is careful about recycling pages that an in flight index scan
might land on.)

Lehman & Yao don't have backwards scans (or left links, or page
deletion). Unlike nbtree. This is why the usual Lehman & Yao
guarantees don't quite work with backward scans. We must therefore
compensate as described by the README file (basically, we check and
re-check for races, possibly returning to the original page when we
think that we might have overlooked something and need to make sure).
It's an exception to the general rule, you could say.

--
Peter Geoghegan

Re: Index Skip Scan

From

Peter Geoghegan

Date:

23 January 2020, 00:30:10

On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
> <florisvannee@optiver.com> wrote:
> > As far as I know, a page split could happen at any random element in the page. One of the situations we could be
leftwith is:

> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read
allmatching elements in our local buffer and we'll return those. But the skip scan currently:

> > a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead or page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
-- 
Peter Geoghegan

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

27 January 2020, 10:30:16

> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote:
>
> > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's
theonly page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example.
We'restill pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right,
sothere's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.

> > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan,
so->skipScanKey,so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key
ofthe page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding
item=4to be returned.

> >
> > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the
pagethat it's pointing to already.

> > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key
inthis example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer
instead.We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally
inthe buffer already. That way we never need to lock the same page twice.

>
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.

Thanks again everyone for commentaries and clarification. Here is the
version, where hopefully I've addressed all the mentioned issues.

As mentioned in the _bt_skip commentaries, before we were moving left to
check the next page to avoid significant issues in case if ndistinct was
underestimated and we need to skip too often. To make it work safe in
presense of splits we need to remember an original page and move right
again until we find a page with the right link pointing to it. It's not
clear whether it's worth to increase complexity for such sort of "edge
case" with ndistinct estimation while moving left, so at least for now
we ignore this in the implementation and just start from the root
immediately.

Offset based code in moving forward/reading backward was replaced with
remembering a start index tuple and an attempt to find it on the new
page. Also a missing page lock before _bt_scankey_within_page was added
and _bt_scankey_within_page checks for both page boundaries.

> On Thu, Feb 06, 2020 at 09:22:20PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 6 Feb 2020 11:57:07 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in 
> > > On Thu, Feb 06, 2020 at 10:24:50AM +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 5 Feb 2020 17:37:30 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
> > > We could add an additional parameter "in_cursor" to
> > > ExecSupportBackwardScan and let skip scan return false if in_cursor is
> > > true, but I'm not sure it's acceptable.
> > 
> > I also was thinking about whether it's possible to use
> > ExecSupportBackwardScan here, but skip scan is just a mode of an
> > index/indexonly scan. Which means that ExecSupportBackwardScan also need
> > to know somehow if this mode is being used, and then, since this
> > function is called after it's already decided to use skip scan in the
> > resulting plan, somehow correct the plan (exclude skipping and try to
> > find next best path?) - do I understand your suggestion correct?
> 
> I didn't thought so hardly, but a bit of confirmation told me that
> IndexSupportsBackwardScan returns fixed flag for AM.  It seems that
> things are not that simple.

Yes, I've mentioned that already in one of the previous emails :) The
simplest way I see to achieve what we want is to do something like in
attached modified version with a new hasDeclaredCursor field. It's not a
final version though, but posted just for discussion, so feel free to
suggest any improvements or alternatives.

> On Fri, Feb 14, 2020 at 01:18:20PM +0100, Dmitry Dolgov wrote:
> > On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote:
> > The first attached (renamed to .txt not to confuse the cfbots) is a
> > small patch that makes sure if _bt_readpage is called with the proper
> > condition as written in its comment, that is, caller must have pinned
> > and read-locked so->currPos.buf. This patch reveals many instances of
> > breakage of the contract.
>
> Thanks! On top of which patch version one can apply it? I'm asking
> because I believe I've addressed similar issues in the last version, and
> the last proposed diff (after resolving some conflicts) breaks tests for
> me, so not sure if I miss something.
>
> At the same time if you and Tomas strongly agree that it actually makes
> sense to make moving forward/reading backward case work with dead tuples
> correctly, I'll take a shot and try to teach the code around _bt_skip to
> do what is required for that. I can merge your changes there and we can
> see what would be the result.

Here is something similar to what I had in mind. In this version of the
patch IndexOnlyNext now verify if we returned to the same position as
before while reading in opposite to the advancing direction due to
visibility checks (similar to what is implemented inside _bt_skip for
the situation when some distinct keys being eliminated due to scankey
conditions). It's actually not that invasive as I feared, but still
pretty hacky. I'm not sure if it's ok to compare resulting heaptid in
this situation, but all the mention tests are passing. Also, this version
doesn't incorporate any planner feedback from Tomas yet, my intention is
just to check if it could be the right direction.

I think the UniqueKeys may need to be changed from using
EquivalenceClasses to use Exprs instead.

When I try to understand why UniqueKeys needs EquivalenceClasses,

see your comments here. I feel that FuncExpr can't be

used to as a UniquePath even we can create unique index on f(a)

and f->strict == true. The reason is even we know a is not null,

f->strict = true. it is still be possible that f(a) == null. unique index

allows more than 1 null values. so shall we move further to use varattrno

instead of Expr? if so, we can also use a list of Bitmapset to present multi

unique path of a single RelOptInfo.

Re: Index Skip Scan

From

Andy Fan

Date:

11 March 2020, 10:56:09

On Tue, Mar 10, 2020 at 4:32 AM James Coleman <jtc331@gmail.com> wrote:

On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Assuming we'll implement it in a way that we do not know about what kind
> of path type is that in create_distinct_path, then it can also work for
> ProjectionPath or anything else (if UniqueKeys are present). But then
> still EquivalenceMember are used only to figure out correct
> distinctPrefixKeys and do not affect whether or not skipping is applied.
> What do I miss?

Part of the puzzle seems to me to this part of the response:

> I think the UniqueKeys may need to be changed from using
> EquivalenceClasses to use Exprs instead.

But I can't say I'm being overly helpful by pointing that out, since I
don't have my head in the code enough to understand how you'd
accomplish that :)

There was a dedicated thread [1] where David explain his idea very detailed,

and you can also check that messages around that message for the context.

hope it helps.

[1] https://www.postgresql.org/message-id/CAApHDvq7i0%3DO97r4Y1pv68%2BtprVczKsXRsV28rM9H-rVPOfeNQ%40mail.gmail.com

Re: Index Skip Scan

From

David Rowley

Date:

12 March 2020, 02:08:18

On Wed, 11 Mar 2020 at 16:44, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>>
>> I think the UniqueKeys may need to be changed from using
>> EquivalenceClasses to use Exprs instead.
>
>
> When I try to understand why UniqueKeys needs EquivalenceClasses,
> see your comments here.  I feel that  FuncExpr can't be
> used to as a UniquePath even we can create unique index on f(a)
> and f->strict == true.  The reason is even we know a is not null,
>  f->strict = true.  it is still be possible that f(a) == null.  unique index
> allows more than 1 null values.  so shall we move further to use varattrno
> instead of Expr?  if so,  we can also use a list of Bitmapset to present multi
> unique path of a single RelOptInfo.

We do need some method to determine if NULL values are possible. At
the base relation level that can probably be done by checking NOT NULL
constraints and strict base quals. At higher levels, we can use strict
join quals as proofs.

As for bit a Bitmapset of varattnos, that would certainly work well at
the base relation level when there are no unique expression indexes,
but it's not so simple with join relations when the varattnos only
mean something when you know which base relation it comes from.  I'm
not saying that Lists of Exprs is ideal, but I think trying to
optimise some code that does not yet exist is premature.

There was some other talk in [1] on how we might make checking if a
List contains a given Node.  That could be advantageous in a few
places in the query planner, and it might be useful for this too.

[1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com

RE: Index Skip Scan

From

Floris Van Nee

Date:

21 March 2020, 22:00:01

Hello hackers,

Recently I've put some effort in extending the functionality of this patch. So far, we've been trying to keep the scope
ofthis patch relatively small to DISTINCT-clauses only. The advantage of this approach was that it keeps impact to the
indexamapi to a minimum. However, given the problems we've been facing in getting the implementation to work correctly
inall cases, I started wondering if this implementation was the right direction to go in. My main worry is that the
currentindexam api for skipping is not suited to other future use cases of skipping, but also that we're already
strugglingwith it now to get it to work correctly in all edge cases.
 

In the approach taken so far, the amskip function is defined taking two ScanDirection parameters. The function
amgettupleis left unchanged. However, I think we need amgettuple to take two ScanDirection parameters as well (or
createa separate function amgetskiptuple). This patch proposes that.
 

Currently, I've just added 'skip' functions to the indexam api for beginscan and gettuple. Maybe it'd be better to just
modifythe existing functions to take an extra parameter instead. Any thoughts on this?
 

The result is a patch that can apply skipping in many more cases than previous patches. For example, filtering on
columnsthat are not part of the index, properly handling visibility checks without moving these into the nbtree code,
skippingnot only on prefix but also on extra conditions that become available (eg. prefix a=1 and we happen to have a
WHEREclause with b=200, which we can now use to skip all the way to a=1 AND b=200). There's a fair amount of changes in
thenbtree code to support this.
 

Patch 0001 is Jesper's unique keys patch.
Patch 0002 modifies executor-level code to support skip scans and enables it for DISTINCT queries.
Patch 0003 just provides a very basic planner hack that enables the skip scan for practically all index scans (index,
indexonly and bitmap index). This is not what we would normally want, but this way I could easily test the skip scan
code.It's so large because I modify all the test cases expected results that now include an extra line 'Skip scan:
All'.The actual code changed is only a few lines in this patch.
 

The planner part of the code still needs work. The planner code in this patch is similar to the previous patch. David's
commentsabout projection paths haven't been addressed yet. Also, there's no proper way of hooking up the index scan for
regular(non-DISTINCT) queries yet. That's why I hacked up patch 0003 just to test stuff.
 

I'd welcome any help on these patches. If someone with more planner knowledge than me is willing to do part of the
plannercode, please feel free to do so. I believe supporting this will speed up a large number of queries for all kinds
ofusers. It can be a really powerful feature.
 

Tomas, would you be willing to repeat the performance tests you did earlier? I believe this version will perform better
thanthe previous patch for the cases where you noticed the 10-20x slow-down. There will obviously still be a
performancepenalty for cases where the planner picks a skip scan that are not well suited, but I think it'll be
smaller.

-Floris

-----
To give a few simple examples:

Initialization:
-- table t1 has 100 unique values for a
-- and 10000 b values for each a
-- very suitable for skip scan
create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 100) a, generate_series(1,10000) b;
create index on t1 (a,b,c);

-- table t2 has 10000 unique values for a
-- and 100 b values for each a 
-- this is not very suitable for skip scan
-- because the next matching value is always either
-- on the current page or on the next page
create table t2 as select a,b,b%5 as c, random() as d from generate_series(1, 10000) a, generate_series(1,100) b;
create index on t2 (a,b,c);

analyze t1;
analyze t2;

-- First 'Execution Time' line is this patched version (0001+0002+0003) (without including 0003, the non-DISTINCT
querieswould be equal to master)
 
-- Second 'Execution Time' line is master
-- Third 'Execution Time' is previous skip scan patch version
-- Just ran a couple of times to give an indication 
-- on order of magnitude, not a full benchmark.
select distinct on (a) * from t1;
 Execution Time: 1.407 ms (current patch)
 Execution Time: 480.704 ms (master)
 Execution Time: 1.711 ms (previous patch)

select distinct on (a) * from t1 where b > 50;
 Execution Time: 1.432 ms
 Execution Time: 481.530 ms
 Execution Time: 481.206 ms

select distinct on (a) * from t1 where b > 9990;
 Execution Time: 1.074 ms
 Execution Time: 33.937 ms
 Execution Time: 33.115 ms

select distinct on (a) * from t1 where d > 0.5;
 Execution Time: 0.811 ms
 Execution Time: 446.549 ms
 Execution Time: 436.091 ms

select * from t1 where b=50;
 Execution Time: 1.111 ms
 Execution Time: 33.242 ms
 Execution Time: 36.555 ms

select * from t1 where b between 50 and 75 and d > 0.5;
 Execution Time: 2.370 ms
 Execution Time: 60.744 ms
 Execution Time: 62.820 ms

select * from t1 where b in (100, 200);
 Execution Time: 2.464 ms
 Execution Time: 252.224 ms
 Execution Time: 244.872 ms

select * from t1 where b in (select distinct a from t1);
 Execution Time: 91.000 ms
 Execution Time: 842.969 ms
 Execution Time: 386.871 ms

select distinct on (a) * from t2;
 Execution Time: 47.155 ms
 Execution Time: 714.102 ms
 Execution Time: 56.327 ms

select distinct on (a) * from t2 where b > 5;
 Execution Time: 60.100 ms
 Execution Time: 709.960 ms
 Execution Time: 727.949 ms

select distinct on (a) * from t2 where b > 95;
 Execution Time: 55.420 ms
 Execution Time: 71.007 ms
 Execution Time: 69.229 ms

select distinct on (a) * from t2 where d > 0.5;
 Execution Time: 49.254 ms
 Execution Time: 719.820 ms
 Execution Time: 705.991 ms

-- slight performance degradation here compared to regular index scan
-- due to data unfavorable data distribution
select * from t2 where b=50;
 Execution Time: 47.603 ms
 Execution Time: 37.327 ms
 Execution Time: 40.448 ms

select * from t2 where b between 50 and 75 and d > 0.5;
 Execution Time: 244.546 ms
 Execution Time: 228.579 ms
 Execution Time: 227.541 ms

select * from t2 where b in (100, 200);
 Execution Time: 64.021 ms
 Execution Time: 242.905 ms
 Execution Time: 258.864 ms

select * from t2 where b in (select distinct a from t2);
 Execution Time: 758.350 ms
 Execution Time: 1271.230 ms
 Execution Time: 761.311 ms

I wrote a few things here about the method as well:
https://github.com/fvannee/postgres/wiki/Index-Skip-Scan
Code can be found there on Github as well in branch 'skip-scan'

Attachment

RE: Index Skip Scan

From

Floris Van Nee

Date:

22 March 2020, 12:55:29

It seems that the documentation build was broken. I've fixed it in attached patch.

I'm unsure which version number to give this patch (to continue with numbers from previous skip scan patches, or to
startnumbering from scratch again). It's a rather big change, so one could argue it's mostly a separate patch. I guess
itmostly depends on how close the original versions were to be committable. Thoughts?
 

-Floris

Attachment

Re: Index Skip Scan

From

Thomas Munro

Date:

22 March 2020, 22:23:32

Hi Floris,

On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee
<florisvannee@optiver.com> wrote:
> create index on t1 (a,b,c);

> select * from t1 where b in (100, 200);
>  Execution Time: 2.464 ms
>  Execution Time: 252.224 ms
>  Execution Time: 244.872 ms

Wow.  This is very cool work and I'm sure it will become a major
headline feature of PG14 if the requisite planner brains can be sorted
out.

On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> I'm unsure which version number to give this patch (to continue with numbers from previous skip scan patches, or to
startnumbering from scratch again). It's a rather big change, so one could argue it's mostly a separate patch. I guess
itmostly depends on how close the original versions were to be committable. Thoughts? 

I don't know, but from the sidelines, it'd be nice to see the unique
path part go into PG13, where IIUC it can power the "useless unique
removal" patch.

Re: Index Skip Scan

From

Andy Fan

Date:

23 March 2020, 11:17:56

On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee <florisvannee@optiver.com> wrote:
> I'm unsure which version number to give this patch (to continue with numbers from previous skip scan patches, or to start numbering from scratch again). It's a rather big change, so one could argue it's mostly a separate patch. I guess it mostly depends on how close the original versions were to be committable. Thoughts?

I don't know, but from the sidelines, it'd be nice to see the unique
path part go into PG13, where IIUC it can power the "useless unique
removal" patch.

Actually I have a patch to remove the distinct clause some long time ago[1],

and later it came to the UniqueKey as well, you can see [2] for the current

status.

[1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107

[2] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

24 March 2020, 16:39:19

> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
>
> Yes, I was complaining that a ProjectionPath breaks the optimisation
> and I don't believe there's any reason that it should.
>
> I believe the way to make that work correctly requires paying
> attention to the Path's uniquekeys rather than what type of path it
> is.

Thanks for the suggestion. As a result of the discussion I've modified
the patch, does it look similar to what you had in mind?

In this version if all conditions are met and there are corresponding
unique keys, a new index skip scan path will be added to
unique_pathlists. In case if requested distinct clauses match with
unique keys, create_distinct_paths can choose this path without needen
to know what kind of path is it. Also unique_keys are passed through
ProjectionPath, so optimization for the example mentioned in this thread
before now should work (I've added one test for that).

I haven't changed anything about UniqueKey structure itself (one of the
suggestions was about Expr instead of EquivalenceClass), but I believe
we need anyway to figure out how two existing imlementation (in this
patch and from [1]) of this idea can be connected.

[1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

Attachment

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

24 March 2020, 16:42:21

> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote:
>
> There was a dedicated thread  [1]  where David explain his idea very
> detailed, and you can also check that messages around that message for
> the context. hope it helps.

Thanks for pointing out to this thread! Somehow I've missed it, and now
looks like we need to make some efforts to align patches for index skip
scan and distincClause elimination.

Re: Index Skip Scan

From

Andy Fan

Date:

25 March 2020, 00:39:29

On Wed, Mar 25, 2020 at 12:41 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote:
>
> There was a dedicated thread [1] where David explain his idea very
> detailed, and you can also check that messages around that message for
> the context. hope it helps.

Thanks for pointing out to this thread! Somehow I've missed it, and now
looks like we need to make some efforts to align patches for index skip
scan and distincClause elimination.

Yes:). Looks Index skip scan is a way of make a distinct result without a real

distinct node, which happens after the UniqueKeys check where I try to see if

the result is unique already and before the place where create a unique node

for distinct node(With index skip scan we don't need it all). Currently in my patch,

the logical here is 1). Check the UniqueKey to see if the result is not unique already.

if not, go to next 2). After the distinct paths are created, I will add the result of distinct

path as a unique key. Will you add the index skip scan path during create_distincts_paths

and add the UniqueKey to RelOptInfo? if so I guess my current patch can handle it since

it cares about the result of distinct path but no worried about how we archive that.

Best Regards

Andy Fan

Re: Index Skip Scan

From

Dilip Kumar

Date:

25 March 2020, 06:01:56

On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
> >
> > Yes, I was complaining that a ProjectionPath breaks the optimisation
> > and I don't believe there's any reason that it should.
> >
> > I believe the way to make that work correctly requires paying
> > attention to the Path's uniquekeys rather than what type of path it
> > is.
>
> Thanks for the suggestion. As a result of the discussion I've modified
> the patch, does it look similar to what you had in mind?
>
> In this version if all conditions are met and there are corresponding
> unique keys, a new index skip scan path will be added to
> unique_pathlists. In case if requested distinct clauses match with
> unique keys, create_distinct_paths can choose this path without needen
> to know what kind of path is it. Also unique_keys are passed through
> ProjectionPath, so optimization for the example mentioned in this thread
> before now should work (I've added one test for that).
>
> I haven't changed anything about UniqueKey structure itself (one of the
> suggestions was about Expr instead of EquivalenceClass), but I believe
> we need anyway to figure out how two existing imlementation (in this
> patch and from [1]) of this idea can be connected.
>
> [1]:
https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

---
 src/backend/nodes/outfuncs.c          | 14 ++++++
 src/backend/nodes/print.c             | 39 +++++++++++++++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41 ++++++++++++++++
 src/backend/optimizer/path/pathkeys.c | 71 ++++++++++++++++++++++-----
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +++++++++++++-
 src/backend/optimizer/util/pathnode.c | 46 +++++++++++++----
 src/include/nodes/nodes.h             |  1 +
 src/include/nodes/pathnodes.h         | 19 +++++++
 src/include/nodes/print.h             |  1 +
 src/include/optimizer/pathnode.h      |  2 +
 src/include/optimizer/paths.h         | 11 +++++
 15 files changed, 272 insertions(+), 23 deletions(-)

Seems like you forgot to add the uniquekey.c file in the
v33-0001-Unique-key.patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

25 March 2020, 08:51:00

> On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
>
> Seems like you forgot to add the uniquekey.c file in the
> v33-0001-Unique-key.patch.

Oh, you're right, thanks. Here is the corrected patch.

> On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
>
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> > would appreciate if in the future you can make it more visible that changes you
> > suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> > that that's the case, and it doesn't make sense to pull into different direction
> > when we're trying to achieve the same goal :)
>
> I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned
ithere of course. It's just that because my patch handles filter conditions in general, it works for this case too.

Oh, then fortunately I've got a wrong impression, sorry and thanks for
clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to
pullthe information about skipping one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this
situation, which is if I'm not mistaken could be solved by inspecting
qual conditions to figure out if they're covered in the index -
something like in attachments (this implementation is actually too
restrictive in this sense and will not allow e.g. expressions, that's
why I haven't bumped patch set version for it - soon I'll post an
extended version).

Other than that to summarize current open points for future readers
(this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more
  use cases (hopefully it was covered by the v33, but I still need a
  confirmation from David, like blinking twice or something).

* Suspicious performance difference between different type of workload,
  mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions, that are not covered by the index,
  to make skipping more flexible (one of the potential next steps in the
  future, as suggested by Floris).

On Fri, 15 May 2020 at 6:06 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
>
> + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> + {
>
> Here we expect whether the "next" unique key can be found on this page
> or not, but we are using the function which suggested whether the
> "current" key can be found on this page or not. I think in boundary
> cases where the high key is equal to the current key, this function
> will return true (which is expected from this function), and based on
> that we will simply scan the current page and IMHO that cost could be
> avoided no?

Yes, looks like you're right, there is indeed an unecessary extra scan
happening. To avoid that we can see the key->nextkey and adjust higher
boundary correspondingly. Will also add this into the next rebased
patch, thanks!

Great thanks!

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Index Skip Scan

From

Andy Fan

Date:

02 June 2020, 12:36:31

On Wed, Apr 8, 2020 at 3:41 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Mon, Apr 06, 2020 at 06:31:08PM +0000, Floris Van Nee wrote:
>
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> > would appreciate if in the future you can make it more visible that changes you
> > suggest contain some fixes. E.g. it wasn't clear for me from your previous email
> > that that's the case, and it doesn't make sense to pull into different direction
> > when we're trying to achieve the same goal :)
>
> I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too.

Oh, then fortunately I've got a wrong impression, sorry and thanks for
clarification :)

> > > > In the patch I posted a week ago these cases are all handled
> > > > correctly, as it introduces this extra logic in the Executor.
> > >
> > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> >
> > I'll definitely take a look at suggested changes in filtering part.
>
> It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code.

I was actually thinking more about just preventing skip scan in this
situation, which is if I'm not mistaken could be solved by inspecting
qual conditions to figure out if they're covered in the index -
something like in attachments (this implementation is actually too
restrictive in this sense and will not allow e.g. expressions, that's
why I haven't bumped patch set version for it - soon I'll post an
extended version).

Other than that to summarize current open points for future readers
(this thread somehow became quite big):

* Making UniqueKeys usage more generic to allow using skip scan for more
use cases (hopefully it was covered by the v33, but I still need a
confirmation from David, like blinking twice or something).

* Suspicious performance difference between different type of workload,
mentioned by Tomas (unfortunately I had no chance yet to investigate).

* Thinking about supporting conditions, that are not covered by the index,
to make skipping more flexible (one of the potential next steps in the
future, as suggested by Floris).

Looks this is the latest patch, which commit it is based on? Thanks

Best Regards

Andy Fan

Re: Index Skip Scan

From

Dmitry Dolgov

Date:

02 June 2020, 13:40:06

> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote:
>
> > Other than that to summarize current open points for future readers
> > (this thread somehow became quite big):
> >
> > * Making UniqueKeys usage more generic to allow using skip scan for more
> >   use cases (hopefully it was covered by the v33, but I still need a
> >   confirmation from David, like blinking twice or something).
> >
> > * Suspicious performance difference between different type of workload,
> >   mentioned by Tomas (unfortunately I had no chance yet to investigate).
> >
> > * Thinking about supporting conditions, that are not covered by the index,
> >   to make skipping more flexible (one of the potential next steps in the
> >   future, as suggested by Floris).
> >
>
> Looks this is the latest patch,  which commit it is based on?  Thanks

I have a rebased version, if you're about it. Didn't posted it yet
mostly since I'm in the middle of adapting it to the UniqueKeys from
other thread. Would it be ok for you to wait a bit until I'll post
finished version?

Re: Index Skip Scan

From

Andy Fan

Date:

02 June 2020, 23:17:59

On Tue, Jun 2, 2020 at 9:38 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote:
>
> > Other than that to summarize current open points for future readers
> > (this thread somehow became quite big):
> >
> > * Making UniqueKeys usage more generic to allow using skip scan for more
> > use cases (hopefully it was covered by the v33, but I still need a
> > confirmation from David, like blinking twice or something).
> >
> > * Suspicious performance difference between different type of workload,
> > mentioned by Tomas (unfortunately I had no chance yet to investigate).
> >
> > * Thinking about supporting conditions, that are not covered by the index,
> > to make skipping more flexible (one of the potential next steps in the
> > future, as suggested by Floris).
> >
>
> Looks this is the latest patch, which commit it is based on? Thanks

I have a rebased version, if you're about it. Didn't posted it yet
mostly since I'm in the middle of adapting it to the UniqueKeys from
other thread. Would it be ok for you to wait a bit until I'll post
finished version?

Sure, that's OK. The discussion on UniqueKey thread looks more complex

than what I expected, that's why I want to check the code here, but that's fine,

you can work on your schedule.

Best Regards

Andy Fan