Thread: DISTINCT with btree skip scan

DISTINCT with btree skip scan

From
Thomas Munro
Date:
Hello

As an exercise I hacked up the simplest code I could think of that would demonstrate a faster DISTINCT based on skipping ahead to the next distinct value in an index-only scan. Please see the attached (extremely buggy) patch, and the example session below.  (It's against my natural instinct to send such half-baked early hacking phase code to the list, but thought it would make sense to demo the concept and then seek advice, warnings, cease and desist notices etc before pressing on down that route!)  I would be most grateful for any feedback you might have.

Best regards,

Thomas Munro

postgres=# create table foo (a int, b int, primary key (a, b));
CREATE TABLE
Time: 9.002 ms
postgres=# insert into foo select generate_series(1, 5000000) % 10, generate_series(1, 5000000);
INSERT 0 5000000
Time: 35183.444 ms
postgres=# analyze foo;
ANALYZE
Time: 493.168 ms

postgres=# set enable_hashagg = true;
SET
Time: 0.274 ms
postgres=# explain select distinct a from foo;
┌───────────────────────────────────────────────────────────────────┐
│                            QUERY PLAN                             │
├───────────────────────────────────────────────────────────────────┤
│ HashAggregate  (cost=84624.00..84624.10 rows=10 width=4)          │
│   Group Key: a                                                    │
│   ->  Seq Scan on foo  (cost=0.00..72124.00 rows=5000000 width=4) │
│ Planning time: 0.065 ms                                           │
└───────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 0.500 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)

Time: 2017.019 ms

postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                             QUERY PLAN                                              │
├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo  (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(2 rows)

Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.565 ms
Attachment

Re: DISTINCT with btree skip scan

From
Vik Fearing
Date:
On 07/05/2014 02:17 AM, Thomas Munro wrote:
> As an exercise I hacked up the simplest code I could think of that would
> demonstrate a faster DISTINCT based on skipping ahead to the next
> distinct value in an index-only scan. Please see the attached (extremely
> buggy) patch, and the example session below.  (It's against my natural
> instinct to send such half-baked early hacking phase code to the list,
> but thought it would make sense to demo the concept and then seek
> advice, warnings, cease and desist notices etc before pressing on down
> that route!)  I would be most grateful for any feedback you might have.

This is called a Loose Index Scan in our wiki[1] which I believe is
based on the terminology used for the same feature in MySQL although I
haven't and shan't check.

This is a feature I would really like to have.

Your benchmarks look promising but unfortunately it blows up for me so I
can't really test it.  I have not yet read the patch nor debugged to see
why, but please do keep up work on this.  People more expert than I can
tell you whether you're implementing it the right way.

[1] http://wiki.postgresql.org/wiki/Loose_indexscan
-- 
Vik



Re: DISTINCT with btree skip scan

From
Thomas Munro
Date:
On 5 July 2014 02:03, Vik Fearing <vik.fearing@dalibo.com> wrote:
> [1] http://wiki.postgresql.org/wiki/Loose_indexscan

Thanks. It talks about DISTINCT, and also about using index when you
don't have the leading column in your WHERE clause (as well as MySQL
("loose"), at least Oracle ("skip"), SQLite ("skip"), DB2 ("jump") can
do this).  It looks like at least MySQL can also use loose index scans
to implement GROUP BY in certain cases involving MIN or MAX aggregate
functions (things like SELECT a, MIN(b) FROM foo GROUP BY a, given an
index on (a, b)).

But I'm only trying to implement the lowest hanging index skipping
plan: plain old DISTINCT.  I think I see roughly how to cost, plan and
execute it...  now to learn a lot more about PG internals...



Re: DISTINCT with btree skip scan

From
David Rowley
Date:


On Sat, Jul 5, 2014 at 12:17 PM, Thomas Munro <munro@ip9.org> wrote:
postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                             QUERY PLAN                                              │
├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo  (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(2 rows)

Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.565 ms



Hi Thomas,

I've had a quick look at this and it seems like a great win! I'm quite surprised that we've not got this already. I think this technology could also really help performance of queries such as SELECT * from bigtable bt WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol = obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve that first off, but it could be done later once the benefits of this are more commonly realised.

I think our shortfalls in this area have not gone unnoticed. I was reading this post https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html about comparing performance of COUNT(DISTINCT col). I think this would give a big win for query 3 in that post. I'm trying to find some time make some changes to transform queries to allow the group by to happen before the joins when possible, so between that and your patch we'd be 2 steps closer to making query 1 in the link above perform a little better on PostgreSQL.

Do you think you'll manage to get time to look at this a bit more? I'd be keen to look into the costing side of it if you think that'll help you any?

Regards

David Rowley

Re: DISTINCT with btree skip scan

From
Thomas Munro
Date:
On 27 October 2014 20:24, David Rowley <dgrowleyml@gmail.com> wrote:
> I've had a quick look at this and it seems like a great win! I'm quite
> surprised that we've not got this already. I think this technology could
> also really help performance of queries such as SELECT * from bigtable bt
> WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol =
> obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve
> that first off, but it could be done later once the benefits of this are
> more commonly realised.
>
> I think our shortfalls in this area have not gone unnoticed. I was reading
> this post
> https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
> about comparing performance of COUNT(DISTINCT col). I think this would give
> a big win for query 3 in that post. I'm trying to find some time make some
> changes to transform queries to allow the group by to happen before the
> joins when possible, so between that and your patch we'd be 2 steps closer
> to making query 1 in the link above perform a little better on PostgreSQL.
>
> Do you think you'll manage to get time to look at this a bit more? I'd be
> keen to look into the costing side of it if you think that'll help you any?

Hi David,

Sorry for the silence, I have been busy moving countries.

I am definitely interested in collaborating on a series of patches to
implement various kinds of skip-based plans as seen in other RDBMSs if
others think it could be useful.  I see skip-based DISTINCT as a good
place to start.  (I suspect the semi-related skip scan plan (for the
case when your WHERE clause doesn't have a qual for the leading
column(s)) is much harder to plan and execute and I also suspect it's
less generally useful).

Here is a rebased version of that patch which fixes a crashing
off-by-one error, and handles backward scans properly, I think.  This
is still just a quick hack to play with the general idea at this
stage.

It works by adding a new index operation 'skip' which the executor
code can use during a scan to advance to the next value (for some
prefix of the index's columns).  That may be a terrible idea and
totally unnecessary... but let me explain my
reasoning:

1.  Perhaps some index implementations can do something better than a
search for the next key value from the root.  Is it possible or
desirable to use the current position as a starting point for a btree
traversal?  I don't know.

2.  It seemed that I'd need to create a new search ScanKey to use the
'rescan' interface for skipping to the next value, but I already had
an insertion ScanKey so I wanted a way to just reuse that.  But maybe
there is some other way to reuse existing index interfaces, or maybe
there is an easy way to make a new search ScanKey from the existing
insertion ScanKey?

Currently it uses the existing IndexOnlyScan plan node, adding an
extra member.  I briefly tried making a different plan called
DistinctIndexScan but it seemed I was going to need to duplicate or
refactor/share a lot of IndexOnlyScan code (this might be the right
thing to do but at this stage I'm just trying to demo the general idea
with minimal changes).

As for costing, I haven't looked at that part of PG at all yet.  If
you *can* do a distinct index scan, would you *always* want to?  How
well could we estimate the cost using existing statistics?  I suppose
the expected number of page fetches is proportional to the expected
number of distinct values (ie skip operations) times btree depth, and
the cost may be adjusted in some way for cache effects (?), but how
well can we estimate the number of distinct values (a), (a, b) or (a,
b, c) in some range for an index on (a, b, c, d)?

As before, you still need the kludge of disabling hashagg to reach the new plan:

postgres=# set enable_hashagg = true;
SET
Time: 0.213 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)

Time: 890.166 ms
postgres=# set enable_hashagg = false;
SET
Time: 0.271 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.573 ms

Best regards,
Thomas Munro

Attachment

Re: DISTINCT with btree skip scan

From
Thomas Munro
Date:
On Sat, Nov 1, 2014 at 10:35 AM, Thomas Munro <munro@ip9.org> wrote:
> I am definitely interested in collaborating on a series of patches to
> implement various kinds of skip-based plans as seen in other RDBMSs if
> others think it could be useful.  I see skip-based DISTINCT as a good
> place to start.  (I suspect the semi-related skip scan plan (for the
> case when your WHERE clause doesn't have a qual for the leading
> column(s)) is much harder to plan and execute and I also suspect it's
> less generally useful).

There's an interesting thread about merge joins with a skip
optimisation[1] ("leapfrog trie-join", though I haven't read the paper
yet).  That reminded me to go and rebase this experimental patch which
is somewhat related.  Here it is, now split into two parts: one patch
to add an amskip operation, and another to consider using it to
implement DISTINCT when it was already has an index only scan path on
an index that supports skipping.  As you can see it's still sketchy
experiment-grade code, but it now has a first attempt at costing.

postgres=# select count(a) from foo;
┌─────────┐
│  count  │
├─────────┤
│ 5000000 │
└─────────┘
(1 row)

Time: 493.501 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.516 ms
postgres=# explain select distinct a from foo;

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                     QUERY PLAN
                                               │

├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct values of leading 1 column(s) using
foo_pkey on foo  (cost=0.43..4.33 rows=10 width=4) │

└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

Time: 0.378 ms

[1] https://www.postgresql.org/message-id/flat/CAFQtOwp4e1hU6gKi1R_nHgOLTzUJko8EcaC1FdAV0rQFo1ARzA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: DISTINCT with btree skip scan

From
Thomas Munro
Date:
On Wed, Oct 12, 2016 at 4:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here it is, now split into two parts: one patch
> to add an amskip operation, and another to consider using it to
> implement DISTINCT when it was already has an index only scan path on
> an index that supports skipping.

Those patches add an amskip operation and then teach the planner and
executor how to do this, given a table foo (a, b) with an index on (a)
or (a, b):
 SELECT DISTINCT foo FROM a
 Index Only Scan for distinct values of (a) using foo_pkey on foo

In just the right circumstances that is vastly faster than scanning
every tuple and uniquifying.  I was thinking that the next step in
this experiment might be to introduce a kind of plan that can handle
queries where the index's leading column is not in the WHERE clause,
building on the above plan, and behaving conceptually a little bit
like a Nested Loop, like so:
 SELECT * FROM foo WHERE b = 42
 Index Skip Scan   ->  Index Only Scan for distinct values of (a) using foo_pkey on foo   ->  Index Only Scan using
foo_pkeyon foo         Index Cond: ((a = outer.a) AND (b = 42))
 

I suppose you could instead have a single node that knows how to
perform the whole scan itself so that you don't finish up searching
for each distinct value in both the inner and outer subplan, but it
occurred to me that building the plan out of Lego bricks like this
might generalise better to more complicated queries.  I suppose it
might also parallelise nicely with a Parallel Index Only Scan as the
outer plan.

Maybe something similar is possible for handling "GROUP BY a".

Worth pursuing?  Does amskip suck?  Does anyone have better ideas,
either for how to do the low level skip or the higher level Index Skip
Scan, or perhaps a completely different way of looking at this?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: DISTINCT with btree skip scan

From
Geoff Winkless
Date:
On 23 November 2016 at 21:19, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Worth pursuing?  Does amskip suck?  Does anyone have better ideas,
> either for how to do the low level skip or the higher level Index Skip
> Scan, or perhaps a completely different way of looking at this?

I have no helpful suggestions with how to achieve it, but can I add a
voice of encouragement: there have been a good few occasions in the
past year (we moved from another db to PG) where the lack of skip
scans has bitten us; in that case it was using MIN() and GROUP BY,
where the grouping column was the first element in the compound index
and the aggregate column was the second: in the Other DB that sort of
query was extremely quick, not so much here.

I was also idly pondering (in one of those moments of conceited
self-delusion where I thought I might actually have enough spare time
to try to work on it myself) whether it would be possible to implement
that sort of skip with two indexes (so MIN(a) GROUP BY b with separate
indexes on (a) and (b) rather than a single index (a,b)); I never got
much further than idle musings though.

Geoff



Re: DISTINCT with btree skip scan

From
Robert Haas
Date:
On Wed, Nov 23, 2016 at 4:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Oct 12, 2016 at 4:19 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Here it is, now split into two parts: one patch
>> to add an amskip operation, and another to consider using it to
>> implement DISTINCT when it was already has an index only scan path on
>> an index that supports skipping.
>
> Those patches add an amskip operation and then teach the planner and
> executor how to do this, given a table foo (a, b) with an index on (a)
> or (a, b):
>
>   SELECT DISTINCT foo FROM a
>
>   Index Only Scan for distinct values of (a) using foo_pkey on foo

Cool!

> In just the right circumstances that is vastly faster than scanning
> every tuple and uniquifying.  I was thinking that the next step in
> this experiment might be to introduce a kind of plan that can handle
> queries where the index's leading column is not in the WHERE clause,
> building on the above plan, and behaving conceptually a little bit
> like a Nested Loop, like so:
>
>   SELECT * FROM foo WHERE b = 42
>
>   Index Skip Scan
>     ->  Index Only Scan for distinct values of (a) using foo_pkey on foo
>     ->  Index Only Scan using foo_pkey on foo
>           Index Cond: ((a = outer.a) AND (b = 42))

Blech, that seems really ugly and probably inefficient.

> I suppose you could instead have a single node that knows how to
> perform the whole scan itself so that you don't finish up searching
> for each distinct value in both the inner and outer subplan, but it
> occurred to me that building the plan out of Lego bricks like this
> might generalise better to more complicated queries.  I suppose it
> might also parallelise nicely with a Parallel Index Only Scan as the
> outer plan.

I think putting all the smarts in a single node is likely to be
better, faster, and cleaner.

This patch has gotten favorable comments from several people, but
somebody really needs to REVIEW it.

Oh, well, since I'm here...

+    if (ScanDirectionIsForward(dir))
+    {
+        so->currPos.moreLeft = false;
+        so->currPos.moreRight = true;
+    }
+    else
+    {
+        so->currPos.moreLeft = true;
+        so->currPos.moreRight = false;
+    }

The lack of comments makes it hard for me to understand what the
motivation for this is, but I bet it's wrong.  Suppose there's a
cursor involved here and the user tries to back up. Instead of having
a separate amskip operation, maybe there should be a flag attached to
a scan indicating whether it should return only distinct results.
Otherwise, you're allowing for the possibility that the same scan
might sometimes skip and other times not skip, but then it seems hard
for the scan to survive cursor operations.  Anyway, why would that be
useful?

+    if (ScanDirectionIsForward(dir))
+        offnum = OffsetNumberPrev(offnum);
+    if (!_bt_readpage(scan, dir, offnum))
+    {
+        if (!_bt_steppage(scan, dir))
+        {
+            return false;
+        }
+    }

As the TODO:TM comment at the top of the function sort of implies
without directly coming out and saying so, this is ugly and suggests
that you've picked the wrong API.  If the idea I proposed above isn't
appealing, I still think we need to get rid of this by some other
means.

+    /*
+     * TODO:TM For now, we use _bt_search to search from the root; in theory
+     * we should be able to do a local traversal ie from the current page, but
+     * I don't know if it would actually be better in general.
+     */

I think that this depends a lot on the number of duplicates.  If we
end up merely fast-forwarding within the same page to find the next
unique value, re-traversing from the root sucks.  However, if the next
distinct key *isn't* on the same page, then traversing from the root
is a good bet.  A typical btree is only a few levels deep (like 3) so
once you don't find what you're looking for on the same page it's
probably fairly sound to re-traverse from the root.  Of course you
lose at least a little bit in the case where you would have found the
next distinct key within a page or two but in other cases you win big.
I wonder if that would be a suitable heuristic: first check the
current page, if all keys there are equal then retraverse from the
root.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company