Thread: DISTINCT with btree skip scan
Hello
As an exercise I hacked up the simplest code I could think of that would demonstrate a faster DISTINCT based on skipping ahead to the next distinct value in an index-only scan. Please see the attached (extremely buggy) patch, and the example session below. (It's against my natural instinct to send such half-baked early hacking phase code to the list, but thought it would make sense to demo the concept and then seek advice, warnings, cease and desist notices etc before pressing on down that route!) I would be most grateful for any feedback you might have.
Best regards,
Thomas Munro
postgres=# create table foo (a int, b int, primary key (a, b));
CREATE TABLE
Time: 9.002 ms
postgres=# insert into foo select generate_series(1, 5000000) % 10, generate_series(1, 5000000);
INSERT 0 5000000
Time: 35183.444 ms
postgres=# analyze foo;
ANALYZE
Time: 493.168 ms
postgres=# set enable_hashagg = true;
SET
Time: 0.274 ms
postgres=# explain select distinct a from foo;
┌───────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────┤
│ HashAggregate (cost=84624.00..84624.10 rows=10 width=4) │
│ Group Key: a │
│ -> Seq Scan on foo (cost=0.00..72124.00 rows=5000000 width=4) │
│ Planning time: 0.065 ms │
└───────────────────────────────────────────────────────────────────┘
(4 rows)
Time: 0.500 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)
Time: 2017.019 ms
postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(2 rows)
Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)
Time: 0.565 ms
As an exercise I hacked up the simplest code I could think of that would demonstrate a faster DISTINCT based on skipping ahead to the next distinct value in an index-only scan. Please see the attached (extremely buggy) patch, and the example session below. (It's against my natural instinct to send such half-baked early hacking phase code to the list, but thought it would make sense to demo the concept and then seek advice, warnings, cease and desist notices etc before pressing on down that route!) I would be most grateful for any feedback you might have.
Best regards,
Thomas Munro
postgres=# create table foo (a int, b int, primary key (a, b));
CREATE TABLE
Time: 9.002 ms
postgres=# insert into foo select generate_series(1, 5000000) % 10, generate_series(1, 5000000);
INSERT 0 5000000
Time: 35183.444 ms
postgres=# analyze foo;
ANALYZE
Time: 493.168 ms
postgres=# set enable_hashagg = true;
SET
Time: 0.274 ms
postgres=# explain select distinct a from foo;
┌───────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────┤
│ HashAggregate (cost=84624.00..84624.10 rows=10 width=4) │
│ Group Key: a │
│ -> Seq Scan on foo (cost=0.00..72124.00 rows=5000000 width=4) │
│ Planning time: 0.065 ms │
└───────────────────────────────────────────────────────────────────┘
(4 rows)
Time: 0.500 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)
Time: 2017.019 ms
postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(2 rows)
Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)
Time: 0.565 ms
Attachment
On 07/05/2014 02:17 AM, Thomas Munro wrote: > As an exercise I hacked up the simplest code I could think of that would > demonstrate a faster DISTINCT based on skipping ahead to the next > distinct value in an index-only scan. Please see the attached (extremely > buggy) patch, and the example session below. (It's against my natural > instinct to send such half-baked early hacking phase code to the list, > but thought it would make sense to demo the concept and then seek > advice, warnings, cease and desist notices etc before pressing on down > that route!) I would be most grateful for any feedback you might have. This is called a Loose Index Scan in our wiki[1] which I believe is based on the terminology used for the same feature in MySQL although I haven't and shan't check. This is a feature I would really like to have. Your benchmarks look promising but unfortunately it blows up for me so I can't really test it. I have not yet read the patch nor debugged to see why, but please do keep up work on this. People more expert than I can tell you whether you're implementing it the right way. [1] http://wiki.postgresql.org/wiki/Loose_indexscan -- Vik
On 5 July 2014 02:03, Vik Fearing <vik.fearing@dalibo.com> wrote: > [1] http://wiki.postgresql.org/wiki/Loose_indexscan Thanks. It talks about DISTINCT, and also about using index when you don't have the leading column in your WHERE clause (as well as MySQL ("loose"), at least Oracle ("skip"), SQLite ("skip"), DB2 ("jump") can do this). It looks like at least MySQL can also use loose index scans to implement GROUP BY in certain cases involving MIN or MAX aggregate functions (things like SELECT a, MIN(b) FROM foo GROUP BY a, given an index on (a, b)). But I'm only trying to implement the lowest hanging index skipping plan: plain old DISTINCT. I think I see roughly how to cost, plan and execute it... now to learn a lot more about PG internals...
On Sat, Jul 5, 2014 at 12:17 PM, Thomas Munro <munro@ip9.org> wrote:
postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
(2 rows)
Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)
Time: 0.565 ms
Hi Thomas,
I've had a quick look at this and it seems like a great win! I'm quite surprised that we've not got this already. I think this technology could also really help performance of queries such as SELECT * from bigtable bt WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol = obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve that first off, but it could be done later once the benefits of this are more commonly realised.
I think our shortfalls in this area have not gone unnoticed. I was reading this post https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html about comparing performance of COUNT(DISTINCT col). I think this would give a big win for query 3 in that post. I'm trying to find some time make some changes to transform queries to allow the group by to happen before the joins when possible, so between that and your patch we'd be 2 steps closer to making query 1 in the link above perform a little better on PostgreSQL.
Do you think you'll manage to get time to look at this a bit more? I'd be keen to look into the costing side of it if you think that'll help you any?
Regards
David Rowley
On 27 October 2014 20:24, David Rowley <dgrowleyml@gmail.com> wrote: > I've had a quick look at this and it seems like a great win! I'm quite > surprised that we've not got this already. I think this technology could > also really help performance of queries such as SELECT * from bigtable bt > WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol = > obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve > that first off, but it could be done later once the benefits of this are > more commonly realised. > > I think our shortfalls in this area have not gone unnoticed. I was reading > this post > https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html > about comparing performance of COUNT(DISTINCT col). I think this would give > a big win for query 3 in that post. I'm trying to find some time make some > changes to transform queries to allow the group by to happen before the > joins when possible, so between that and your patch we'd be 2 steps closer > to making query 1 in the link above perform a little better on PostgreSQL. > > Do you think you'll manage to get time to look at this a bit more? I'd be > keen to look into the costing side of it if you think that'll help you any? Hi David, Sorry for the silence, I have been busy moving countries. I am definitely interested in collaborating on a series of patches to implement various kinds of skip-based plans as seen in other RDBMSs if others think it could be useful. I see skip-based DISTINCT as a good place to start. (I suspect the semi-related skip scan plan (for the case when your WHERE clause doesn't have a qual for the leading column(s)) is much harder to plan and execute and I also suspect it's less generally useful). Here is a rebased version of that patch which fixes a crashing off-by-one error, and handles backward scans properly, I think. This is still just a quick hack to play with the general idea at this stage. It works by adding a new index operation 'skip' which the executor code can use during a scan to advance to the next value (for some prefix of the index's columns). That may be a terrible idea and totally unnecessary... but let me explain my reasoning: 1. Perhaps some index implementations can do something better than a search for the next key value from the root. Is it possible or desirable to use the current position as a starting point for a btree traversal? I don't know. 2. It seemed that I'd need to create a new search ScanKey to use the 'rescan' interface for skipping to the next value, but I already had an insertion ScanKey so I wanted a way to just reuse that. But maybe there is some other way to reuse existing index interfaces, or maybe there is an easy way to make a new search ScanKey from the existing insertion ScanKey? Currently it uses the existing IndexOnlyScan plan node, adding an extra member. I briefly tried making a different plan called DistinctIndexScan but it seemed I was going to need to duplicate or refactor/share a lot of IndexOnlyScan code (this might be the right thing to do but at this stage I'm just trying to demo the general idea with minimal changes). As for costing, I haven't looked at that part of PG at all yet. If you *can* do a distinct index scan, would you *always* want to? How well could we estimate the cost using existing statistics? I suppose the expected number of page fetches is proportional to the expected number of distinct values (ie skip operations) times btree depth, and the cost may be adjusted in some way for cache effects (?), but how well can we estimate the number of distinct values (a), (a, b) or (a, b, c) in some range for an index on (a, b, c, d)? As before, you still need the kludge of disabling hashagg to reach the new plan: postgres=# set enable_hashagg = true; SET Time: 0.213 ms postgres=# select distinct a from foo; ┌───┐ │ a │ ├───┤ │ 6 │ │ 8 │ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 │ │ 9 │ │ 7 │ │ 0 │ └───┘ (10 rows) Time: 890.166 ms postgres=# set enable_hashagg = false; SET Time: 0.271 ms postgres=# select distinct a from foo; ┌───┐ │ a │ ├───┤ │ 0 │ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 │ │ 6 │ │ 7 │ │ 8 │ │ 9 │ └───┘ (10 rows) Time: 0.573 ms Best regards, Thomas Munro
Attachment
On Sat, Nov 1, 2014 at 10:35 AM, Thomas Munro <munro@ip9.org> wrote: > I am definitely interested in collaborating on a series of patches to > implement various kinds of skip-based plans as seen in other RDBMSs if > others think it could be useful. I see skip-based DISTINCT as a good > place to start. (I suspect the semi-related skip scan plan (for the > case when your WHERE clause doesn't have a qual for the leading > column(s)) is much harder to plan and execute and I also suspect it's > less generally useful). There's an interesting thread about merge joins with a skip optimisation[1] ("leapfrog trie-join", though I haven't read the paper yet). That reminded me to go and rebase this experimental patch which is somewhat related. Here it is, now split into two parts: one patch to add an amskip operation, and another to consider using it to implement DISTINCT when it was already has an index only scan path on an index that supports skipping. As you can see it's still sketchy experiment-grade code, but it now has a first attempt at costing. postgres=# select count(a) from foo; ┌─────────┐ │ count │ ├─────────┤ │ 5000000 │ └─────────┘ (1 row) Time: 493.501 ms postgres=# select distinct a from foo; ┌───┐ │ a │ ├───┤ │ 0 │ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 │ │ 6 │ │ 7 │ │ 8 │ │ 9 │ └───┘ (10 rows) Time: 0.516 ms postgres=# explain select distinct a from foo; ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Index Only Scan for distinct values of leading 1 column(s) using foo_pkey on foo (cost=0.43..4.33 rows=10 width=4) │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (1 row) Time: 0.378 ms [1] https://www.postgresql.org/message-id/flat/CAFQtOwp4e1hU6gKi1R_nHgOLTzUJko8EcaC1FdAV0rQFo1ARzA%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Attachment
On Wed, Oct 12, 2016 at 4:19 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here it is, now split into two parts: one patch > to add an amskip operation, and another to consider using it to > implement DISTINCT when it was already has an index only scan path on > an index that supports skipping. Those patches add an amskip operation and then teach the planner and executor how to do this, given a table foo (a, b) with an index on (a) or (a, b): SELECT DISTINCT foo FROM a Index Only Scan for distinct values of (a) using foo_pkey on foo In just the right circumstances that is vastly faster than scanning every tuple and uniquifying. I was thinking that the next step in this experiment might be to introduce a kind of plan that can handle queries where the index's leading column is not in the WHERE clause, building on the above plan, and behaving conceptually a little bit like a Nested Loop, like so: SELECT * FROM foo WHERE b = 42 Index Skip Scan -> Index Only Scan for distinct values of (a) using foo_pkey on foo -> Index Only Scan using foo_pkeyon foo Index Cond: ((a = outer.a) AND (b = 42)) I suppose you could instead have a single node that knows how to perform the whole scan itself so that you don't finish up searching for each distinct value in both the inner and outer subplan, but it occurred to me that building the plan out of Lego bricks like this might generalise better to more complicated queries. I suppose it might also parallelise nicely with a Parallel Index Only Scan as the outer plan. Maybe something similar is possible for handling "GROUP BY a". Worth pursuing? Does amskip suck? Does anyone have better ideas, either for how to do the low level skip or the higher level Index Skip Scan, or perhaps a completely different way of looking at this? -- Thomas Munro http://www.enterprisedb.com
On 23 November 2016 at 21:19, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Worth pursuing? Does amskip suck? Does anyone have better ideas, > either for how to do the low level skip or the higher level Index Skip > Scan, or perhaps a completely different way of looking at this? I have no helpful suggestions with how to achieve it, but can I add a voice of encouragement: there have been a good few occasions in the past year (we moved from another db to PG) where the lack of skip scans has bitten us; in that case it was using MIN() and GROUP BY, where the grouping column was the first element in the compound index and the aggregate column was the second: in the Other DB that sort of query was extremely quick, not so much here. I was also idly pondering (in one of those moments of conceited self-delusion where I thought I might actually have enough spare time to try to work on it myself) whether it would be possible to implement that sort of skip with two indexes (so MIN(a) GROUP BY b with separate indexes on (a) and (b) rather than a single index (a,b)); I never got much further than idle musings though. Geoff
On Wed, Nov 23, 2016 at 4:19 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Oct 12, 2016 at 4:19 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Here it is, now split into two parts: one patch >> to add an amskip operation, and another to consider using it to >> implement DISTINCT when it was already has an index only scan path on >> an index that supports skipping. > > Those patches add an amskip operation and then teach the planner and > executor how to do this, given a table foo (a, b) with an index on (a) > or (a, b): > > SELECT DISTINCT foo FROM a > > Index Only Scan for distinct values of (a) using foo_pkey on foo Cool! > In just the right circumstances that is vastly faster than scanning > every tuple and uniquifying. I was thinking that the next step in > this experiment might be to introduce a kind of plan that can handle > queries where the index's leading column is not in the WHERE clause, > building on the above plan, and behaving conceptually a little bit > like a Nested Loop, like so: > > SELECT * FROM foo WHERE b = 42 > > Index Skip Scan > -> Index Only Scan for distinct values of (a) using foo_pkey on foo > -> Index Only Scan using foo_pkey on foo > Index Cond: ((a = outer.a) AND (b = 42)) Blech, that seems really ugly and probably inefficient. > I suppose you could instead have a single node that knows how to > perform the whole scan itself so that you don't finish up searching > for each distinct value in both the inner and outer subplan, but it > occurred to me that building the plan out of Lego bricks like this > might generalise better to more complicated queries. I suppose it > might also parallelise nicely with a Parallel Index Only Scan as the > outer plan. I think putting all the smarts in a single node is likely to be better, faster, and cleaner. This patch has gotten favorable comments from several people, but somebody really needs to REVIEW it. Oh, well, since I'm here... + if (ScanDirectionIsForward(dir)) + { + so->currPos.moreLeft = false; + so->currPos.moreRight = true; + } + else + { + so->currPos.moreLeft = true; + so->currPos.moreRight = false; + } The lack of comments makes it hard for me to understand what the motivation for this is, but I bet it's wrong. Suppose there's a cursor involved here and the user tries to back up. Instead of having a separate amskip operation, maybe there should be a flag attached to a scan indicating whether it should return only distinct results. Otherwise, you're allowing for the possibility that the same scan might sometimes skip and other times not skip, but then it seems hard for the scan to survive cursor operations. Anyway, why would that be useful? + if (ScanDirectionIsForward(dir)) + offnum = OffsetNumberPrev(offnum); + if (!_bt_readpage(scan, dir, offnum)) + { + if (!_bt_steppage(scan, dir)) + { + return false; + } + } As the TODO:TM comment at the top of the function sort of implies without directly coming out and saying so, this is ugly and suggests that you've picked the wrong API. If the idea I proposed above isn't appealing, I still think we need to get rid of this by some other means. + /* + * TODO:TM For now, we use _bt_search to search from the root; in theory + * we should be able to do a local traversal ie from the current page, but + * I don't know if it would actually be better in general. + */ I think that this depends a lot on the number of duplicates. If we end up merely fast-forwarding within the same page to find the next unique value, re-traversing from the root sucks. However, if the next distinct key *isn't* on the same page, then traversing from the root is a good bet. A typical btree is only a few levels deep (like 3) so once you don't find what you're looking for on the same page it's probably fairly sound to re-traverse from the root. Of course you lose at least a little bit in the case where you would have found the next distinct key within a page or two but in other cases you win big. I wonder if that would be a suitable heuristic: first check the current page, if all keys there are equal then retraverse from the root. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company