Re: DISTINCT with btree skip scan - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: DISTINCT with btree skip scan
Date
Msg-id CAEepm=2r5zqNfNvJ6fRoPX9K9nqtPqNUTKG_OucNeBMnaaB9gQ@mail.gmail.com
Whole thread Raw
In response to Re: DISTINCT with btree skip scan  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: DISTINCT with btree skip scan
Re: DISTINCT with btree skip scan
List pgsql-hackers
On Wed, Oct 12, 2016 at 4:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here it is, now split into two parts: one patch
> to add an amskip operation, and another to consider using it to
> implement DISTINCT when it was already has an index only scan path on
> an index that supports skipping.

Those patches add an amskip operation and then teach the planner and
executor how to do this, given a table foo (a, b) with an index on (a)
or (a, b):
 SELECT DISTINCT foo FROM a
 Index Only Scan for distinct values of (a) using foo_pkey on foo

In just the right circumstances that is vastly faster than scanning
every tuple and uniquifying.  I was thinking that the next step in
this experiment might be to introduce a kind of plan that can handle
queries where the index's leading column is not in the WHERE clause,
building on the above plan, and behaving conceptually a little bit
like a Nested Loop, like so:
 SELECT * FROM foo WHERE b = 42
 Index Skip Scan   ->  Index Only Scan for distinct values of (a) using foo_pkey on foo   ->  Index Only Scan using
foo_pkeyon foo         Index Cond: ((a = outer.a) AND (b = 42))
 

I suppose you could instead have a single node that knows how to
perform the whole scan itself so that you don't finish up searching
for each distinct value in both the inner and outer subplan, but it
occurred to me that building the plan out of Lego bricks like this
might generalise better to more complicated queries.  I suppose it
might also parallelise nicely with a Parallel Index Only Scan as the
outer plan.

Maybe something similar is possible for handling "GROUP BY a".

Worth pursuing?  Does amskip suck?  Does anyone have better ideas,
either for how to do the low level skip or the higher level Index Skip
Scan, or perhaps a completely different way of looking at this?

-- 
Thomas Munro
http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "Maeldron T."
Date:
Subject: Re: Google Cloud Compute + FreeBSD + PostgreSQL = timecounter issue
Next
From: Tom Lane
Date:
Subject: Re: Calculation of param_source_rels in add_paths_to_joinrel