Re: Index Skip Scan - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Index Skip Scan |
Date | |
Msg-id | CAKJS1f86FgODuUnHiQ25RKeuES4qTqeNxm1QbqJWrBoZxVGLiQ@mail.gmail.com Whole thread Raw |
In response to | Re: Index Skip Scan (Jesper Pedersen <jesper.pedersen@redhat.com>) |
Responses |
Re: Index Skip Scan
Re: Index Skip Scan |
List | pgsql-hackers |
On Fri, 14 Jun 2019 at 04:32, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Here is a rebased version. Hi Jesper, I read over this thread a few weeks ago while travelling back from PGCon. (I wish I'd read it on the outward trip instead since it would have been good to talk about it in person.) First off. I think this is a pretty great feature. It certainly seems worthwhile working on it. I've looked over the patch just to get a feel for how the planner part works and I have a few ideas to share. The code in create_distinct_paths() I think should work a different way. I think it would be much better to add a new field to Path and allow a path to know what keys it is distinct for. This sort of goes back to an idea I thought about when developing unique joins (9c7f5229ad) about an easier way to detect fields that a relation is unique for. I've been calling these "UniqueKeys" in a few emails [1]. The idea was to tag these onto RelOptInfo to mention which columns or exprs a relation is unique by so that we didn't continuously need to look at unique indexes in all the places that call relation_has_unique_index_for(). The idea there was that unique joins would know when a join was unable to duplicate rows. If the outer side of a join didn't duplicate the inner side, then the join RelOptInfo could keep the UniqueKeys from the inner rel, and vice-versa. If both didn't duplicate then the join rel would obtain the UniqueKeys from both sides of the join. The idea here is that this would be a better way to detect unique joins, and also when it came to the grouping planner we'd also know if the distinct or group by should be a no-op. DISTINCT could be skipped, and GROUP BY could do a group aggregate without any sort. I think these UniqueKeys ties into this work, perhaps not adding UniqueKeys to RelOptInfo, but just to Path so that we create paths that have UniqueKeys during create_index_paths() based on some uniquekeys that are stored in PlannerInfo, similar to how we create index paths in build_index_paths() by checking if the index has_useful_pathkeys(). Doing it this way would open up more opportunities to use skip scans. For example, semi-joins and anti-joins could make use of them if the uniquekeys covered the entire join condition. With this idea, the code you've added in create_distinct_paths() can just search for the cheapest path that has the correct uniquekeys, or if none exist then just do the normal sort->unique or hash agg. I'm not entirely certain how we'd instruct a semi/anti joined relation to build such paths, but that seems like a problem that could be dealt with when someone does the work to allow skip scans to be used for those. Also, I'm not entirely sure that these UniqueKeys should make use of PathKey since there's no real need to know about pk_opfamily, pk_strategy, pk_nulls_first as those all just describe how the keys are ordered. We just need to know if they're distinct or not. All that's left after removing those fields is pk_eclass, so could UniqueKeys just be a list of EquivalenceClass? or perhaps even a Bitmapset with indexes into PlannerInfo->ec_classes (that might be premature for not since we've not yet got https://commitfest.postgresql.org/23/1984/ or https://commitfest.postgresql.org/23/2019/ ) However, if we did use PathKey, that does allow us to quickly check if the UniqueKeys are contained within the PathKeys, since pathkeys are canonical which allows us just to compare their memory address to know if two are equal, however, if you're storing eclasses we could probably get the same just by comparing the address of the eclass to the pathkey's pk_eclass. Otherwise, I think how you're making use of paths in create_distinct_paths() and create_skipscan_unique_path() kind of contradicts how they're meant to be used. I also agree with James that this should not be limited to Index Only Scans. From testing the patch, the following seems pretty strange to me: # create table abc (a int, b int, c int); CREATE TABLE # insert into abc select a,b,1 from generate_Series(1,1000) a, generate_Series(1,1000) b; INSERT 0 1000000 # create index on abc(a,b); CREATE INDEX # explain analyze select distinct on (a) a,b from abc order by a,b; -- this is fast. QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200 width=8) (actual time=0.260..20.518 rows=1000 loops=1) Scan mode: Skip scan Heap Fetches: 1000 Planning Time: 5.616 ms Execution Time: 21.791 ms (5 rows) # explain analyze select distinct on (a) a,b,c from abc order by a,b; -- Add one more column and it's slow. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Unique (cost=0.42..50104.43 rows=200 width=12) (actual time=1.201..555.280 rows=1000 loops=1) -> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43 rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000 loops=1) Planning Time: 0.102 ms Execution Time: 555.407 ms (4 rows) [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: