Thread: Loose Index Scans by Planner?
Maybe I should post this in Hackers instead, but I figured I'd start here to avoid cluttering up that list. So, we know we have a way of doing a loose index scan with CTEs: http://wiki.postgresql.org/wiki/Loose_indexscan But that got me wondering. The planner knows from pg_stats that col1 could have low cardinality. If that's the case, and a WHERE clause uses a two column index, and col2 is specified, why can't it walk each individual bucket in the two-column index, and use col2? So I forced such a beast with a CTE: WITH RECURSIVE t AS ( SELECT min(col1) AS col1 FROM tablename UNION ALL SELECT (SELECT min(col1) FROM tablename WHERE col1 > t.col1) FROM t WHERE t.col1 IS NOT NULL ) SELECT p.* FROM t JOIN tablename p USING (col1) where p.col2 = 12345 I ask, because while the long-term fix would be to re-order the index to (col2, col1), this seems like a situation the planner could easily detect and compensate for. In our particular example, execution time went from 160ms to 2ms with the CTE rewrite. This is a contrived example, but it seems like loose index scans would be useful in other ways. Heck, this: SELECT DISTINCT col1 FROM tablename; Has terrible performance because it always seems to revert to a sequence scan, but it's something people do *all the time*. I can't reasonably expect all of my devs to switch to that admittedly gross CTE to get a faster effect, so I'm just thinking out loud. Until PG puts in something to fix this, I plan on writing a stored procedure that writes a dynamic CTE and returns a corresponding result set. It's not ideal, but it would solve our particular itch. Really, this should be possible with any indexed column, so I might abstract it. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
Shaun Thomas <sthomas@optionshouse.com> wrote: > So, we know we have a way of doing a loose index scan with CTEs: > > http://wiki.postgresql.org/wiki/Loose_indexscan I tried this on a table in production with 23 million rows for a column with 45 distinct values which is the high-order column of a four-column index. This ran in 445 ms first time and 2 ms on the second and subsequent tries. The equivalent SELECT DISTINCT ran in 30 seconds first time, and got down to 11.5 seconds after a few runs. So roughly two orders of magnitude faster with a cold cache and three orders of magnitude faster with a warm cache. That sure would be a nice optimization to have in the planner. > But that got me wondering. The planner knows from pg_stats that > col1 could have low cardinality. If that's the case, and a WHERE > clause uses a two column index, and col2 is specified, why can't > it walk each individual bucket in the two-column index, and use > col2? So I forced such a beast with a CTE: > > WITH RECURSIVE t AS ( > SELECT min(col1) AS col1 > FROM tablename > UNION ALL > SELECT (SELECT min(col1) > FROM tablename > WHERE col1 > t.col1) > FROM t > WHERE t.col1 IS NOT NULL > ) > SELECT p.* > FROM t > JOIN tablename p USING (col1) > where p.col2 = 12345 > > I ask, because while the long-term fix would be to re-order the > index to (col2, col1), this seems like a situation the planner > could easily detect and compensate for. In our particular example, > execution time went from 160ms to 2ms with the CTE rewrite. Well, that'd be the icing on the cake. I'd be overjoyed to get the cake. :-) -Kevin
On 08/24/2012 02:40 PM, Kevin Grittner wrote: > Well, that'd be the icing on the cake. I'd be overjoyed to get the > cake. :-) Yes indeed. The "cake" would fix the DISTINCT case, which I see way more often in the wild than my index column-skip. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On Fri, Aug 24, 2012 at 9:20 AM, Shaun Thomas <sthomas@optionshouse.com> wrote: > Maybe I should post this in Hackers instead, but I figured I'd start here to > avoid cluttering up that list. > > So, we know we have a way of doing a loose index scan with CTEs: > > http://wiki.postgresql.org/wiki/Loose_indexscan > > But that got me wondering. The planner knows from pg_stats that col1 could > have low cardinality. If that's the case, and a WHERE clause uses a two > column index, and col2 is specified, why can't it walk each individual > bucket in the two-column index, and use col2? So I forced such a beast with > a CTE: > > WITH RECURSIVE t AS ( > SELECT min(col1) AS col1 > FROM tablename > UNION ALL > SELECT (SELECT min(col1) > FROM tablename > WHERE col1 > t.col1) > FROM t > WHERE t.col1 IS NOT NULL > ) > SELECT p.* > FROM t > JOIN tablename p USING (col1) > where p.col2 = 12345 That is awesome. I had never though of trying to do it that way. > I ask, because while the long-term fix would be to re-order the index to > (col2, col1), Not always. The case for having (col1,col2) might be very compelling. And having to maintain both orderings when just maintaining one would be "good enough" would kind of suck. Having the planner do the best it can given the index it has is a good thing. I would also note that having this feature (called "skip scan" in some other products) would mimic what happens when you need to do a query specifying col2 but not col1 on a table family which is list partitioned on col1. Getting some of the benefits of partitioning without having to actually do the partitioning would be a good thing. > this seems like a situation the planner could easily detect > and compensate for. Yes, it is just a Small Matter Of Programming :) And one I've wanted for a while. If only someone else would offer to do it for me.... Cheers, Jeff
> Not always. The case for having (col1,col2) might be very > compelling. True. But in our case, the table has like 8M rows, so and col1 is some kind of job identifier, so it's evenly distributed.Col2 on the other hand is a customer id, so it has much higher cardinality. Previous DBA missed it during creation,and it was never loud enough in the logs for me to notice it until recently. Looks like I need to do a column-orderingaudit. :) > If only someone else would offer to do it for me.... Don't look at me. My C is rustier than a 50-year-old bike chain. :) ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email