Thread: BUG #3979: SELECT DISTINCT slow even on indexed column
The following bug has been logged online: Bug reference: 3979 Logged by: David Lee Email address: david_lee@bigfix.com PostgreSQL version: 8.2.6 Operating system: Ubuntu Feisty Server Description: SELECT DISTINCT slow even on indexed column Details: \d x: Column | Type | Modifiers --------------+-----------------------------+----------- a | integer | not null b | integer | not null time | timestamp without time zone | not null remote_time | timestamp without time zone | not null ip | inet | not null The table has 20 million rows. The table "x" has an index on ("a", "b"). I first tried: SELECT DISTINCT a, b FROM x but it was so slow. I ran EXPLAIN and it showed that the path did not use the index, so I ran: SET enable_seqscan = off; and ran the query again. Although it used the index, the query was still very slow. Finally, I ran: SELECT a, b FROM x GROUP BY a, b; But it was still the same. Next I created an index on ("a") and ran the query: SELECT DISTINCT a FROM x but the same thing happened (first didn't use the index; after turning seq-scan off, was still slow; tried using GROUP BY, still slow). The columns "a" and "b" are NOT NULL and has 100 distinct values each. The indexes are all btree indexes.
On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote: > Finally, I ran: > SELECT a, b FROM x GROUP BY a, b; > > But it was still the same. > > Next I created an index on ("a") and ran the query: > SELECT DISTINCT a FROM x > > but the same thing happened (first didn't use the index; after turning > seq-scan off, was still slow; tried using GROUP BY, still slow). > > The columns "a" and "b" are NOT NULL and has 100 distinct values each. The > indexes are all btree indexes. If there are only 100 distinct values each, then that's only (at most) 10k distinct (a,b) pairs. To me it sounds like it would be most efficient to use a HashAggregate, which can only be used by the "GROUP BY" variant of the query you ran (DISTINCT can't use that plan). First, try to force a HashAggregate and see what the results are. If that is faster, the planner is not choosing the right plan. Try ANALYZE to update the statistics, and if that doesn't work, post EXPLAIN results. Also, this post is somewhat off-topic for -bugs, try posting to -general or -performance with this type of question. Regards, Jeff Davis
On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote: > I ran EXPLAIN and it showed that the path did not use the index, so I ran: Your expectation that this would use an index is unfortunately not correct. We need to check visibility on the table rows to do the query. We choose to do this by doing a sort and then a unique operation. That's the most efficient plan when there are potentially many unique values. In the case you mention it would be faster for us to skip through the index retrieving at most one row from each value in the index. We don't currently do that, but we could. However that plan would be restricted only to queries of the form SELECT DISTINCT column-list-of-index FROM table; so its probably not going to be optimised anytime soon. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com