Thread: BUG #3979: SELECT DISTINCT slow even on indexed column

BUG #3979: SELECT DISTINCT slow even on indexed column

From
"David Lee"
Date:
The following bug has been logged online:

Bug reference:      3979
Logged by:          David Lee
Email address:      david_lee@bigfix.com
PostgreSQL version: 8.2.6
Operating system:   Ubuntu Feisty Server
Description:        SELECT DISTINCT slow even on indexed column
Details:

\d x:

   Column     |            Type             | Modifiers
--------------+-----------------------------+-----------
 a            | integer                     | not null
 b            | integer                     | not null
 time         | timestamp without time zone | not null
 remote_time  | timestamp without time zone | not null
 ip           | inet                        | not null

The table has 20 million rows.

The table "x" has an index on ("a", "b").

I first tried:
 SELECT DISTINCT a, b FROM x

but it was so slow.

I ran EXPLAIN and it showed that the path did not use the index, so I ran:

 SET enable_seqscan = off;

and ran the query again.

Although it used the index, the query was still very slow.

Finally, I ran:
 SELECT a, b FROM x GROUP BY a, b;

But it was still the same.

Next I created an index on ("a") and ran the query:
 SELECT DISTINCT a FROM x

but the same thing happened (first didn't use the index; after turning
seq-scan off, was still slow; tried using GROUP BY, still slow).

The columns "a" and "b" are NOT NULL and has 100 distinct values each. The
indexes are all btree indexes.

Re: BUG #3979: SELECT DISTINCT slow even on indexed column

From
Jeff Davis
Date:
On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote:
> Finally, I ran:
>  SELECT a, b FROM x GROUP BY a, b;
>
> But it was still the same.
>
> Next I created an index on ("a") and ran the query:
>  SELECT DISTINCT a FROM x
>
> but the same thing happened (first didn't use the index; after turning
> seq-scan off, was still slow; tried using GROUP BY, still slow).
>
> The columns "a" and "b" are NOT NULL and has 100 distinct values each. The
> indexes are all btree indexes.

If there are only 100 distinct values each, then that's only (at most)
10k distinct (a,b) pairs.

To me it sounds like it would be most efficient to use a HashAggregate,
which can only be used by the "GROUP BY" variant of the query you ran
(DISTINCT can't use that plan).

First, try to force a HashAggregate and see what the results are. If
that is faster, the planner is not choosing the right plan. Try ANALYZE
to update the statistics, and if that doesn't work, post EXPLAIN
results.

Also, this post is somewhat off-topic for -bugs, try posting to -general
or -performance with this type of question.

Regards,
    Jeff Davis

Re: BUG #3979: SELECT DISTINCT slow even on indexed column

From
Simon Riggs
Date:
On Thu, 2008-02-21 at 23:34 +0000, David Lee wrote:

> I ran EXPLAIN and it showed that the path did not use the index, so I ran:

Your expectation that this would use an index is unfortunately not
correct.

We need to check visibility on the table rows to do the query. We choose
to do this by doing a sort and then a unique operation. That's the most
efficient plan when there are potentially many unique values.

In the case you mention it would be faster for us to skip through the
index retrieving at most one row from each value in the index. We don't
currently do that, but we could. However that plan would be restricted
only to queries of the form
  SELECT DISTINCT column-list-of-index FROM table;

so its probably not going to be optimised anytime soon.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com