Thread: select distinct uses index scan vs full table scan

select distinct uses index scan vs full table scan

From: Jon Nelson
I've got a 5GB table with about 12 million rows.
Recently, I had to select the distinct values from just one column.
The planner chose an index scan. The query took almost an hour.
When I forced index scan off, the query took 90 seconds (full table scan).
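
(In case it helps, the "forcing" was along these lines; my_table and
my_col are placeholders, not the real names:)

    -- disable index scans for this session only, then re-run the query
    SET enable_indexscan = off;
    SELECT DISTINCT my_col FROM my_table;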

The planner estimated 70,000 unique values when, in fact, there are 12
million (the values in this column are *almost* but not quite unique).
What's more, despite bumping the statistics on that column up to 1000
and re-analyzing, the planner now thinks that there are 300,000 unique
values.
How can I tell the planner that a given column is much more unique
than, apparently, it thinks it is?
The column type is INET.
This is on PG 8.4.10 on Linux x86_64, with commit
81f4e6cd27d538bc27e9714a9173e4df353a02e5 applied.
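
(The "bumping the statistics up to 1000" step was roughly the following,
again with placeholder names:)

    -- raise the per-column statistics target, then refresh the stats
    ALTER TABLE my_table ALTER COLUMN my_col SET STATISTICS 1000;
    ANALYZE my_table;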

--
Jon

Re: select distinct uses index scan vs full table scan

From: Tom Lane
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> I've got a 5GB table with about 12 million rows.
> Recently, I had to select the distinct values from just one column.
> The planner chose an index scan. The query took almost an hour.
> When I forced index scan off, the query took 90 seconds (full table scan).

Usually, we hear complaints about the opposite.  Are you using
nondefault cost settings?

> The planner estimated 70,000 unique values when, in fact, there are 12
> million (the values in this column are *almost* but not quite unique).
> What's more, despite bumping the statistics on that column up to 1000
> and re-analyzing, the planner now thinks that there are 300,000 unique
> values.

Accurate ndistinct estimates are hard, but that wouldn't have much of
anything to do with this particular choice, AFAICS.

> How can I tell the planner that a given column is much more unique
> than, apparently, it thinks it is?

9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.
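
(A sketch of that syntax, with placeholder names; the override is only
picked up at the next ANALYZE:)

    -- requires 9.0 or later; a negative value is a fraction of the row
    -- count, so -1 means "treat every value as distinct"
    ALTER TABLE my_table ALTER COLUMN my_col SET (n_distinct = -1);
    ANALYZE my_table;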

            regards, tom lane

Re: select distinct uses index scan vs full table scan

From: Jon Nelson
On Tue, Dec 13, 2011 at 1:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> I've got a 5GB table with about 12 million rows.
>> Recently, I had to select the distinct values from just one column.
>> The planner chose an index scan. The query took almost an hour.
>> When I forced index scan off, the query took 90 seconds (full table scan).
>
> Usually, we hear complaints about the opposite.  Are you using
> nondefault cost settings?

The cost settings had not been changed until a few minutes ago, when
your response prompted me to try a few things.

I ended up changing random_page_cost to 16.0 (from 4.0), partly
because the H/W RAID I'm using is awfully bad at random I/O. I'll
experiment and keep tabs on performance to see whether this has a
negative effect on other queries.
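
(Concretely, the change amounts to something like this — per session for
testing, or in postgresql.conf followed by a reload to make it stick:)

    -- try it out in the current session first
    SET random_page_cost = 16.0;

    -- then, to make it persistent, in postgresql.conf:
    --   random_page_cost = 16.0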

>> The planner estimated 70,000 unique values when, in fact, there are 12
>> million (the values in this column are *almost* but not quite unique).
>> What's more, despite bumping the statistics on that column up to 1000
>> and re-analyzing, the planner now thinks that there are 300,000 unique
>> values.
>
> Accurate ndistinct estimates are hard, but that wouldn't have much of
> anything to do with this particular choice, AFAICS.
>
>> How can I tell the planner that a given column is much more unique
>> than, apparently, it thinks it is?
>
> 9.0 and up have ALTER TABLE ... ALTER COLUMN ... SET n_distinct.

D'oh!  I'm on 8.4.10+patches.
This may provide the necessary push to upgrade.

--
Jon