From Jussi Pakkanen
Subject Re: BUG #4462: Adding COUNT to query causes massive slowdown
Msg-id 42d23b2e0810101255p59d5d1eyce79a0c917bd72ed@mail.gmail.com
In response to Re: BUG #4462: Adding COUNT to query causes massive slowdown  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #4462: Adding COUNT to query causes massive slowdown  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Thu, Oct 9, 2008 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Given that PostgreSQL does the scan even with the huge seqscan
>> penalty, I can think of only two different causes:
>> 1) some sort of a bug in the query analyzer
>> 2) SELECT COUNT(DISTINCT x) for some reason requires information that
>> is not available in the index.
>
> Try (3) COUNT(DISTINCT x) ... or any DISTINCT aggregate for that matter
> ... is implemented by a sort-and-uniq step inside the aggregate function
> itself.  You can't see it in the plan.

Does this mean that the sort-and-uniq will always lead to a full table
scan? It would seem so, because I could not force PostgreSQL to use
the index even with enable_seqscan set to off. I understand that in
some cases the table scan is faster, but it is very strange if the
query optimizer refuses to use the index no matter what.
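
For completeness, this is roughly what I tried when forcing the
planner's hand (the exact statements may have differed slightly):

    SET enable_seqscan = off;
    EXPLAIN SELECT COUNT(DISTINCT code) FROM log;
    -- the plan still shows a sequential scan on "log" feeding the
    -- aggregate, just with a much larger estimated cost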

A quick calculation says that the table scan needs to access 32
million elements (and sort them, and uniq them). An index scan needs
only 2 million (or 4 million I suppose, if you account for the higher
levels in the B-tree). That is an order of magnitude difference in
disk reads alone.

> I wouldn't actually think that this approach would be slower than an
> indexscan, btw, unless maybe the index were very nearly correlated with
> physical order --- but that would make the sort more efficient, too.

I have ordered the data table according to the "code" column by first
importing it to a temp table and then building the actual "log" table
with CREATE TABLE log AS SELECT ... ORDER BY code;.
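
A minimal sketch of that reload, in case it matters ("log_import"
and the index name are just placeholders, and the real column list
is longer):

    -- the dump was first COPYed into a temp table, called
    -- log_import here
    CREATE TABLE log AS
        SELECT * FROM log_import ORDER BY code;
    CREATE INDEX log_code_idx ON log (code);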

In this case the index is faster. A lot faster. Running SELECT
DISTINCT through the index and counting the resulting rows takes 4
minutes. Building a view of the distinct "code" values and counting
that also takes 4 minutes. But the aggregate scan takes 11 minutes.
Worst of all, COUNT(DISTINCT ...) is the way all the SQL books tell
you to write this kind of query.
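
To be concrete, these are roughly the three variants I timed (the
view name is just a placeholder, and the exact query shapes are from
memory):

    -- ~4 minutes: DISTINCT through the index, then count the rows
    SELECT COUNT(*) FROM (SELECT DISTINCT code FROM log) AS d;

    -- ~4 minutes: the same thing through a view
    CREATE VIEW distinct_codes AS SELECT DISTINCT code FROM log;
    SELECT COUNT(*) FROM distinct_codes;

    -- ~11 minutes: the textbook form, where the sort-and-uniq
    -- happens inside the aggregate
    SELECT COUNT(DISTINCT code) FROM log;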

> Perhaps you need to raise work_mem enough to allow the sort to take
> place without spilling to disk?  (Turning on trace_sort should let you
> see what's happening there.)

But isn't this just hiding the root cause? Having working sets larger
than available memory is not that uncommon.
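
For reference, this is how I read your suggestion (the work_mem
value below is only a placeholder):

    SET work_mem = '512MB';
    SET trace_sort = on;
    -- so the sort traces show up in the psql session
    SET client_min_messages = log;
    SELECT COUNT(DISTINCT code) FROM log;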
