Re: BUG #4462: Adding COUNT to query causes massive slowdown - Mailing list pgsql-bugs

From Jussi Pakkanen
Subject Re: BUG #4462: Adding COUNT to query causes massive slowdown
Date
Msg-id 42d23b2e0810150402r529fbafev70f5b13ea4e333c8@mail.gmail.com
Whole thread Raw
In response to Re: BUG #4462: Adding COUNT to query causes massive slowdown  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Sat, Oct 11, 2008 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The sort-and-uniq doesn't care where the data came from.  But if we have
> to feed it all rows of the table, as we do here, we're going to use a
> seqscan.  An indexscan can never beat a seqscan for retrieving the whole
> table.

Apologies if you get this more than once, but it seems that my first
mail got lost in the moderation queue somewhere.

Anyway, in this case we should not need to scan the entire table.
There are 32 million rows and 2 million unique 'code's. Thus we should
be able to just walk the index (for a list of potential unique
elements) and then check one of the roughly 15 rows pointed to by the
index leaf nodes to get the visibility info.

To prove that PostgreSQL does not need to do a full table scan and
that it is faster to use the index, I simplified the problem to the
following test case.

From an SQL point of view, the following two queries produce an
identical result:

SELECT COUNT(DISTINCT code) FROM log;
SELECT COUNT(*) FROM (SELECT DISTINCT code FROM log) AS distincttable;

(I don't know about ACID requirements all that much, but they should
be identical for both AFAICS.)

Therefore, an optimal query planner would reduce them to exact same
elemental operations. PostgreSQL does NOT do this. It does a full
table scan for the first one and an index scan for the second one. The
table scan is over two times slower.

If a full table scan is absolutely necessary, then the bug is that the
second query uses the index (and thus may give incorrect results).

If an index scan is sufficient, the bug is that the query planner does
not even consider the index for the first query, which leads to a big
performance penalty on an extremely common operation.

pgsql-bugs by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: BUG #4465: GROUP BY is not to SQL standard
Next
From: "Tony Marston"
Date:
Subject: Re: BUG #4465: GROUP BY is not to SQL standard