Re: Slow count(*) again... - Mailing list pgsql-performance

From Craig Ringer
Subject Re: Slow count(*) again...
Date
Msg-id 4CB1638F.2080405@postnewspapers.com.au
In response to Re: Slow count(*) again...  (Neil Whelchel <neil.whelchel@gmail.com>)
Responses Re: Slow count(*) again...  (Neil Whelchel <neil.whelchel@gmail.com>)
List pgsql-performance
On 10/10/2010 11:02 AM, Neil Whelchel wrote:
> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
> wrote:
>>> I know that there haven been many discussions on the slowness of count(*)
>>> even when an index is involved because the visibility of the rows has to
>>> be checked. In the past I have seen many suggestions about using
>>> triggers and tables to keep track of counts and while this works fine in
>>> a situation where you know what the report is going to be ahead of time,
>>> this is simply not an option when an unknown WHERE clause is to be used
>>> (dynamically generated). I ran into a fine example of this when I was
>>> searching this mailing list, "Searching in 856,646 pages took 13.48202
>>> seconds. Site search powered by PostgreSQL 8.3." Obviously at some point
>>> count(*) came into play here because the site made a list of pages (1 2
>>> 3 4 5 6>  next). I very commonly make a list of pages from search
>>> results, and the biggest time killer here is the count(*) portion, even
>>> worse yet, I sometimes have to hit the database with two SELECT
>>> statements, one with OFFSET and LIMIT to get the page of results I need
>>> and another to get the amount of total rows so I can estimate how many
>>> pages of results are available. The point I am driving at here is that
>>> since building a list of pages of results is such a common thing to do,
>>> there need to be some specific high speed ways to do this in one query.
>>> Maybe an estimate(*) that works like count but gives an answer from the
>>> index without checking visibility? I am sure that this would be good
>>> enough to make a page list, it is really no big deal if it errors on the
>>> positive side, maybe the list of pages has an extra page off the end. I
>>> can live with that. What I can't live with is taking 13 seconds to get a
>>> page of results from 850,000 rows in a table.
>>
>> 99% of the time in these situations you don't need an exact measure,
>> and assuming ANALYZE has run recently, "select reltuples from pg_class"
>> for a given table is more than close enough.  I'm sure wrapping that in
>> a simple estimated_rows() function would be easy enough to do.
>
> This is a very good approach and it works very well when you are counting
> the entire table, but when you have no control over the WHERE clause, it
> doesn't help, e.g. when someone puts a word to look for into a web form.
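For the whole-table case Scott describes, a sketch of such a wrapper (the function name is his suggestion; the table name is just an example, and the result is only as fresh as the last ANALYZE/VACUUM):

```sql
-- Whole-table row-count estimate from the planner statistics.
-- reltuples is maintained by ANALYZE/VACUUM, so it can lag reality.
CREATE OR REPLACE FUNCTION estimated_rows(tbl regclass)
RETURNS bigint AS $$
    SELECT reltuples::bigint FROM pg_class WHERE oid = $1;
$$ LANGUAGE sql STABLE;

SELECT estimated_rows('my_table');  -- 'my_table' is a placeholder
```

As Neil notes below, this only covers counting the entire table.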

For that sort of thing, there isn't much that'll help you except
visibility-aware indexes, covering indexes, etc., if/when they're
implemented. Even then, they'd only help with a simple index-driven
query that never needs to hit the table to recheck any test
conditions.

I guess there could be *some* way to expose the query planner's cost
estimates in a form useful for result-count estimation ... but given
how coarse its statistics are and how wildly off the estimates can be,
I rather doubt it. The planner is really intended for query planning
decisions; it mostly cares about orders of magnitude ("0, 1, or more
than that") and seems to consider 30% here or there pretty
insignificant most of the time.
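If you want to experiment with that anyway, the planner's row estimate for an arbitrary query can be scraped out of EXPLAIN output. A rough sketch (the table and column in the sample call are hypothetical, and the estimate is only as good as the statistics):

```sql
-- Pull the top-level row estimate from EXPLAIN output for an
-- arbitrary query string. No rows are actually fetched, but the
-- number is a planner guess, not a count.
CREATE OR REPLACE FUNCTION count_estimate(query text)
RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- The first plan line looks like: ... (cost=... rows=NNN width=...)
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT count_estimate('SELECT * FROM my_table WHERE word = ''foo''');
```

Don't expect it to be anywhere near exact for selective WHERE clauses, for the reasons above.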

> It's bad enough that count(*) is slow, then you have to do it all over
> again to get the results you need! I have not dug into this much yet, but
> would it be possible to return the number of rows that a WHERE clause
> would match if the LIMIT and OFFSET were not applied? When a normal query
> is executed, the server returns the number of rows alongside the actual
> row data. Would it be a big deal to modify this to also report the row
> count before LIMIT and OFFSET are applied?

It'd force the server to fully execute the query. Then again, it sounds
like you're doing that anyway.
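You can get close to that today with a window function, available from PostgreSQL 8.4 on (table, column, and paging values below are placeholders):

```sql
-- One page of results plus the pre-LIMIT match count, in one query.
-- The server still has to find every qualifying row to compute the
-- count, so this doesn't avoid the work -- it just avoids the second
-- round trip.
SELECT *,
       count(*) OVER () AS total_rows   -- count before LIMIT/OFFSET
FROM   my_table
WHERE  word = 'foo'                     -- hypothetical search filter
ORDER  BY id
LIMIT  20 OFFSET 40;
```

Every row of the page carries the same total_rows value, which is the count you'd otherwise fetch with a separate count(*).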

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
