Thread: How to optimize select count(*)..group by?

How to optimize select count(*)..group by?

From: Bryan Field-Elliot
We have this simple query:

select status, count(*) from customer group by status;

There is already a btree index on status, but, the customer table is huge, and this query must be executed very frequently... an "explain" on this query shows that it is quite costly (and we notice it runs slowly)...

Can someone recommend the best technique to optimize this? We can create new indices, we can re-write this query.. But we'd rather not add new tables or columns if possible (not just to solve this problem).

Thank you,

Bryan

Re: How to optimize select count(*)..group by?

From: David Fetter
On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an
> "explain" on this query shows that it is quite costly (and we notice
> it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

You're pretty much stuck with either writing triggers that maintain a
cache table or living with the performance you have now.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

Re: How to optimize select count(*)..group by?

From: Richard Huxton
Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

There's no other accurate solution. While PG's MVCC system means you
need less locking, it makes it tricky to determine whether a row is
visible without actually checking on disk.

Simplest solution - have a "customer_status_summary" table and add a
trigger to the "customer" table to keep it up to date. That way, you'll
have extremely cheap counts. Make sure you understand what locking
you'll need with your particular solution.
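To make the suggestion concrete, here is a minimal sketch of the summary-table approach. The "customer" table and its "status" column come from the thread; the summary table, function, and trigger names are made up for illustration, and it assumes every status value has already been seeded into the summary table (a production version would need to handle brand-new statuses too):

```sql
CREATE TABLE customer_status_summary (
    status text PRIMARY KEY,
    cnt    bigint NOT NULL
);

-- seed it once from the current data
INSERT INTO customer_status_summary (status, cnt)
SELECT status, count(*) FROM customer GROUP BY status;

CREATE OR REPLACE FUNCTION maintain_status_summary() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        UPDATE customer_status_summary
           SET cnt = cnt - 1
         WHERE status = OLD.status;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE customer_status_summary
           SET cnt = cnt + 1
         WHERE status = NEW.status;
    END IF;
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_status_count
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE PROCEDURE maintain_status_summary();
```

The expensive query then becomes `SELECT status, cnt FROM customer_status_summary;`. Note the locking cost Richard mentions: every write to "customer" now takes a row lock on the matching summary row, so concurrent transactions touching the same status will serialize on it.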

--
   Richard Huxton
   Archonet Ltd

Re: How to optimize select count(*)..group by?

From: Bricklen Anderson
David Fetter wrote:
> On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
>
>>We have this simple query:
>>
>>select status, count(*) from customer group by status;
>>
>>There is already a btree index on status, but, the customer table is
>>huge, and this query must be executed very frequently... an
>>"explain" on this query shows that it is quite costly (and we notice
>>it runs slowly)...
>>
>>Can someone recommend the best technique to optimize this? We can
>>create new indices, we can re-write this query.. But we'd rather not
>>add new tables or columns if possible (not just to solve this
>>problem).
>
>
> You're pretty much stuck with either writing triggers that modify a
> cache table or having your performance the way it is now.
>
> Cheers,
> D
How about the new bitmap index? I wonder if that would give better
performance for this type of query.


Re: How to optimize select count(*)..group by?

From: Greg Stark
Bryan Field-Elliot <bryan_lists@netmeme.org> writes:

> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain" on
> this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can create
> new indices, we can re-write this query.. But we'd rather not add new
> tables or columns if possible (not just to solve this problem).

Without creating a new table you have three options, none of which will be
very fast:

1) If the entire table fits in ram then you could try setting random_page_cost
   close enough to 1 to cause the index to be used.

2) If it doesn't, but there are a reasonably small number of distinct values
   of status, you would hope to see a sequential scan and a hash aggregate
   being used. I would expect this to be what you see with the default
   configuration. If not, you might have to increase work_mem (or sort_mem,
   depending on your version).

3) If you have a large table with a large set of status values then Postgres
   may be compelled to sort the entire table. In that case you should
   experiment with work_mem/sort_mem to find the largest value you can use
   without inducing swapping. You could also experiment with putting
   pgsql_tmp on separate spindles.
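The knobs in the options above can be tried at the session level before touching the server config. A sketch, with placeholder values to experiment with rather than recommendations:

```sql
SET random_page_cost = 1.5;   -- option 1: make index access look cheap
                              -- (only sensible if the table is cached in RAM)
SET work_mem = 65536;         -- options 2 and 3: 64MB, value is in kB;
                              -- the setting is called sort_mem before 8.0
EXPLAIN ANALYZE
SELECT status, count(*) FROM customer GROUP BY status;
```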

None of these will be fast enough for OLTP access like a web page. You would
be best off performing this query periodically and stuffing the results in a
table for quick access.
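One way to do the periodic "stuff the results in a table" approach Greg describes (the snapshot table name here is illustrative):

```sql
-- build the snapshot once
CREATE TABLE customer_status_counts AS
SELECT status, count(*) AS cnt FROM customer GROUP BY status;

-- the periodic refresh, e.g. run from cron
BEGIN;
DELETE FROM customer_status_counts;
INSERT INTO customer_status_counts
SELECT status, count(*) FROM customer GROUP BY status;
COMMIT;

-- the web page then reads the precomputed counts cheaply
SELECT status, cnt FROM customer_status_counts;
```

The counts are only as fresh as the last refresh, but the read is a trivial scan of a few rows.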



--
greg

Re: How to optimize select count(*)..group by?

From: Scott Marlowe
On Thu, 2005-07-28 at 11:19, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> There is already a btree index on status, but, the customer table is
> huge, and this query must be executed very frequently... an "explain"
> on this query shows that it is quite costly (and we notice it runs
> slowly)...
>
> Can someone recommend the best technique to optimize this? We can
> create new indices, we can re-write this query.. But we'd rather not
> add new tables or columns if possible (not just to solve this
> problem).

With a query of the form:

select field, count([field|*]) from table
where somefield = somecondition
group by field;

the query planner is going to have to scan every single row matched by that
where clause.  There's no shortcut, because the visibility rules of MVCC
mean you have to look at every tuple IN THE TABLE, not just in the index
(it's the way PostgreSQL is built, and it isn't likely to change soon,
because putting the visibility information in indexes is expensive, and
would result in VERY slow updates and very large indexes).

So, the best optimization is to use a selective where clause.

If you run the query with a where clause of something like:

where processdate between '01 july 2005' and '07 july 2005'

then you should get better performance.
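A sketch of what that looks like end to end; "processdate" is Scott's illustrative column name, not necessarily part of the real schema. With an index on the filtered column, the planner only has to visit the rows in the date range instead of the whole table:

```sql
CREATE INDEX customer_processdate_idx ON customer (processdate);

SELECT status, count(*)
  FROM customer
 WHERE processdate BETWEEN '2005-07-01' AND '2005-07-07'
 GROUP BY status;
```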