Thread: How to optimize select count(*)..group by?
We have this simple query:
select status, count(*) from customer group by status;
There is already a btree index on status, but the customer table is huge and this query must be executed very frequently... an "explain" on this query shows that it is quite costly (and we notice it runs slowly)...
Can someone recommend the best technique to optimize this? We can create new indices or re-write this query, but we'd rather not add new tables or columns if possible (at least not just to solve this problem).
Thank you,
Bryan
On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> [...]

You're pretty much stuck with either writing triggers that modify a
cache table or having your performance the way it is now.

Cheers,
D
--
David Fetter  david@fetter.org  http://fetter.org/
phone: +1 510 893 6100  mobile: +1 415 235 3778

Remember to vote!
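A minimal sketch of the cache table David describes, in plain SQL. The
table and column names are assumptions (the table name anticipates the
"customer_status_summary" Richard suggests in the next reply):

    CREATE TABLE customer_status_summary (
        status text PRIMARY KEY,        -- assumes status is text
        cnt    bigint NOT NULL DEFAULT 0
    );

    -- seed it once from the real table
    INSERT INTO customer_status_summary (status, cnt)
        SELECT status, count(*) FROM customer GROUP BY status;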
Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> [...]

There's no other accurate solution. While PG's MVCC system means you
need less locking, it makes it tricky to determine whether a row is
visible without actually checking on disk.

Simplest solution: have a "customer_status_summary" table and add a
trigger to the "customer" table to keep it up to date. That way you'll
have extremely cheap counts. Make sure you understand what locking
you'll need with your particular solution.

--
Richard Huxton
Archonet Ltd
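A sketch of the trigger Richard suggests, keeping the
customer_status_summary table from the previous sketch current. The
PL/pgSQL below is an untested illustration; note that concurrent
transactions touching the same status value will serialize on the
summary row's lock, which is the locking cost he warns about:

    -- assumes every status value is already seeded into the summary
    -- table; an UPDATE here matching no row would silently lose counts
    CREATE FUNCTION customer_status_summary_trig() RETURNS trigger AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE customer_status_summary SET cnt = cnt + 1
             WHERE status = NEW.status;
        ELSIF TG_OP = ''DELETE'' THEN
            UPDATE customer_status_summary SET cnt = cnt - 1
             WHERE status = OLD.status;
        ELSIF TG_OP = ''UPDATE'' AND NEW.status <> OLD.status THEN
            UPDATE customer_status_summary SET cnt = cnt - 1
             WHERE status = OLD.status;
            UPDATE customer_status_summary SET cnt = cnt + 1
             WHERE status = NEW.status;
        END IF;
        RETURN NULL;  -- AFTER trigger, so the return value is ignored
    END;
    ' LANGUAGE plpgsql;

    CREATE TRIGGER customer_status_summary_upd
        AFTER INSERT OR UPDATE OR DELETE ON customer
        FOR EACH ROW EXECUTE PROCEDURE customer_status_summary_trig();

With this in place, the frequent query becomes a cheap read of
customer_status_summary instead of a scan of customer.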
David Fetter wrote:
> On Thu, Jul 28, 2005 at 09:19:49AM -0700, Bryan Field-Elliot wrote:
>> We have this simple query:
>>
>> select status, count(*) from customer group by status;
>>
>> [...]
>
> You're pretty much stuck with either writing triggers that modify a
> cache table or having your performance the way it is now.

How about the new bitmap index? I wonder if that'll result in better
performance for that type of query?
Bryan Field-Elliot <bryan_lists@netmeme.org> writes:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> [...]

Without creating a new table you have three options, none of which will
be very fast:

1) If the entire table fits in RAM then you could try setting
random_page_cost close enough to 1 to cause the index to be used.

2) If it doesn't, but there is a reasonably small number of distinct
values of status, you would hope to see a sequential scan and a hash
aggregate being used. I would expect this to be what you would see with
the default configuration. If not, you might have to increase work_mem
(or sort_mem, depending on your version).

3) If you have a large table with a large set of status values then
Postgres may be compelled to sort the entire table. In that case you
should experiment with work_mem/sort_mem to find the largest value you
can use without inducing swapping. You could also experiment with
putting pgsql_tmp on separate spindles.

None of these will be fast enough for OLTP access like a web page. You
would be best off performing this query periodically and stuffing the
results into a table for quick access.

--
greg
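Greg's first two options are session-settable, so they are cheap to
test against EXPLAIN ANALYZE output; the values below are starting
points to experiment with, not recommendations. The snapshot table in
his last paragraph might look like this (the table name and refresh
scheme are assumptions):

    -- options 1 and 2: per-session experiments
    SET random_page_cost = 1.5;
    SET work_mem = 32768;   -- in KB; called sort_mem before 8.0
    EXPLAIN ANALYZE SELECT status, count(*) FROM customer GROUP BY status;

    -- option 3's fallback: a periodic snapshot for cheap OLTP reads
    CREATE TABLE customer_status_counts AS
        SELECT status, count(*) AS cnt FROM customer GROUP BY status;

    -- refresh from cron every few minutes, in one transaction:
    BEGIN;
    DELETE FROM customer_status_counts;
    INSERT INTO customer_status_counts
        SELECT status, count(*) FROM customer GROUP BY status;
    COMMIT;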
On Thu, 2005-07-28 at 11:19, Bryan Field-Elliot wrote:
> We have this simple query:
>
> select status, count(*) from customer group by status;
>
> [...]

With a query of the form:

select field, count([field|*]) from table where somefield = somecondition
group by field;

the executor is going to have to visit every single row returned by
that where clause. There's no shortcut, because the visibility rules of
MVCC mean you have to look at every tuple IN THE TABLE, not in the
index (that's the way postgresql is built, and it isn't likely to
change soon, because putting the visibility information in indexes is
expensive, and would result in VERY slow updates and very large
indexes).

So the best optimization is to use a selective where clause. If you run
the query with a where clause of something like:

where processdate between '01 july 2005' and '07 july 2005'

then you should get better performance.
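If the report can be date-bounded like that, a composite index lets the
scan touch only the matching rows (the heap still has to be visited for
visibility, per the above). The "processdate" column and the index name
below are hypothetical, taken from the example:

    -- hypothetical column and index names
    CREATE INDEX customer_procdate_status_idx
        ON customer (processdate, status);

    SELECT status, count(*)
      FROM customer
     WHERE processdate BETWEEN '2005-07-01' AND '2005-07-07'
     GROUP BY status;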