Re: Yet Another COUNT(*)...WHERE...question - Mailing list pgsql-general

From Bill Moran
Subject Re: Yet Another COUNT(*)...WHERE...question
Date
Msg-id 20070815115738.352012cb.wmoran@potentialtech.com
Whole thread Raw
In response to Re: Yet Another COUNT(*)...WHERE...question  ("Phoenix Kiula" <phoenix.kiula@gmail.com>)
List pgsql-general
In response to "Phoenix Kiula" <phoenix.kiula@gmail.com>:

> > Yes, optimization. :)  You don't need an exact count to tell someone
> > that there's more data and they can go to it.
>
>
> In general, I agree. But my example of Amazon was only to illustrate
> the point about two queries and why they may be needed. I seem to see
> many more pages than you do, but in any case, Google and Amazon can
> afford to be less precise.
>
> Thanks for the suggestion of using EXPLAIN and parsing an
> approximation, but when you need to show a trader how many trades he
> has made, for instance, then approximation is not a possibility at
> all. Especially not if the numbers sway so wildly --
>
>
> FIRSTDB=# explain select * from trades where t_id = 'kXjha';
>                                     QUERY PLAN
> -----------------------------------------------------------------------------------
>  Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
>    Recheck Cond: ((t_id)::text = 'kXjha'::text)
>    ->  Bitmap Index Scan on trades_tid_date  (cost=0.00..15.67 rows=374 width=0)
>          Index Cond: ((t_id)::text = 'kXjha'::text)
> (4 rows)
>
> FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
>  count
> -------
>   3891
> (1 row)
>
>
>
> Could I do something so that the EXPLAIN showed up with slightly more
> close-to-accurate stats? The above query is just after a "vacuum
> analyze"!

In the above case, you could probably materialize the data with a trigger
that updates a counter in a separate table every time a new trade is
added.  This will give you 100% accurate results with _very_ fast
response time.

Part of the problem is that there's no one answer to your question, there
are multiple approaches to solving it, depending on the details of the
problem and the acceptable time/accuracy of the answers.  Some basic
approaches:

1) Materialize the data.  MySQL actually does this automatically for you
   with MyISAM tables, which is why count(*) is so fast.  But if you
   absolutely need fast, accurate counts, you can build your own
   triggers in PG.  This is unlikely to be practical with all queries.
2) Estimate.  The accuracy of estimates can vary wildly by query and
   how often the database is analyzed, etc.  For something like,
   "show results 1 - 10 of about 50,000", estimates are great and fast,
   but for other cases, not acceptable.  The good news is you can get
   a fast estimate from any query with no up-front work.
3) Accept that sometimes to get accurate answers it's going to take
   time.  Around here, we call it the "Orbitz" technique, because when
   we discuss it, everyone thinks of the "please wait while I process
   your query" page you get from orbitz.com.  You'd be surprised how
   willing your users are to wait, as long as they know they have to
   wait.
4) Throw more hardware at it.  If you absolutely _must_have_ super-
   accurate results faster, then you may need to buy more RAM, faster
   disks and faster CPUs to accomplish it.
5) Come up with something revolutionary that nobody's every thought of
   before.  Good luck with this one.

Of course, all of these ideas are only practical if you've already
ensured that your system is properly tuned.  Crappy values for
shared_buffers and other tuning will lead you to waste time trying
to redesign something that should work just fine, so verify all your
configuration first.  You may be able to get more acceptable estimates
by increasing your statistics targets, for example.

--
Bill Moran
http://www.potentialtech.com

pgsql-general by date:

Previous
From: Decibel!
Date:
Subject: Re: Cluster and MVCC
Next
From: Erik Jones
Date:
Subject: Re: Yet Another COUNT(*)...WHERE...question