Thread: performance of count(*)

performance of count(*)

From
Scott Ribe
Date:
I need to optimize queries that deal with some aggregates regarding resource availability. My specific problem is, I
think,very closely analogous to select count(*)... where... 

I know roughly how to do it, aggregated stats table, triggers appending to it, occasional updates to coalesce entries.
I'djust like to see an example to confirm my own plan and see if I'm missing any details. 

I'm sure I've seen references to articles on ways to do this, but all google is getting me is generic complaints about
count(*)performance and suggestions to use stats for estimated total rows in a table, nothing useful for this. 

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: performance of count(*)

From
Tomas Vondra
Date:
Dne 6.5.2011 20:45, Scott Ribe napsal(a):
> I need to optimize queries that deal with some aggregates regarding
> resource availability. My specific problem is, I think, very closely
> analogous to select count(*)... where...
>
> I know roughly how to do it, aggregated stats table, triggers
> appending to it, occasional updates to coalesce entries. I'd just like
> to see an example to confirm my own plan and see if I'm missing any
> details.
>
> I'm sure I've seen references to articles on ways to do this, but all
> google is getting me is generic complaints about count(*) performance
> and suggestions to use stats for estimated total rows in a table,
> nothing useful for this.
>

Well I guess you got most of that right - just create a table to hold
aggregated values, and then a bunch of triggers to update it. The
structute and implementation of the triggers really depend on your
needs, but in general there are two approaches - eager and lazy.

Eager - the triggers immediately update the aggregates (increment a
count, add a value to the sum etc.). This is a 'primitive' and less
complex solution, suitable when the update is cheap and the aggregated
table is often read.

Lazy - just mark the row in the aggregated table as 'dirty' and
recompute it only if it's read. This is suitable if the table is read
only occasionaly and/or when computing the aggregate is complex (e.g.
needs to reread the whole dataset - as for example computing variance).

The lazy approach is not that usual, a great example how to implement
that is available here:

http://www.pgcon.org/2008/schedule/events/69.en.html

Anyway I'd recommend to start with the eager approach, it's much easier
to implement. You can implement the lazy approach later, if you find out
it's needed.

And you should strive to use HOT feature (if you're on >= 8.4),
especially with the eager approach - it often does a lot of updates and
leads to bloat of the aggregated table. So decrease the fillfactor and
do not index the columns that are updated by the triggers.

regards
Tomas

Re: performance of count(*)

From
Andrew Sullivan
Date:
On Fri, May 06, 2011 at 12:45:23PM -0600, Scott Ribe wrote:

> I need to optimize queries that deal with some aggregates regarding
  resource availability. My specific problem is, I think, very closely
  analogous to select count(*)... where...

If the WHERE clause is fairly selective and indexed, that should be
fast.  Not as fast as estimates based on trigger-written values in
another table, of course, but reasonably fast.  So the first order of
business is usually to find or create indexes that will make SELECT on
the same criteria fast.

It's only unqualified "SELECT count(*)" that is slow.  Generally, the
system table is good enough for that, I find.  (Someone: "How long
will this take?"  Me: "There are about 400 million rows to go
through."  Even if you're off by 50 million at that point, it doesn't
matter.)

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: performance of count(*)

From
Scott Ribe
Date:
On May 6, 2011, at 1:39 PM, Tomas Vondra wrote:

> Anyway I'd recommend to start with the eager approach, it's much easier
> to implement. You can implement the lazy approach later, if you find out
> it's needed.

With the eager approach, I think I'm too likely to get write conflicts. Thanks for the reference to the paper, I
believethat's what I was looking for. 

> And you should strive to use HOT feature (if you're on >= 8.4),
> especially with the eager approach - it often does a lot of updates and
> leads to bloat of the aggregated table. So decrease the fillfactor and
> do not index the columns that are updated by the triggers.

See, that's the kind of info I'm looking for ;-)

On May 6, 2011, at 1:59 PM, Andrew Sullivan wrote:

> If the WHERE clause is fairly selective and indexed, that should be
> fast.  Not as fast as estimates based on trigger-written values in
> another table, of course, but reasonably fast.  So the first order of
> business is usually to find or create indexes that will make SELECT on
> the same criteria fast.

In this case, it depends on the result of a pretty complex join that involves some gnarly time calculations, and
findingthe unmatched rows from one side of an outer join. I really don't think there's a way to optimize the
straight-upquery to be faster than it is, I looked at that for a good long time, explain/analyze and all. Postgres is
usingthe appropriate index to narrow things down as much as it can at the very beginning, it just then has to perform a
heckof a lot of work to finish the join... And it's not taking ***that*** long--it's just that I want it faster! 

> It's only unqualified "SELECT count(*)" that is slow.  Generally, the
> system table is good enough for that, I find.  (Someone: "How long
> will this take?"  Me: "There are about 400 million rows to go
> through."  Even if you're off by 50 million at that point, it doesn't
> matter.)

FYI, I have no need for unqualified select count(*) in this app--just doesn't happen, ever ;-)

Thanks.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: performance of count(*)

From
Andrew Sullivan
Date:
On Fri, May 06, 2011 at 03:43:02PM -0600, Scott Ribe wrote:

> In this case, it depends on the result of a pretty complex join that
> involves some gnarly time calculations, and finding the unmatched
> rows from one side of an outer join.

Yeah, in that case the HOT suggestions are very important.  I strongly
recomment you experiment in a test system with real data and
pathological cases in particular, in order to see what happens when
the outlier cases inevitably, Murphy willing, crop up.  That's not to
say you should arrange your plans for them, but forewarned is
forearmed.

A


--
Andrew Sullivan
ajs@crankycanuck.ca

Re: performance of count(*)

From
Scott Ribe
Date:
On May 6, 2011, at 4:15 PM, Andrew Sullivan wrote:

> Yeah, in that case the HOT suggestions are very important.  I strongly
> recomment you experiment in a test system with real data and
> pathological cases in particular, in order to see what happens when
> the outlier cases inevitably, Murphy willing, crop up.  That's not to
> say you should arrange your plans for them, but forewarned is
> forearmed.

Again thanks. The HOT tip led me down the road of paying attention to my indexes, which led me to a nice realization
abouthow to shrink the overall footprint of the materialized aggregates ;-) Which led me to a technique to seriously
minimizeupdates... 

I didn't have to worry about bloat too much--overall activity level is not huge; the possibility of collisions on
updatesis mostly because users tend to work on the same very small (but ever-shifting) subset of the data at the same
time,but now I think I'm really set! 

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice