Re: Strange count(*) implementation? - Mailing list pgsql-general

From Henk Ernst Blok
Subject Re: Strange count(*) implementation?
Date
Msg-id 417E5031.8040409@utwente.nl
Whole thread Raw
In response to Re: Strange count(*) implementation?  (Tino Wildenhain <tino@wildenhain.de>)
Responses Re: Strange count(*) implementation?  (Tino Wildenhain <tino@wildenhain.de>)
List pgsql-general
Hi Tino,

Tino Wildenhain wrote:
I assume(d) the more expensive statistics (e.g., value distribution
info) are updated only when outdated too much or on request (manual
vacuum). Usually, other/cheap statistics can easily be maintained
incrementally and thus reflect actual table state after each update.   
I remember some discussion about this. But also here the MVCC and
the general call for performance leads to the current solution 
where statistics are only updated on vacuum. With PG8.0 you have
vacuum strategies with better performance which can run more often
as I understand - so while not giving you exact figures your
count() could be estimated at least. 
I need exact figures at the moment due to the type of some scientific experiments I'm running. Approximations are in the pipeline but I need a base-line run without any tricks that affect the accuracy of the numbers involved.
 Of course, the MVCC principle seems to make things a bit more
complicated I understand now. But tracking whether statistics are
dirty has to be in the system anyway. How does it otherwise decide
when to do its own statistics updates? So  if explain can get the most
recent count, why not use it in the count as well if you know the
statistics are still acurate?    
The point is: you dont know it. There is curently no: mark statistics
dirty if table has new tuples or tuples removed. 
OK. Didn't know that, but got the impression by now it worked that way in Postgres.
By the way, a count(*) without any where does occur very frequently if
you are dealing with an OLAP load, which is the case in my setting.
So, I indeed already 'fixed' the performance problem by precomputing
all counts I can predict to be of any use.   
I'm not familar with OLAP specifics, so what is the meaning of the
count() here? What is done with this information? 
OLAP stands for OnLine Analystical Processing, typically meaning heavy queries with a lot of data (for typical examples, see: http://www.tpc.org/
the TPC-H query set in particular). So decision support and datamining are in that area for instance. My topic of interest is IR (information retrieval) in a database context. My experiments behave like an OLAP load at the moment. My current experiments involve a lot of counting and expensive joins as I have to compute certain estimators in a mathematical model I'm working on, hence the importance of the count... ;)
On MySQL each of the 30 queries I have to run took on average about 24 h. As my queries are getting even complexer I'm now trying to find out whether Postgres can do a better job.

Regards,


Henk Ernst
-- 
address: DB group, Computer Science, EEMCS Dept., University of Twente,        PO Box 217, 7500 AE, ENSCHEDE, THE NETHERLANDS
phone:   ++31 (0)53 489 3754 (if no response: 3690)
email:   h.e.blok@utwente.nl
WWW:     http://www.cs.utwente.nl/~blokh

pgsql-general by date:

Previous
From: Mike Mascari
Date:
Subject: Re: Any plans on allowing user-defined triggers to be deferrable?
Next
From: Tino Wildenhain
Date:
Subject: Re: Strange count(*) implementation?