Home > mailing lists

Re: Strange count(*) implementation? - Mailing list pgsql-general

From	Henk Ernst Blok
Subject	Re: Strange count(*) implementation?
Date	October 26, 2004 17:25:15
Msg-id	417E5031.8040409@utwente.nl Whole thread Raw
In response to	Re: Strange count(*) implementation? (Tino Wildenhain <tino@wildenhain.de>)
Responses	Re: Strange count(*) implementation? (Tino Wildenhain <tino@wildenhain.de>)
List	pgsql-general

Tree view

Hi Tino,

Tino Wildenhain wrote:

I assume(d) the more expensive statistics (e.g., value distribution
info) are updated only when outdated too much or on request (manual
vacuum). Usually, other/cheap statistics can easily be maintained
incrementally and thus reflect actual table state after each update.

I remember some discussion about this. But also here the MVCC and
the general call for performance leads to the current solution 
where statistics are only updated on vacuum. With PG8.0 you have
vacuum strategies with better performance which can run more often
as I understand - so while not giving you exact figures your
count() could be estimated at least.

I need exact figures at the moment due to the type of some scientific experiments I'm running. Approximations are in the pipeline but I need a base-line run without any tricks that affect the accuracy of the numbers involved.

 Of course, the MVCC principle seems to make things a bit more
complicated I understand now. But tracking whether statistics are
dirty has to be in the system anyway. How does it otherwise decide
when to do its own statistics updates? So  if explain can get the most
recent count, why not use it in the count as well if you know the
statistics are still acurate?

The point is: you dont know it. There is curently no: mark statistics
dirty if table has new tuples or tuples removed.

OK. Didn't know that, but got the impression by now it worked that way in Postgres.

By the way, a count(*) without any where does occur very frequently if
you are dealing with an OLAP load, which is the case in my setting.
So, I indeed already 'fixed' the performance problem by precomputing
all counts I can predict to be of any use.

I'm not familar with OLAP specifics, so what is the meaning of the
count() here? What is done with this information?

OLAP stands for OnLine Analystical Processing, typically meaning heavy queries with a lot of data (for typical examples, see: http://www.tpc.org/
the TPC-H query set in particular). So decision support and datamining are in that area for instance. My topic of interest is IR (information retrieval) in a database context. My experiments behave like an OLAP load at the moment. My current experiments involve a lot of counting and expensive joins as I have to compute certain estimators in a mathematical model I'm working on, hence the importance of the count... ;)
On MySQL each of the 30 queries I have to run took on average about 24 h. As my queries are getting even complexer I'm now trying to find out whether Postgres can do a better job.

Regards,

Henk Ernst

-- 
address: DB group, Computer Science, EEMCS Dept., University of Twente,        PO Box 217, 7500 AE, ENSCHEDE, THE NETHERLANDS
phone:   ++31 (0)53 489 3754 (if no response: 3690)
email:   h.e.blok@utwente.nl
WWW:     http://www.cs.utwente.nl/~blokh

pgsql-general by date:

From: Mike Mascari
Date: 26 October 2004, 17:18:57
Subject: Re: Any plans on allowing user-defined triggers to be deferrable?

From: Tino Wildenhain
Date: 26 October 2004, 17:38:30
Subject: Re: Strange count(*) implementation?

Re: Strange count(*) implementation? - Mailing list pgsql-general

Previous

Next