Re: Improving count(*) - Mailing list pgsql-hackers

From Gregory Maxwell
Subject Re: Improving count(*)
Date
Msg-id e692861c0511181156q3a59d125qc6a669310699144b@mail.gmail.com
Whole thread Raw
In response to Re: Improving count(*)  ("Merlin Moncure" <merlin.moncure@rcsonline.com>)
Responses Re: Improving count(*)  ("Jim C. Nasby" <jnasby@pervasive.com>)
List pgsql-hackers
On 11/18/05, Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
> > In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> > Server) the leaf level of the narrowest index on the table is scanned,
> > following a linked list of leaf pages.  Leaf pages can be pretty dense
> > under Sybase, because they do use prefix compression.  A count(*)
> > on a table with 100 million rows is going to take a few minutes, but
> it
> > is going to be at least an order of magnitude faster than a data page
> > scan -- maybe two orders of magnitude faster.
>
> MS SQL server (pre 2005) is not an MVCC database, so it's not apples to
> apples with pg.  Many of the people who wander on this list and complain
> about count(*) either come from one of those or some other non-MVCC
> database or worse, a flat-file xbase type system.  A performance
> comparison between MS 2005 and pg would be much more interesting.
> Personally, I don't know what all the fuss is about [although I wouldn't
> complain about an optimization ;)].

count(*) WHERE 1  is indeed a corner case that few to no real
applications should care about... If we were having to choose between
improving that case and preserving the performance and maintainability
of PG then I think the discussion would already be over.

However, some great ideas have been proposed here which would not only
help in that case but would otherwise be quite useful.

*Inclusion of a 'MVCC inflight' bit in indexes which would allow
skipping MVCC checks in clumps of an index scan which have no pending
changes. This would further close the performance gap between PG and
non-MVCC databases for some workloads.
*Introduction of high performance table sampling, which would be
useful in many applications (including counting where there is a where
clause) as well as for testing and adhoc queries.
and
*a estimate_count() that provides the planner estimate, which would
return right away and provide what is really needed most of the time
people try to count(*) on a large table.

So, while this is a silly case to optimize for it's one where it
appears that the proposed solutions will make PG better all around.


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Improving count(*)
Next
From: Greg Stark
Date:
Subject: Re: order by, for custom types