Performance of count(*) on large tables vs SQL Server

From: Andrew Mayo
Date:
In some rather crude comparative performance tests
between PG 8.0.1 on Windows XP and SQL Server 2000, PG
whips SQL Server's ass on

insert into junk (select * from junk)

on a one-column table defined as int.
If we start with a 1-row table and repeatedly execute
this command, PG can take the table from 500K rows to
1M rows in 20 seconds; SQL Server is at least twice as
slow.
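
(For reference, the setup was roughly the following;
the column name here is arbitrary.)

create table junk (i int);
insert into junk values (1);
-- then run this repeatedly, doubling the row count each time:
insert into junk (select * from junk);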

BUT...

SQL Server can do

select count(*) from junk

in almost no time at all, probably because this query
can be optimised to go back and use catalogue
statistics.

PG, on the other hand, appears to do a full table scan
to answer this question, taking nearly 4 seconds to
process the query.

Doing an ANALYZE on the table and also VACUUM did not
seem to affect this.

Can PG find a table's row count more efficiently? This
is not an unusual practice in commercial applications,
which assume that count(*) with no WHERE clause will be
a cheap query and use it, for instance, to test whether
a table is empty (because for Oracle/Sybase/SQL Server,
count(*) is cheap).

(sure, I appreciate there are other ways of doing
this, but I am curious about the way PG works here).



Re: Performance of count(*) on large tables vs SQL Server

From: Shridhar Daithankar
Date:
On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> PG, on the other hand, appears to do a full table scan
> to answer this question, taking nearly 4 seconds to
> process the query.
>
> Doing an ANALYZE on the table and also VACUUM did not
> seem to affect this.
>
> Can PG find a table's row count more efficiently?.
> This is not an unusual practice in commercial
> applications which assume that count(*) with no WHERE
> clause will be a cheap query  - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

First of all, such an assumption is not a good one. It will hurt concurrency
under heavy load, but I know people do use it.

For the specific question: after a vacuum analyze, you can use

select reltuples from pg_class where relname='Foo';

Remember, you will get different results from 'analyze' and 'vacuum
analyze', since the latter actually visits every page in the table and is
hence expected to be more accurate.
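
For example, with the junk table from your test (keep in mind that
reltuples is only an estimate, stored as a float, not an exact current
count):

vacuum analyze junk;
select reltuples from pg_class where relname = 'junk';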

> (sure, I appreciate there are other ways of doing
> this, but I am curious about the way PG works here).

The answer is MVCC and PG's inability to answer the query from an index
alone (the row-visibility information MVCC needs lives in the heap, not in
the index, so every row still has to be checked there). This has been a FAQ
for a long time. Furthermore, PG supports custom aggregates, which
complicates the matter. Most of the PG developers/users think that an
unqualified select count(*) is of no use. You can search the archives for
more details.

 HTH

 Shridhar

Re: Performance of count(*) on large tables vs SQL Server

From: Stef
Date:
Hello Andrew,
    Everything that Shridhar says makes perfect
sense, and, speaking from experience in dealing with
this type of 'problem', everything you say does as
well. Such is life really :)

    I would not be at -all- surprised if Sybase
and Oracle did query rewriting behind the scenes
to send unqualified counts to a temporary table which
holds the row count. For an example of this done in
PostgreSQL (using triggers and a custom procedure),
look into the 'General Bits' newsletter, specifically
http://www.varlena.com/varlena/GeneralBits/49.php

    I know, giving a URL as an answer 'sucks', but,
well, it simply reflects my experience: triggers and
procedures. A rough sketch of the idea is below.
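
    Very roughly, the trigger approach from that article
looks something like this (the table, function and trigger
names here are just made up for illustration, using
Andrew's junk table):

create table junk_rowcount (n bigint not null);
insert into junk_rowcount select count(*) from junk;

create function junk_rowcount_trig() returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        update junk_rowcount set n = n + 1;
        return NEW;
    else  -- DELETE
        update junk_rowcount set n = n - 1;
        return OLD;
    end if;
end;
$$ language plpgsql;

create trigger junk_rowcount_update
    after insert or delete on junk
    for each row execute procedure junk_rowcount_trig();

-- the "cheap count(*)" is then just:
select n from junk_rowcount;

    Note that this serialises all writers on the single
counter row, which is exactly the concurrency hit
Shridhar mentioned.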

    Regards
    Steph

On Tue, Feb 01, 2005 at 06:32:56PM +0530, Shridhar Daithankar wrote:
> On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> > PG, on the other hand, appears to do a full table scan
> > to answer this question, taking nearly 4 seconds to
> > process the query.
> >
> > Doing an ANALYZE on the table and also VACUUM did not
> > seem to affect this.
> >
> > Can PG find a table's row count more efficiently?.
> > This is not an unusual practice in commercial
> > applications which assume that count(*) with no WHERE
> > clause will be a cheap query  - and use it to test if
> > a table is empty, for instance. (because for
> > Oracle/Sybase/SQL Server, count(*) is cheap).
>
> First of all, such an assumption is no good. It should hit concurrency under
> heavy load but I know people do use it.
>
> For the specific question, after a vacuum analyze, you can use
>
> select reltuples from pg_class where relname='Foo';
>
> Remember, you will get different results between 'analyze' and 'vacuum
> analyze', since later actually visit every page in the table and hence is
> expected to be more accurate.
>
> > (sure, I appreciate there are other ways of doing
> > this, but I am curious about the way PG works here).
>
> Answer is MVCC and PG's inability use index alone. This has been a FAQ for a
> loong time.. Furthermore PG has custom aggregates to complicate the matter..
>
> Most of the pg developers/users think that unqualified select count(*) is of
> no use. You can search the archives for more details..
>
>  HTH
>
>  Shridhar
>


Re: Performance of count(*) on large tables vs SQL Server

From: PFC
Date:

> clause will be a cheap query  - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

    To test whether a table is empty, use SELECT EXISTS or any SELECT with
a LIMIT 1...
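
    For instance (a quick sketch, using the junk table from this thread):

select exists (select 1 from junk);
-- or simply
select 1 from junk limit 1;

    Either one stops as soon as it finds a single row, so it is fast even
on a huge table.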