Re: Improving count(*) - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Improving count(*)
Date
Msg-id 1132272483.4959.299.camel@localhost.localdomain
Whole thread Raw
In response to Re: Improving count(*)  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Improving count(*)
Re: Improving count(*)
Re: Improving count(*)
List pgsql-hackers
On Thu, 2005-11-17 at 16:30 -0600, Kevin Grittner wrote:
> In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> Server) the leaf level of the narrowest index on the table is scanned,
> following a linked list of leaf pages.  Leaf pages can be pretty dense
> under Sybase, because they do use prefix compression.  A count(*)
> on a table with 100 million rows is going to take a few minutes, but it
> is going to be at least an order of magnitude faster than a data page
> scan -- maybe two orders of magnitude faster.
> 
> What I don't understand is why people need to do such things so
> frequently that it's a major issue, rather than an occassional
> annoyance.  

Agreed, completely. (And it galls me to agree with multiple, potentially
opposed opinions on my own thread).

The trouble is, people moan and constantly. Perhaps we should stick to
our guns and say, why do you care? From here, I think we should say,
"show me an application package that needs this so badly we'll change
PostgreSQL just for them". Prove it and we'll do it. Kinda polite in the
TODO, but I think we should put something in there that says "things we
haven't yet had any good reason to improve".

> A solution which not only helped the count(*) issue
> but also allowed index scans to skip the trip to the data page to
> see if it's an active version seems like it would boost performance
> overall.  As pointed out elsewhere, it could also allow new
> techniques for vacuum which could be beneficial.
> 
> My view is that when tables get so big that a count(*) takes that
> much time, you don't typiclally need an EXACT count anyway --
> you could normally check the statistics from your nightly analyze.

Amen.

>From here, another proposal. We have a GUC called count_uses_estimate
that is set to off by default. If set to true, then a count(*) will use
the planner logic to estimate number of rows in the table and return
that as the answer, rather than actually count the row. Unless analyze
statistics are not available, in which case it does the real count.

Best Regards, Simon Riggs



pgsql-hackers by date:

Previous
From: Joe Conway
Date:
Subject: Re: Some array semantics issues
Next
From: Tom Lane
Date:
Subject: Re: Some array semantics issues