Thread: docs update for count(*) and index-only scans

docs update for count(*) and index-only scans

From
Josh Kupershmidt
Date:
Hi all,

func.sgml still claims that a sequential scan is the only way to
execute a SELECT COUNT(*) query. I think this note should just be
removed from the current docs, given the existence of index-only
scans; patch attached.

I didn't see any other outright incorrect spots regarding COUNT(*).
There are two EXPLAIN outputs for COUNT(*) queries, in
auto-explain.sgml and ddl.sgml, though those are probably fine as-is.
Josh

Attachment

Re: docs update for count(*) and index-only scans

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> func.sgml still claims that a sequential scan is the only way to
> execute a SELECT COUNT(*) query. I think this note should just be
> removed from the current docs, given the existence of index-only
> scans; patch attached.

Well, it might need adjustment, but I don't think we should remove it
outright.  The people who complain that COUNT(*) is not O(1) are still
going to be complaining.  On tables that are not read-mostly, there's
no reason to expect that index-only scans will even provide a material
speed boost, let alone be close to free.

            regards, tom lane

Re: docs update for count(*) and index-only scans

From
Josh Kupershmidt
Date:
On Tue, Nov 1, 2011 at 6:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> func.sgml still claims that a sequential scan is the only way to
>> execute a SELECT COUNT(*) query. I think this note should just be
>> removed from the current docs, given the existence of index-only
>> scans; patch attached.
>
> Well, it might need adjustment, but I don't think we should remove it
> outright.  The people who complain that COUNT(*) is not O(1) are still
> going to be complaining.  On tables that are not read-mostly, there's
> no reason to expect that index-only scans will even provide a material
> speed boost, let alone be close to free.

Yeah, that's all true. I'd be OK with an adjustment along the lines of
"note: COUNT(*) can be expensive, use judiciously".

But the tone of the existing note suggests that users "may be
surprised" that our COUNT(*) is slower than other RDBMSs. So I guess
I'm wondering, are we really still that much slower than our
competitors for COUNT(*)? Ignoring MyISAM and similar lobotomized
engines, does any competitor have a much-faster way?

Josh

Re: docs update for count(*) and index-only scans

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> On Tue, Nov 1, 2011 at 6:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, it might need adjustment, but I don't think we should remove it
>> outright.  The people who complain that COUNT(*) is not O(1) are still
>> going to be complaining.  On tables that are not read-mostly, there's
>> no reason to expect that index-only scans will even provide a material
>> speed boost, let alone be close to free.

> But the tone of the existing note suggests that users "may be
> surprised" that our COUNT(*) is slower than other RDBMSs. So I guess
> I'm wondering, are we really still that much slower than our
> competitors for COUNT(*)? Ignoring MyISAM and similar lobotomized
> engines, does any competitor have a much-faster way?

It's the "lobotomized engines" that are the problem, IMO --- people
coming from databases like mysql tend to think count(*) just means
reading a table size counter that the engine has anyway.

I'm not really sure how our index-only scans stack up against the likes
of Oracle.  For a read-mostly table they probably have to do a
comparable amount of work, but I'm not real sure how occasional heap
visits (for us) will compare to occasional rollback segement visits (for
them).  And they may be able to scan their indexes in physical order,
which would help a fair amount.  And the whole code path is probably a
good bit more mature and better optimized on that side of the fence,
too.

            regards, tom lane

Re: docs update for count(*) and index-only scans

From
Robert Haas
Date:
On Tue, Nov 1, 2011 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's the "lobotomized engines" that are the problem, IMO --- people
> coming from databases like mysql tend to think count(*) just means
> reading a table size counter that the engine has anyway.

This is probably a much less common misconception than formerly, due
to the rise of InnoDB and the falling-out-of-favor experienced by
MyISAM.

I think some pessimism removal is probably warranted.  Yeah, somebody
else might be faster than us on this test, but that's probably true of
many tests.  And on others we will be faster than them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: docs update for count(*) and index-only scans

From
Scott Marlowe
Date:
On Fri, Nov 4, 2011 at 2:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 1, 2011 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's the "lobotomized engines" that are the problem, IMO --- people
>> coming from databases like mysql tend to think count(*) just means
>> reading a table size counter that the engine has anyway.
>
> This is probably a much less common misconception than formerly, due
> to the rise of InnoDB and the falling-out-of-favor experienced by
> MyISAM.

True.  For instance the Drizzle project simply yanked myisam as an
engine out of mysql for their fork, and made innodb the standard
default table handler.

Re: docs update for count(*) and index-only scans

From
Josh Kupershmidt
Date:
On Fri, Nov 4, 2011 at 4:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 1, 2011 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's the "lobotomized engines" that are the problem, IMO --- people
>> coming from databases like mysql tend to think count(*) just means
>> reading a table size counter that the engine has anyway.
>
> This is probably a much less common misconception than formerly, due
> to the rise of InnoDB and the falling-out-of-favor experienced by
> MyISAM.
>
> I think some pessimism removal is probably warranted.  Yeah, somebody
> else might be faster than us on this test, but that's probably true of
> many tests.  And on others we will be faster than them.

So, if Tom still wants to keep that paragraph and its warning, how
about we just fix the now-incorrect bit at the end? Maybe just tweak
"will be executed" to "will often be executed", or change "using a
sequential scan of the entire table." to "using a sequential scan of
the table, or an index-only scan of one of its indexes".

Josh

Re: docs update for count(*) and index-only scans

From
Marti Raudsepp
Date:
On Tue, Dec 6, 2011 at 05:14, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> Maybe just tweak
> "will be executed" to "will often be executed", or change "using a
> sequential scan of the entire table." to "using a sequential scan of
> the table, or an index-only scan of one of its indexes".

I don't think we need to specify what the planner does at all. How
about simply "will need to access all rows [in the table]"

Also +1 for removing references to "other SQL databases".

Regards,
Marti

Re: docs update for count(*) and index-only scans

From
Robert Haas
Date:
On Tue, Dec 6, 2011 at 11:27 AM, Marti Raudsepp <marti@juffo.org> wrote:
> On Tue, Dec 6, 2011 at 05:14, Josh Kupershmidt <schmiddy@gmail.com> wrote:
>> Maybe just tweak
>> "will be executed" to "will often be executed", or change "using a
>> sequential scan of the entire table." to "using a sequential scan of
>> the table, or an index-only scan of one of its indexes".
>
> I don't think we need to specify what the planner does at all. How
> about simply "will need to access all rows [in the table]"
>
> Also +1 for removing references to "other SQL databases".

I just modified this so that it's not outright wrong any more.  I
think it's still more pessimistic than is warranted, but I wasn't sure
exactly how to rephrase it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company