Thread: COUNT(*) and index-only scans

COUNT(*) and index-only scans

From
Bruce Momjian
Date:
I talked to Robert Haas and he said that index-only scans do not
optimize COUNT(*).  Is this something we can do for PG 9.2?  Is anyone
working on this?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: COUNT(*) and index-only scans

From
Thom Brown
Date:
On 10 October 2011 18:23, Bruce Momjian <bruce@momjian.us> wrote:
> I talked to Robert Haas and he said that index-only scans do not
> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is anyone
> working on this?

Yes it does, provided that there is an appropriate WHERE clause.  But
yes, I think we definitely want this if it's relatively easy.  In
addition to this, it's not always easy to get it to use an index-only
scan even if it's going to be significantly faster.  I'm assuming some
supporting planner work needs to be added too.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Mon, Oct 10, 2011 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I talked to Robert Haas and he said that index-only scans do not
> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is anyone
> working on this?

People usually conflate multiple problems when they talk about
"count(*)".  The usual case people are concerned about would require
materialized views, not index-only scans.


--
greg


Re: COUNT(*) and index-only scans

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:
> I talked to Robert Haas and he said that index-only scans do not
> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is
> anyone working on this?
Well, it's not that it doesn't optimize COUNT(*) -- it's that it
doesn't yet cost the index scan as cheaper than a table scan when
you're accessing every row.

create table t (id int not null primary key);
insert into t select generate_series(1, 1000000);
vacuum freeze analyze;
explain analyze select count(*) from t where id between 500000 and 500010;

That gives you an index-only scan; but without the WHERE clause it
uses a seq scan.  I think it's mainly a matter of doing enough
benchmarks to figure out how best to model the costs of the index
scan so that it can be picked for that case.
-Kevin


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Mon, Oct 10, 2011 at 1:36 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>> I talked to Robert Haas and he said that index-only scans do not
>> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is
>> anyone working on this?
>
> Well, it's not that it doesn't optimize COUNT(*) -- it's that it
> doesn't yet cost the index scan as cheaper than a table scan when
> you're accessing every row.
>
> create table t (id int not null primary key);
> insert into t select generate_series(1, 1000000);
> vacuum freeze analyze;
> explain analyze select count(*) from t
>  where id between 500000 and 500010;
>
> That gives you an index-only scan; but without the WHERE clause it
> uses a seq scan.  I think it's mainly a matter of doing enough
> benchmarks to figure out how best to model the costs of the index
> scan so that it can be picked for that case.

Right now, our costing model for index-only scans is pretty dumb.  It
assumes that using an index-only scan will avoid 10% of the heap
fetches.  That could easily be low, and on an insert-only table or one
where only the recently-updated rows are routinely accessed, it could
also be high.  To use an index-only scan for a full-table COUNT(*),
we're going to have to be significantly smarter, because odds are good
that skipping 10% of the heap fetches won't be sufficient inducement
to the planner to go that route; we are going to need a real number.
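The difference between the fixed assumption and a real number can be sketched numerically; this is an illustrative Python back-of-envelope, not PostgreSQL's actual costing code, and the 98% figure is just an example of a mostly-static table:

```python
# Rough sketch: estimated heap fetches an index-only scan still needs.
# Tuples on all-visible pages need no heap fetch; the rest do.
def estimated_heap_fetches(tuples_scanned, all_visible_fraction):
    return tuples_scanned * (1.0 - all_visible_fraction)

tuples = 1_000_000
fixed = estimated_heap_fetches(tuples, 0.10)     # the current fixed 10% assumption
measured = estimated_heap_fetches(tuples, 0.98)  # a measured, mostly-static table
print(round(fixed), round(measured))             # 900000 vs 20000 fetches
```

With the fixed assumption the index-only scan still pays for 90% of the heap fetches, which is why the planner rarely finds it attractive for a full-table count.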

This isn't just an exercise in costing, though: right now, we don't
even generate a plan to use an index for a full-table scan, because we
assume that it can never be cheaper.  This is actually not quite true
even in previous releases (suppose the table is severely bloated but
the index is not) and it's going to be less true now that we have
index-only scans.  So that's going to need some adjustment, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Bruce Momjian <bruce@momjian.us> wrote:
>> I talked to Robert Haas and he said that index-only scans do not
>> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is
>> anyone working on this?
> Well, it's not that it doesn't optimize COUNT(*) -- it's that it
> doesn't yet cost the index scan as cheaper than a table scan when
> you're accessing every row.

I think what Robert is complaining about is that we won't currently
consider an index that matches neither any WHERE clauses nor ORDER BY,
ie, count(*) over the whole table won't get considered for an index-only
scan, regardless of cost estimates.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> Right now, our costing model for index-only scans is pretty dumb. 
> It assumes that using an index-only scan will avoid 10% of the
> heap fetches.  That could easily be low, and on an insert-only
> table or one where only the recently-updated rows are routinely
> accessed, it could also be high.
As a reality check, I just ran this query on a table in a statewide
copy of our data:

select count(*),
  sum(case when xmin = '2'::xid then 0 else 1 end) as read_heap
  from "CaseHist";

and got:

   count   | read_heap
-----------+-----------
 205765311 |   3934924

So on our real-world database, it would skip something on the order
of 98% of the heap reads, right?
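A quick Python check of those figures:

```python
# Fraction of heap reads an index-only scan could skip on this table,
# using the count and read_heap numbers reported above.
count, read_heap = 205_765_311, 3_934_924
skipped = 1 - read_heap / count
print(f"{skipped:.1%}")  # → 98.1%
```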
> This isn't just an exercise in costing, though: right now, we
> don't even generate a plan to use an index for a full-table scan,
> because we assume that it can never be cheaper.  This is actually
> not quite true even in previous releases (suppose the table is
> severely bloated but the index is not) and it's going to be less
> true now that we have index-only scans.  So that's going to need
> some adjustment, too.
OK.  Thanks for clarifying.
-Kevin


Re: COUNT(*) and index-only scans

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what Robert is complaining about is that we won't
> currently consider an index that matches neither any WHERE clauses
> nor ORDER BY, ie, count(*) over the whole table won't get
> considered for an index-only scan, regardless of cost estimates.
I guess the trick would be to get it to consider such plans only
under some conditions, to avoid explosive growth in planning time
for some types of queries.  Some statistics bucket for the number of
non-frozen tuples in the relation, maybe?
-Kevin


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what Robert is complaining about is that we won't
>> currently consider an index that matches neither any WHERE clauses
>> nor ORDER BY, ie, count(*) over the whole table won't get
>> considered for an index-only scan, regardless of cost estimates.
> I guess the trick would be to get it to consider such plans only
> under some conditions, to avoid explosive growth in planning time
> for some types of queries.  Some statistics bucket for the number of
> non-frozen tuples in the relation, maybe?

My intention was to allow it to consider any covering index.  You're
thinking about the cost estimate, which is really entirely different.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Jeff Janes
Date:
On Mon, Oct 10, 2011 at 10:36 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>> I talked to Robert Haas and he said that index-only scans do not
>> optimize COUNT(*).  Is this something we can do for PG 9.2?  Is
>> anyone working on this?
>
> Well, it's not that it doesn't optimize COUNT(*) -- it's that it
> doesn't yet cost the index scan as cheaper than a table scan when
> you're accessing every row.
>
> create table t (id int not null primary key);
> insert into t select generate_series(1, 1000000);
> vacuum freeze analyze;
> explain analyze select count(*) from t
>  where id between 500000 and 500010;
>
> That gives you an index-only scan; but without the WHERE clause it
> uses a seq scan.

If you convert the WHERE clause to "where id is not null" it uses the
index-only scan again, but only if you also nudge it with
enable_seqscan=off.

I'm not sure why it needs the nudge in one case but not the other.

Cheers,

Jeff


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My intention was to allow it to consider any covering index.  You're
> thinking about the cost estimate, which is really entirely different.
>

Is there any reason to consider more than one? I would have expected
the narrowest one to be the best choice. There's something to be said
for using the same index consistently but we already have that problem
and make no attempt to do that. And partial indexes might be better
but then we would already be considering them if their constraints are
satisfied.

--
greg


Re: COUNT(*) and index-only scans

From
"Kevin Grittner"
Date:
> Jeff Janes  wrote:
> Kevin Grittner  wrote:
>> create table t (id int not null primary key);
>> insert into t select generate_series(1, 1000000);
>> vacuum freeze analyze;
>> explain analyze select count(*) from t
>> where id between 500000 and 500010;
>>
>> That gives you an index-only scan; but without the WHERE clause it
>> uses a seq scan.
> 
> If you convert the where clause to "where id is not null" it uses
> the index only scan again, but only if you nudge it too with
> enable_seqscan=off.
Clever way to get a full-table test.

It turns out that for the above, with your trick to use the index
only scan, it comes out 12% faster to do a seqscan, even when the
table and index are fully cached (based on the average time of ten
runs each way).  There's very little overlap, so the difference looks
real.  But that's on a very narrow record, having just the one column
used in the index.  I added one wide column like this:

alter table t add column x text;
update t set x = (repeat(random()::text, (random() * 100)::int));
cluster t USING t_pkey;
vacuum freeze analyze;

With that change the index-only scan time remained unchanged, while
the seqscan time grew to about 2.6 times the index only scan time.
That was mildly surprising for me, considering it was all still
cached.
-Kevin


Re: COUNT(*) and index-only scans

From
jesper@krogh.cc
Date:
>> Jeff Janes  wrote:
>> Kevin Grittner  wrote:
>
>>> create table t (id int not null primary key);
>>> insert into t select generate_series(1, 1000000);
>>> vacuum freeze analyze;
>>> explain analyze select count(*) from t
>>> where id between 500000 and 500010;
>>>
>>> That gives you an index-only scan; but without the WHERE clause it
>>> uses a seq scan.
>>
>> If you convert the where clause to "where id is not null" it uses
>> the index only scan again, but only if you nudge it too with
>> enable_seqscan=off.
>
> Clever way to get a full-table test.
>
> It turns out that for the above, with your trick to use the index
> only scan, it comes out 12% faster to do a seqscan, even when the
> table and index are fully cached (based on the average time of ten
> runs each way).  There's very little overlap, so the difference looks
> real.  But that's on a very narrow record, having just the one column
> used in the index.  I added one wide column like this:
>
> alter table t add column x text;
> update t set x = (repeat(random()::text, (random() * 100)::int));
> cluster t USING t_pkey;
> vacuum freeze analyze;
>
> With that change the index-only scan time remained unchanged, while
> the seqscan time grew to about 2.6 times the index only scan time.
> That was mildly surprising for me, considering it was all still
> cached.

Moving data around in memory isn't that cheap, and the above change has
probably made the row size 5 times larger.  However, if you added 4000
bytes instead of 100, you should be back to the previous results, since
the data then ends up (most likely; otherwise add a bit more) in a
toast table.

Jesper




Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Mon, Oct 10, 2011 at 3:15 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> Right now, our costing model for index-only scans is pretty dumb.
>> It assumes that using an index-only scan will avoid 10% of the
>> heap fetches.  That could easily be low, and on an insert-only
>> table or one where only the recently-updated rows are routinely
>> accessed, it could also be high.
>
> As a reality check, I just ran this query on a table in a statewide
> copy of our data:
>
> select count(*),
>  sum(case when xmin = '2'::xid then 0 else 1 end) as read_heap
>  from "CaseHist";
>
> and got:
>
>   count   | read_heap
> -----------+-----------
>  205765311 |   3934924
>
> So on our real-world database, it would skip something on the order
> of 98% of the heap reads, right?

Yeah, if it's scanning the whole table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > My intention was to allow it to consider any covering index.  You're
> > thinking about the cost estimate, which is really entirely different.
> >
> 
> Is there any reason to consider more than one? I would have expected
> the narrowest one to be the best choice. There's something to be said
> for using the same index consistently but we already have that problem
> and make no attempt to do that. And partial indexes might be better
> but then we would already be considering them if their constraints are
> satisfied.

Actually, I think the smallest non-partial one on disk might be the best
--- that is very easy to find out.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Tue, Oct 11, 2011 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Greg Stark wrote:
>> On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > My intention was to allow it to consider any covering index.  You're
>> > thinking about the cost estimate, which is really entirely different.
>> >
>>
>> Is there any reason to consider more than one? I would have expected
>> the narrowest one to be the best choice. There's something to be said
>> for using the same index consistently but we already have that problem
>> and make no attempt to do that. And partial indexes might be better
>> but then we would already be considering them if their constraints are
>> satisfied.
>
> Actually, I think the smallest non-partial one on disk might be the best
> --- that is very easy to find out.

I doubt there is any need to write special-purpose code to decide
which index ought to be used for a full table scan.  We can just throw
all of the otherwise-useless indexes into the costing machinery with
empty pathkeys, and let them duke it out.  All but the best one will
be instantly discarded, and the best one will either beat or lose to a
sequential scan.  All of this will happen before we start trying to
build join paths, so there's no combinatorial explosion in planning
time - it'll just be a straightforward cost comparison between plans
with identical pathkeys, and should be quite fast.

The real issue is that the costing estimates need to be accurate, and
that's where the rubber hits the road.  Otherwise, even if we pick the
right way to scan the table, we may do silly things up the line when
we go to start constructing the join order.  I think we need to beef
up ANALYZE to gather statistics on the fraction of the pages that are
marked all-visible, or maybe VACUUM should gather that information.
The trouble is that if we VACUUM and then ANALYZE, we'll often get
back a value very close to 100%, but then the real value may diminish
quite a bit before the next auto-analyze fires.  I think if we can
figure out what to do about that problem we'll be well on our way...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Simon Riggs
Date:
On Tue, Oct 11, 2011 at 5:45 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My intention was to allow it to consider any covering index.  You're
>> thinking about the cost estimate, which is really entirely different.
>>
>
> Is there any reason to consider more than one? I would have expected
> the narrowest one to be the best choice. There's something to be said
> for using the same index consistently but we already have that problem
> and make no attempt to do that. And partial indexes might be better
> but then we would already be considering them if their constraints are
> satisfied.

You raise a fantastic idea: use an index's frequency of use as a
factor in the cost of optimising a query.

We have previously discussed the idea of using the RAM residency of an
index to control the cost. That is difficult to judge.

Using the long term prevalence of usage as a weighting factor makes a
great deal of sense for queries that could potentially utilise
multiple indexes. That information is readily available and directly
applicable. The prevalence of use directly drives RAM residency, so it
makes sense to use the causal factor as input to the cost.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COUNT(*) and index-only scans

From
Josh Berkus
Date:
> The trouble is that if we VACUUM and then ANALYZE, we'll often get
> back a value very close to 100%, but then the real value may diminish
> quite a bit before the next auto-analyze fires.  I think if we can
> figure out what to do about that problem we'll be well on our way...

It's not so much an issue of when the last auto-analyze was as an issue
of the number of rows in write transactions against that table in the
last X minutes.  This is where it really hurts us that
pg_stat_user_tables is not time-based.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Tue, Oct 11, 2011 at 3:18 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> The trouble is that if we VACUUM and then ANALYZE, we'll often get
>> back a value very close to 100%, but then the real value may diminish
>> quite a bit before the next auto-analyze fires.  I think if we can
>> figure out what to do about that problem we'll be well on our way...
>
> It's not so much an issue of when the last auto-analyze was as an issue
> of the number of rows in write transactions against that table in the
> last X minutes.  This is where it really hurts us that
> pg_stat_user_tables is not time-based.

The number of write transactions in the last X minutes seems pretty
much irrelevant.

What matters is the number of previously-all-visible pages written
since the last vacuum.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Jeff Davis
Date:
On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote:
> The real issue is that the costing estimates need to be accurate, and
> that's where the rubber hits the road.  Otherwise, even if we pick the
> right way to scan the table, we may do silly things up the line when
> we go to start constructing the join order.  I think we need to beef
> up ANALYZE to gather statistics on the fraction of the pages that are
> marked all-visible, or maybe VACUUM should gather that information.
> The trouble is that if we VACUUM and then ANALYZE, we'll often get
> back a value very close to 100%, but then the real value may diminish
> quite a bit before the next auto-analyze fires.  I think if we can
> figure out what to do about that problem we'll be well on our way...

Can you send stats messages to keep track when you unset a bit in the
VM? That might allow it to be more up-to-date.

Regards,
    Jeff Davis



Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Wed, Oct 12, 2011 at 2:50 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote:
>> The real issue is that the costing estimates need to be accurate, and
>> that's where the rubber hits the road.  Otherwise, even if we pick the
>> right way to scan the table, we may do silly things up the line when
>> we go to start constructing the join order.  I think we need to beef
>> up ANALYZE to gather statistics on the fraction of the pages that are
>> marked all-visible, or maybe VACUUM should gather that information.
>> The trouble is that if we VACUUM and then ANALYZE, we'll often get
>> back a value very close to 100%, but then the real value may diminish
>> quite a bit before the next auto-analyze fires.  I think if we can
>> figure out what to do about that problem we'll be well on our way...
>
> Can you send stats messages to keep track when you unset a bit in the
> VM? That might allow it to be more up-to-date.

In theory, that seems like it would work, although I'm a little
worried about the overhead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 12, 2011 at 2:50 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote:
>>> The real issue is that the costing estimates need to be accurate, and
>>> that's where the rubber hits the road.

>> Can you send stats messages to keep track when you unset a bit in the
>> VM? That might allow it to be more up-to-date.

> In theory, that seems like it would work, although I'm a little
> worried about the overhead.

I think it's overkill, and possibly unpleasantly unstable as well.
For the initial attack on this we should just have VACUUM and ANALYZE
count the number of all-visible blocks and store that in pg_class along
with the tuple-count statistics.  There's no reason to think that this
will be any worse than the way we deal with dead tuple counts, for
instance.

What bothers me considerably more is the issue about how specific
queries might see an all-visible fraction that's very substantially
different from the table's overall ratio, especially in examples such as
historical-data tables where most of the update and query activity has
to do with recently-added rows.  I don't see any practical way to attack
that problem with statistics; we're just going to have to adopt some
estimation rule.  What I suggest as a first cut for that is: simply
derate the visibility fraction as the fraction of the table expected to
be scanned gets smaller.  That is, if the query fetches nearly all of
the table, take the stored visibility ratio at face value; if it fetches
only one block, never believe that that will be an all-visible block;
and in general if we're expecting to read a fraction f of the pages,
multiply the whole-table visibility ratio by f before using it in the
cost estimate.  This amounts to assuming that the historical-data case
is the usual case, but I'm not sure that's unfair.
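The derating rule described above can be sketched in a few lines of Python; this is an illustration of the proposal, not the actual implementation, and the 0.95 ratio is just an example:

```python
# Derate the stored whole-table all-visible ratio by the fraction f of
# the table the query is expected to read.
def derated_visibility(stored_ratio, f):
    return stored_ratio * f

print(derated_visibility(0.95, 1.0))    # near-full scan: ratio at face value
print(derated_visibility(0.95, 0.001))  # tiny scan: effectively never all-visible
```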
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think it's overkill, and possibly unpleasantly unstable as well.
> For the initial attack on this we should just have VACUUM and ANALYZE
> count the number of all-visible blocks and store that in pg_class along
> with the tuple-count statistics.  There's no reason to think that this
> will be any worse than the way we deal with dead tuple counts, for
> instance.

So I have a theory.

Assuming you're in a steady-state situation the amount of all-visible
blocks will fluctuate from a high just after vacuum to a low just
before the next vacuum. There are other ways a block can be marked
all-visible but for the most part I would expect the fraction to go
steadily down until vacuum comes along and cleans things up.

So if vacuum tracked the fraction of blocks marked all-visible
*before* it processed them and the fraction it marked all-visible
after processing we have an upper and lower bound. If we knew how long
it's been since vacuum we could interpolate between those, or we could
just take the mean, or we could take the lower bound as a conservative
estimate.
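The bounding idea can be sketched as follows (illustrative numbers, not an actual implementation):

```python
# Vacuum records the all-visible fraction after it runs (the high) and
# the fraction it found before processing (the low); interpolate
# linearly by how far we are into the vacuum cycle.
def interpolated_visibility(after_vacuum, before_vacuum, cycle_progress):
    # cycle_progress: 0.0 just after a vacuum, 1.0 just before the next
    return after_vacuum + (before_vacuum - after_vacuum) * cycle_progress

print(interpolated_visibility(0.99, 0.80, 0.5))  # halfway through: the mean
print(interpolated_visibility(0.99, 0.80, 1.0))  # conservative lower bound
```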

> What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction
> of the table expected to be scanned gets smaller.

I think there's a statistically more rigorous way of accomplishing the
same thing. If you treat the pages we estimate we're going to read as
a random sample of the population of pages then your expected value is
the fraction of the overall population that is all-visible but your
95th percentile confidence interval will be, uh, a simple formula we
can compute but I don't recall off-hand.
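Presumably the formula in question is the standard normal approximation for a sampled proportion: reading n pages treated as a random sample from a table whose all-visible fraction is p, the 95% interval around the observed fraction is p ± 1.96·sqrt(p(1−p)/n). A sketch:

```python
import math

# 95% confidence interval for the all-visible fraction among n sampled
# pages, clamped to [0, 1].
def visibility_interval(p, n, z=1.96):
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

print(visibility_interval(0.98, 100))     # small scan: wide interval
print(visibility_interval(0.98, 100000))  # large scan: tight interval
```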

This gets back to a discussion long-ago of what estimates the planner
should be using. It currently uses all expected values but in many
cases it would be valuable if the planner knew what the standard
deviation of those estimates was. It might sometimes be better to pick
a plan that's slightly worse on average but is less likely to be much
worse.

--
greg


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What bothers me considerably more is the issue about how specific
> queries might see an all-visible fraction that's very substantially
> different from the table's overall ratio, especially in examples such as
> historical-data tables where most of the update and query activity has
> to do with recently-added rows.  I don't see any practical way to attack
> that problem with statistics; we're just going to have to adopt some
> estimation rule.  What I suggest as a first cut for that is: simply
> derate the visibility fraction as the fraction of the table expected to
> be scanned gets smaller.  That is, if the query fetches nearly all of
> the table, take the stored visibility ratio at face value; if it fetches
> only one block, never believe that that will be an all-visible block;
> and in general if we're expecting to read a fraction f of the pages,
> multiply the whole-table visibility ratio by f before using it in the
> cost estimate.  This amounts to assuming that the historical-data case
> is the usual case, but I'm not sure that's unfair.

I don't think that's an unfair assumption -- in fact I think it's
exactly the right assumption -- but I'm worried about how the math
works out with that specific proposal.

- Suppose VACUUM processes the table and makes it all-visible.  Then,
somebody comes along and updates one tuple on every page, making them
all not-all-visible, but not triggering VACUUM because we're nowhere
close to the 20% threshold.  Now COUNT(*) will think it should use an
index-scan, but really... not so much.  In fact, even if it's only
that a tuple has been updated on 25% of the pages, we're probably in
trouble.

- Suppose the table has a million rows and we're going to read 100 of
them, or 0.01%.  Now it might appear that a covering index has a
negligible advantage over a non-covering index, but in fact I think we
still want to err on the side of trying to use the covering index.  In
fact, even if we're only reading a single row, we probably still
generally want to pick up the covering index, to cater to the case
where someone is doing primary key fetches against a gigantic table
and hoping that index-only scans will save them from random I/O hell.
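The arithmetic behind the first scenario can be made concrete (the page and tuple counts here are illustrative, not from any real table):

```python
# One updated tuple per page clears every all-visible bit while touching
# only a small share of the rows, far below the 20% vacuum threshold.
pages, tuples_per_page = 10_000, 100
rows_updated = pages                 # one tuple on every page
total_rows = pages * tuples_per_page
all_visible_pages = 0                # every page now has a modified tuple

print(rows_updated / total_rows)     # 1% of rows updated...
print(all_visible_pages / pages)     # ...but 0% of pages remain all-visible
```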

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Assuming you're in a steady-state situation the amount of all-visible
> blocks will fluctuate from a high just after vacuum to a low just
> before the next vacuum. There are other ways a block can be marked
> all-visible but for the most part I would expect the fraction to go
> steadily down until vacuum comes along and cleans things up.

> So if vacuum tracked the fraction of blocks marked all-visible
> *before* it processed them and the fraction it marked all-visible
> after processing we have an upper and lower bound. If we knew how long
> it's been since vacuum we could interpolate between those, or we could
> just take the mean, or we could take the lower bound as a conservative
> estimate.

I thought of that too, but we don't do the comparable thing for dead
tuple counts, and I am not convinced that we should do it for
visibility.  I'd rather have a simple rule that "it's right immediately
after VACUUM", so that at least trivial cases like read-only tables
work correctly.

>> What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction
>> of the table expected to be scanned gets smaller.

> I think there's a statistically more rigorous way of accomplishing the
> same thing. If you treat the pages we estimate we're going to read as
> a random sample of the population of pages then your expected value is
> the fraction of the overall population that is all-visible but your
> 95th percentile confidence interval will be, uh, a simple formula we
> can compute but I don't recall off-hand.

The problem is precisely that the pages a query is going to read are
likely to *not* be a random sample, but to be correlated with
recently-dirtied pages.

> ... It currently uses all expected values but in many
> cases it would be valuable if the planner knew what the standard
> deviation of those estimates was.

No doubt, but I'm not volunteering to fix that before we can have a
non-toy estimate for index-only scans.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What bothers me considerably more is the issue about how specific
>> queries might see an all-visible fraction that's very substantially
>> different from the table's overall ratio,

> - Suppose VACUUM processes the table and makes it all-visible.  Then,
> somebody comes along and updates one tuple on every page, making them
> all not-all-visible, but not trigger VACUUM because we're nowhere
> close the 20% threshold.  Now COUNT(*) will think it should use an
> index-scan, but really... not so much.  In fact, even if it's only
> that a tuple has been updated on 25% of the pages, we're probably in
> trouble.

Yeah, but that would be a pretty unlucky pattern, and in any case the
fix for it is going to be to make autovacuum more aggressive.

> - Suppose the table has a million rows and we're going to read 100 of
> them, or 0.01%.  Now it might appear that a covering index has a
> negligible advantage over a non-covering index, but in fact I think we
> still want to err on the side of trying to use the covering index.

Given that fact pattern we still will, I think.  We'll still prefer an
indexscan over a seqscan, for sure.  In any case, if you believe the
assumption that those 100 rows are more likely to be recently-dirtied
than the average row, I'm not sure why you think we should be trying to
force an assumption that index-only will succeed here.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Aidan Van Dyk
Date:
On Wed, Oct 12, 2011 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> - Suppose the table has a million rows and we're going to read 100 of
>> them, or 0.01%.  Now it might appear that a covering index has a
>> negligible advantage over a non-covering index, but in fact I think we
>> still want to err on the side of trying to use the covering index.
>
> Given that fact pattern we still will, I think.  We'll still prefer an
> indexscan over a seqscan, for sure.  In any case, if you believe the
> assumption that those 100 rows are more likely to be recently-dirtied
> than the average row, I'm not sure why you think we should be trying to
> force an assumption that index-only will succeed here.

The elephant in the room is that the index-only-scan really doesn't
save a *whole* lot if the heap pages are already in shared buffers.
But it matters a *lot* when the heap pages are not in shared buffers
(both ways: saving I/O, or causing lots of random I/O).

Can we hope that if pages are not in shared buffers, they are not
recently modified, so hopefully both all visible, and have the VM
bit set?  Or does the table-based nature of vacuum mean there is no
value there?

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Wed, Oct 12, 2011 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction
>>> of the table expected to be scanned gets smaller.
>
>> I think there's a statistically more rigorous way of accomplishing the
>> same thing. If you treat the pages we estimate we're going to read as
>> a random sample of the population of pages then your expected value is
>> the fraction of the overall population that is all-visible but your
>> 95th percentile confidence interval will be, uh, a simple formula we
>> can compute but I don't recall off-hand.
>
> The problem is precisely that the pages a query is going to read are
> likely to *not* be a random sample, but to be correlated with
> recently-dirtied pages.

Sure, but I was suggesting aiming for the nth percentile rather than a
linear factor which I don't know has any concrete meaning.
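The formula Greg says he doesn't recall off-hand can be sketched.  Treating the pages the scan will touch as a sample (without replacement) of the table's pages, a one-sided lower confidence bound on the all-visible fraction follows from a normal approximation to the hypergeometric distribution.  This is purely illustrative — the function name and the choice of z are assumptions, not planner code:

```python
import math

def conservative_visible_fraction(p, n_pages, total_pages, z=1.645):
    """Lower confidence bound on the all-visible fraction for a scan.

    Treats the n_pages the scan will touch as a random sample, drawn
    without replacement, of total_pages, of which fraction p is
    all-visible.  Uses a normal approximation to the hypergeometric
    distribution; z=1.645 gives roughly a one-sided 95th-percentile
    bound, so small scans get derated more than large ones.
    """
    if n_pages <= 0 or total_pages <= 1:
        return p
    # finite-population correction: sampling most of the table
    # shrinks the uncertainty toward zero
    fpc = (total_pages - n_pages) / (total_pages - 1)
    stddev = math.sqrt(p * (1 - p) / n_pages * fpc)
    return max(0.0, p - z * stddev)
```

For a 90%-visible million-page table, a 100-page scan is derated to about 0.85, while scanning the whole table keeps the face-value 0.9 — which is the behavior the "derate as the scanned fraction gets smaller" proposal is after.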


-- 
greg


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Wed, Oct 12, 2011 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem is precisely that the pages a query is going to read are
>> likely to *not* be a random sample, but to be correlated with
>> recently-dirtied pages.

> Sure, but I was suggesting aiming for the nth percentile rather than a
> linear factor which I don't know has any concrete meaning.

Well, I have no problem with using a more complicated estimation
equation, but it might be nice to get some field experience with the
thing before we start complicating matters.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
"Kevin Grittner"
Date:
Aidan Van Dyk <aidan@highrise.ca> wrote:
> The elephant in the room is that the index-only-scan really
> doesn't save a *whole* lot if the heap pages are already in shared
> buffers.
It's not hard to create a simple test case where it's about three
times slower to go to cached heap pages than to use the values from
the index.  That was just my first try, so it's not likely to be a
real "worst case", although I was using the default shared_buffers
size, so a lot of the heap pages probably came from the OS cache,
rather than being in shared memory.
> But it matters a *lot* when the heap pages are not in shared
> buffers
Yeah, obviously it matters more if you actually need to add a random
disk read.
-Kevin


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Wed, Oct 12, 2011 at 4:26 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> But it matters a *lot* when the heap pages are not in shared
>> buffers
>
> Yeah, obviously it matters more if you actually need to add a random
> disk read.

To be fair the indexes are also random I/O. So the case that really
matters is when the index fits in RAM but the heap does not.

-- 
greg


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
> The elephant in the room is that the index-only-scan really doesn't
> save a *whole* lot if the heap pages are already in shared buffers.
> But it matters a *lot* when the heap pages are not in shared buffers
> (both ways, saving IO, or causing lots of random IO)

> Can we hope that if pages are not in shared buffers, they are not
> recently modified, so hopefully both all visible, and have the VM
> bit set?  Or does the table-based nature of vacuum mean there is no
> value there?

Hmm, that's an interesting point.  If you suppose that recently-modified
pages are likely to still be in memory then it could well be that an
index-only scan is relatively cheap (i.e., not many actual disk reads)
no matter whether it hits recently-modified pages or not.  So maybe the
first cut should just be to measure the overall visibility fraction and
use that at face value.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Wed, Oct 12, 2011 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What bothers me considerably more is the issue about how specific
>>> queries might see an all-visible fraction that's very substantially
>>> different from the table's overall ratio,
>
>> - Suppose VACUUM processes the table and makes it all-visible.  Then,
>> somebody comes along and updates one tuple on every page, making them
>> all not-all-visible, but not triggering VACUUM because we're nowhere
>> close to the 20% threshold.  Now COUNT(*) will think it should use an
>> index-scan, but really... not so much.  In fact, even if it's only
>> that a tuple has been updated on 25% of the pages, we're probably in
>> trouble.
>
> Yeah, but that would be a pretty unlucky pattern, and in any case the
> fix for it is going to be to make autovacuum more aggressive.

Hmm, maybe.

>> - Suppose the table has a million rows and we're going to read 100 of
>> them, or 0.01%.  Now it might appear that a covering index has a
>> negligible advantage over a non-covering index, but in fact I think we
>> still want to err on the side of trying to use the covering index.
>
> Given that fact pattern we still will, I think.  We'll still prefer an
> indexscan over a seqscan, for sure.  In any case, if you believe the
> assumption that those 100 rows are more likely to be recently-dirtied
> than the average row, I'm not sure why you think we should be trying to
> force an assumption that index-only will succeed here.

I'm not concerned about an index scan vs. a sequential scan here.  I'm
concerned about it being impossible for the DBA to get an index-only
scan when s/he wants it very badly.  The current (stupid) formula
handles this case just about perfectly - it will prefer a smaller
index over a larger one, except when a covering index is available, in
which case it will prefer the smallest covering index.  That sounds
exactly right to me.  We get that behavior because the 10% of heap
fetches that we're assuming we'll get to skip is larger than the
penalty for using a bigger index.  If we take out 10% and replace it
by all_visible_percentage * fraction_of_tuples_fetched, then that 10%
is going to drop to some infinitesimally small value on single row
fetches from giant tables.  But that's exactly one of the cases for
which people want index-only scans in the first place.  It's no better
to be overly pessimistic here than it is to be overly optimistic.  If
the table is 90% all-visible, the probability of our finding an
all-visible row is probably not 90%.  But it's probably not 0.01% or
0.0001% either.
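The arithmetic behind that objection is easy to see.  A hypothetical helper (the names and the exact formula are illustrative renderings of the message's description, not planner code) contrasting the current flat 10% heap-fetch discount with the proposed all_visible_percentage * fraction_of_tuples_fetched replacement:

```python
def assumed_skip_fraction(all_visible_frac, tuples_fetched, table_tuples,
                          flat_ten_percent=False):
    """Fraction of heap fetches an index-only scan is assumed to skip.

    flat_ten_percent=True models the current flat 10% assumption;
    False models the proposed product of the table's all-visible
    fraction and the scan's selectivity.  Purely illustrative.
    """
    if flat_ten_percent:
        return 0.10
    selectivity = tuples_fetched / table_tuples
    return all_visible_frac * selectivity
```

For a single-row fetch from a 90%-visible million-row table, the product form yields 0.9 * 0.000001 = 9e-7 — effectively zero, so the covering index loses its edge over the smaller index, which is the over-pessimism being complained about here.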

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm not concerned about an index scan vs. a sequential scan here.  I'm
> concerned about it being impossible for the DBA to get an index-only
> scan when s/he wants it very badly.  The current (stupid) formula
> handles this case just about perfectly - it will prefer a smaller
> index over a larger one, except when a covering index is available, in
> which case it will prefer the smallest covering index.  That sounds
> exactly right to me.  We get that behavior because the 10% of heap
> fetches that we're assuming we'll get to skip is larger than the
> penalty for using a bigger index.  If we take out 10% and replace it
> by all_visible_percentage * fraction_of_tuples_fetched, then that 10%
> is going to drop to some infinitesimally small value on single row
> fetches from giant tables.  But that's exactly one of the cases for
> which people want index-only scans in the first place.  It's no better
> to be overly pessimistic here than it is to be overly optimistic.  If
> the table is 90% all-visible, the probability of our finding an
> all-visible row is probably not 90%.  But it's probably not 0.01% or
> 0.0001% either.

I think you're overstating the size of the problem.  Given that fact
pattern, the thing will choose an indexscan anyway, because it's the
cheapest alternative even under traditional costing rules.  And it will
choose to use an index-only scan if the index is covering.  It doesn't
matter what the exact cost estimate is.

The place where the decision is actually somewhat hard, IMO, is where
you're pulling a small part of the table but significantly more than one
row, and the traditional best choice would be a bitmap scan.  Now we
have to guess whether possibly avoiding heap fetches is better than
batching the fetches, and it doesn't seem open-and-shut to me.

But having said that, I see some credibility in Aidan's suggestion that
pages that actually have to be fetched from disk are disproportionately
likely to be all-visible.  That would counteract the history-table
effect of correlation between current reads and recent changes,
probably not perfectly, but to a considerable extent.  So right at the
moment I'm inclined to just apply the most-recently-measured visibility
fraction at face value.  We shouldn't complicate matters more than that
until we have more field experience to tell us what really happens.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Wed, Oct 12, 2011 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm not concerned about an index scan vs. a sequential scan here.  I'm
>> concerned about it being impossible for the DBA to get an index-only
>> scan when s/he wants it very badly.  The current (stupid) formula
>> handles this case just about perfectly - it will prefer a smaller
>> index over a larger one, except when a covering index is available, in
>> which case it will prefer the smallest covering index.  That sounds
>> exactly right to me.  We get that behavior because the 10% of heap
>> fetches that we're assuming we'll get to skip is larger than the
>> penalty for using a bigger index.  If we take out 10% and replace it
>> by all_visible_percentage * fraction_of_tuples_fetched, then that 10%
>> is going to drop to some infinitesimally small value on single row
>> fetches from giant tables.  But that's exactly one of the cases for
>> which people want index-only scans in the first place.  It's no better
>> to be overly pessimistic here than it is to be overly optimistic.  If
>> the table is 90% all-visible, the probability of our finding an
>> all-visible row is probably not 90%.  But it's probably not 0.01% or
>> 0.0001% either.
>
> I think you're overstating the size of the problem.  Given that fact
> pattern, the thing will choose an indexscan anyway, because it's the
> cheapest alternative even under traditional costing rules.  And it will
> choose to use an index-only scan if the index is covering.  It doesn't
> matter what the exact cost estimate is.

Maybe I'm not being clear.  The case I'm worried about is when you
have a table with an id column (an integer) and a name column (text).
You have a unique index on (id), and an index on (id, name).  You then
do SELECT name FROM foo WHERE id = ?.  I understand it's going to pick
*an* index-scan, but which index is it going to pick?  The unique
index, because it's smaller, or the other one, because it's covering?
I think if the table is large your proposal will lead to ignoring the
covering index in favor of the smaller one, and I submit that's not
what we want, because, on the average, the index-only approach is
going to succeed often enough to make using the covering index
worthwhile.

> The place where the decision is actually somewhat hard, IMO, is where
> you're pulling a small part of the table but significantly more than one
> row, and the traditional best choice would be a bitmap scan.  Now we
> have to guess whether possibly avoiding heap fetches is better than
> batching the fetches, and it doesn't seem open-and-shut to me.

Yes, I agree.

I was actually wondering if there is some way we could make index-only
scans work for bitmap index scans.  Something like this: If the index
is covering, then as we retrieve each tuple, we check whether the page
is all-visible.  If so, we return the data from the index tuple.  If
not, we save the TID for later.  Then, when we get done scanning the
index, we go back and fetch all the pages containing saved TIDs in
ascending block number order.  The trouble is that I think you could
get in some trouble if you use a TID bitmap here, because if you
lossify the bitmap then you have to make sure you can't return a tuple
that you already handled with the index-only approach (imagine that
the visibility map bit gets cleared partway through the scan).  All in
all this seems pretty complicated...

> But having said that, I see some credibility in Aidan's suggestion that
> pages that actually have to be fetched from disk are disproportionately
> likely to be all-visible.  That would counteract the history-table
> effect of correlation between current reads and recent changes,
> probably not perfectly, but to a considerable extent.  So right at the
> moment I'm inclined to just apply the most-recently-measured visibility
> fraction at face value.  We shouldn't complicate matters more than that
> until we have more field experience to tell us what really happens.

Fetches from disk aren't the only problem; thrashing shared_buffers is
pretty expensive, too.  But it's an interesting point.  I guess we
could give it a try and see what happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Garick Hamlin
Date:
On Wed, Oct 12, 2011 at 03:16:54PM +0100, Greg Stark wrote:
> On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think it's overkill, and possibly unpleasantly unstable as well.
> > For the initial attack on this we should just have VACUUM and ANALYZE
> > count the number of all-visible blocks and store that in pg_class along
> > with the tuple-count statistics.  There's no reason to think that this
> > will be any worse than the way we deal with dead tuple counts, for
> > instance.
> 
> So I have a theory.
> 
> Assuming you're in a steady-state situation the amount of all-visible
> blocks will fluctuate from a high just after vacuum to a low just
> before the next vacuum. There are other ways a block can be marked
> all-visible but for the most part I would expect the fraction to go
> steadily down until vacuum comes along and cleans things up.
> 
> So if vacuum tracked the fraction of blocks marked all-visible
> *before* it processed them and the fraction it marked all-visible
> after processing we have an upper and lower bound. If we knew how long
> it's been since vacuum we could interpolate between those, or we could
> just take the mean, or we could take the lower bound as a conservative
> estimate.
> 
> > What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction
> > of the table expected to be scanned gets smaller.
> 
> I think there's a statistically more rigorous way of accomplishing the
> same thing. If you treat the pages we estimate we're going to read as
> a random sample of the population of pages then your expected value is
> the fraction of the overall population that is all-visible but your
> 95th percentile confidence interval will be, uh, a simple formula we
> can compute but I don't recall off-hand.

Incidentally, I had a similar idea at PGCon relating to planning...

My idea was to compute not just the cost estimate but also the
sensitivity of that estimate for each plan.  The sensitivity is the
derivative of the cost; so, if the total cost was n^2, the
sensitivity would be 2n.

If you picked a tolerance (like 2 standard deviations) of the
observed distribution, you could compare the expected cost and the
expected 'unlucky cost' for plans.

(Basically, this is parametric error propagation) 

I know very little about the planner...
I don't know how often it would lead to picking a better
plan (it might not be worth the cost to compute), but it seemed
like an interesting approach to me.
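A minimal sketch of that parametric error propagation, assuming a row-count estimate with a known standard deviation (all names are hypothetical; nothing like this exists in the planner):

```python
def unlucky_cost(cost, dcost, n_est, sigma, k=2.0):
    """First-order error propagation for a plan cost: the cost at the
    row-count estimate plus k standard deviations of the estimate
    times the sensitivity (the derivative of cost w.r.t. the estimate).
    """
    return cost(n_est) + dcost(n_est) * k * sigma

# A plan with cost n^2 (sensitivity 2n) vs. one with cost 120n
# (sensitivity 120): the quadratic plan is cheaper at the estimate
# but degrades faster when the estimate is 2 sigma off.
quad_expected = 100 ** 2                                          # 10000
quad_unlucky = unlucky_cost(lambda n: n * n, lambda n: 2 * n, 100, 20)
lin_expected = 120 * 100                                          # 12000
lin_unlucky = unlucky_cost(lambda n: 120 * n, lambda n: 120.0, 100, 20)
```

A risk-averse planner comparing the unlucky costs would pick the linear plan here even though the quadratic plan has the lower expected cost — which is exactly the kind of decision the plain expected values cannot express.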

Garick



Re: COUNT(*) and index-only scans

From
Jeff Janes
Date:
On Mon, Oct 10, 2011 at 9:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> Jeff Janes  wrote:
>> Kevin Grittner  wrote:
>
>>> create table t (id int not null primary key);
>>> insert into t select generate_series(1, 1000000);
>>> vacuum freeze analyze;
>>> explain analyze select count(*) from t
>>> where id between 500000 and 500010;
>>>
>>> That gives you an index-only scan; but without the WHERE clause it
>>> uses a seq scan.
>>
>> If you convert the where clause to "where id is not null" it uses
>> the index only scan again, but only if you nudge it too with
>> enable_seqscan=off.

With a recent commit from (I assume) Tom, the "where id is not null"
is no longer needed.

> Clever way to get a full-table test.
>
> It turns out that for the above, with your trick to use the index
> only scan, it comes out 12% faster to do a seqscan, even when the
> table and index are fully cached (based on the average time of ten
> runs each way).  There's very little overlap, so the difference looks
> real.  But that's on a very narrow record, having just the one column
> used in the index.  I added one wide column like this:
>
> alter table t add column x text;
> update t set x = (repeat(random()::text, (random() * 100)::int));
> cluster t USING t_pkey;
> vacuum freeze analyze;
>
> With that change the index-only scan time remained unchanged, while
> the seqscan time grew to about 2.6 times the index only scan time.
> That was mildly surprising for me, considering it was all still
> cached.

I used the pgbench_accounts table from pgbench -i -s 50, where all
data fits in shared_buffers, using the -f switch with either

set enable_seqscan=off;
select count(*) from pgbench_accounts;

or

set enable_indexonlyscan=off;
select count(*) from pgbench_accounts;


With just a single client, it was a toss-up.  But with 8 concurrent
clients on a 8 CPU machine, the index-only scan was 50% faster.  So
that is a nice win, even if well-designed apps probably shouldn't be
endlessly counting rows of an unchanging table using all available
CPUs in the first place.

Cheers,

Jeff


Re: COUNT(*) and index-only scans

From
Thom Brown
Date:
On 12 October 2011 17:26, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 12, 2011 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The place where the decision is actually somewhat hard, IMO, is where
>> you're pulling a small part of the table but significantly more than one
>> row, and the traditional best choice would be a bitmap scan.  Now we
>> have to guess whether possibly avoiding heap fetches is better than
>> batching the fetches, and it doesn't seem open-and-shut to me.
>
> Yes, I agree.
>
> I was actually wondering if there is some way we could make index-only
> scans work for bitmap index scans.  Something like this: If the index
> is covering, then as we retrieve each tuple, we check whether the page
> is all-visible.  If so, we return the data from the index tuple.  If
> not, we save the TID for later.  Then, when we get done scanning the
> index, we go back and fetch all the pages containing saved TIDs in
> ascending block number order.  The trouble is that I think you could
> get in some trouble if you use a TID bitmap here, because if you
> lossify the bitmap then you have to make sure you can't return a tuple
> that you already handled with the index-only approach (imagine that
> the visibility map bit gets cleared partway through the scan).  All in
> all this seems pretty complicated...

So is there a chance of getting bitmap index-only scans?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Tom Lane
Date:
Thom Brown <thom@linux.com> writes:
> So is there a chance of getting bitmap index-only scans?

Don't hold your breath.  It seems like a huge increment of complexity
for probably very marginal gains.  The point of a bitmap scan (as
opposed to regular indexscan) is to reduce heap accesses by combining
visits to the same page; but it's not clear how useful that is if
you're not making heap accesses at all.

Robert's sketch of how this could work, full of don't-know-how-to-do-this
as it was, still managed to omit a whole lot of reasons why it wouldn't
work.  Notably the fact that the index AM API for bitmap scans is to
return a bitmap, not index-tuple data; and trying to do the latter would
break a lot of the performance advantages that exist now for bitmap
scans.
        regards, tom lane


Re: COUNT(*) and index-only scans

From
Thom Brown
Date:
On 19 November 2011 16:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thom@linux.com> writes:
>> So is there a chance of getting bitmap index-only scans?
>
> Don't hold your breath.  It seems like a huge increment of complexity
> for probably very marginal gains.  The point of a bitmap scan (as
> opposed to regular indexscan) is to reduce heap accesses by combining
> visits to the same page; but it's not clear how useful that is if
> you're not making heap accesses at all.

Well consider:

pgbench=# explain analyse select count(*) from pgbench_accounts where
aid between 143243 and 374825 or aid between 523242 and 712111;
                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=83016.38..83016.39 rows=1 width=0) (actual time=1039.282..1039.282 rows=1 loops=1)
   ->  Bitmap Heap Scan on pgbench_accounts  (cost=7934.54..81984.58 rows=412719 width=0) (actual time=243.012..977.946 rows=420453 loops=1)
         Recheck Cond: (((aid >= 143243) AND (aid <= 374825)) OR ((aid >= 523242) AND (aid <= 712111)))
         ->  BitmapOr  (cost=7934.54..7934.54 rows=423802 width=0) (actual time=228.934..228.934 rows=0 loops=1)
               ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4299.40 rows=235782 width=0) (actual time=134.410..134.410 rows=231583 loops=1)
                     Index Cond: ((aid >= 143243) AND (aid <= 374825))
               ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..3428.78 rows=188020 width=0) (actual time=94.520..94.520 rows=188870 loops=1)
                     Index Cond: ((aid >= 523242) AND (aid <= 712111))
 Total runtime: 1039.598 ms
(9 rows)

Since I can't get this to use an index-only scan, it will always visit
the heap.  Instead I'd be forced to write:

pgbench=# explain analyse select count(*) from (select aid from
pgbench_accounts where aid between 143243 and 374825 union all select
aid from pgbench_accounts where aid between 523242 and 712111) x;
                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=17263.72..17263.73 rows=1 width=0) (actual time=232.053..232.053 rows=1 loops=1)
   ->  Append  (cost=0.00..16204.22 rows=423802 width=0) (actual time=10.925..195.134 rows=420453 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..9015.04 rows=235782 width=0) (actual time=10.925..90.116 rows=231583 loops=1)
               ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..6657.22 rows=235782 width=4) (actual time=10.924..61.822 rows=231583 loops=1)
                     Index Cond: ((aid >= 143243) AND (aid <= 374825))
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..7189.18 rows=188020 width=0) (actual time=0.062..62.953 rows=188870 loops=1)
               ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..5308.98 rows=188020 width=4) (actual time=0.061..40.343 rows=188870 loops=1)
                     Index Cond: ((aid >= 523242) AND (aid <= 712111))
 Total runtime: 232.291 ms
(9 rows)

These two queries are equivalent only because the ranges being checked
don't overlap, so if arbitrary values were being substituted, and the
ranges did happen to overlap, that last method couldn't work.  I'd
have to use a plain UNION in that particular case, which adds a lot of
overhead due to de-duplication.

While I accept that maybe adapting the existing bitmap index scan
functionality isn't necessarily desirable, would it be feasible to
create a corresponding bitmap index-only scan method?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Sat, Nov 19, 2011 at 11:22 AM, Thom Brown <thom@linux.com> wrote:
> While I accept that maybe adapting the existing bitmap index scan
> functionality isn't necessarily desirable, would it be feasible to
> create a corresponding bitmap index-only scan method.

I've been thinking about this a bit more; I wonder whether we could
create yet another type of index scan - let's call it a Streaming
Index Only Scan.  A streaming index only scan uses a buffer, which
holds index tuples and the corresponding CTIDs, and it has some
maximum size, probably based on work_mem.  Tuples in the buffer are
kept sorted by CTID.  The scan happens in two alternating phases:
buffer fill, and buffer drain.

In buffer fill mode, we scan the index and add matching tuples and
their CTIDs to the buffer.  When the buffer is full or the index AM
reports that there are no more tuples in the scan, we switch to buffer
drain mode.

In buffer drain mode, we repeatedly select a heap block number and
attempt to return all buffered tuples on that page.  We maintain a
counter, LastBlockNumber, initially zero, which tracks the last heap
block number so selected.  To select the next block number, we choose
the first block number greater than or equal to LastBlockNumber
referenced by any CTID in the buffer (it should be possible to design
the buffer so that this value can be computed quickly); if there are
none, we instead choose the first block number referenced by any CTID
in the buffer, period.  Having selected the block number, we check
whether the page is all-visible.  If so, we can return all the index
tuples from that page without further ado.  Otherwise, we fetch the
heap block, check visibility for each tuple, and return only those
index tuples for which the corresponding heap tuples are visible to
the scan.  If there's now enough buffer space available to be certain
that the next index tuple will fit, we switch back to buffer fill
mode; otherwise, we remain in buffer drain mode.

As compared with a bitmap index scan, this doesn't have the advantage
of being able to combine multiple indexes effectively; I don't really
see any way to make that work with the index-only scan concept in
general, except perhaps for the special case of a zero-argument
aggregate.  It also doesn't have the advantage that each heap page
will be guaranteed to be visited only once.  But in practice duplicate
visits to the same page should be uncommon; they'll be avoided when
either work_mem is sufficient to buffer the whole scan, or when
there's some degree of table clustering with respect to the index.

While I'm building castles in the sky, a further idea would be to try
to optimize the case where there's a LIMIT node above the scan.  If we
could pass down a hint indicating how many rows are thought likely to
be needed, we could enter buffer drain mode after about that many
tuples, instead of waiting until the buffer was full.  If the hint is
right, we win (and if it's wrong, we can still go back and fetch some
more tuples, at a cost of possible performance loss).
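The block-selection rule in the buffer-drain phase can be sketched as follows — illustrative Python over (block, offset) CTIDs, not executor code; the real thing would interleave fill and drain rather than drain a fixed buffer:

```python
def drain_order(buffered_ctids, last_block=0):
    """Order in which a full buffer of TIDs would be drained.

    Repeatedly selects the first heap block number >= the last block
    drained that is referenced by any buffered TID; if none remains,
    wraps to the lowest block still in the buffer.  All TIDs on the
    selected block are returned together, so each heap block is
    visited at most once per drain pass (all-visible pages need no
    heap fetch at all).
    """
    remaining = sorted(buffered_ctids)
    out = []
    while remaining:
        ahead = [t for t in remaining if t[0] >= last_block]
        block = (ahead[0] if ahead else remaining[0])[0]
        # return every buffered tuple on this block, then drop them
        out.extend(t for t in remaining if t[0] == block)
        remaining = [t for t in remaining if t[0] != block]
        last_block = block
    return out
```

Within one pass the blocks come out in ascending order (with at most one wrap-around), which is what keeps the heap access pattern close to sequential.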

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: COUNT(*) and index-only scans

From
Greg Stark
Date:
On Mon, Nov 21, 2011 at 6:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> In buffer fill mode, we scan the index and add matching tuples and
> their CTIDs to the buffer.  When the buffer is full or the index AM
> reports that there are no more tuples in the scan, we switch to buffer
> drain mode.

Instead you could do the two phases concurrently the way tuplesort
implements the tapesort from Knuth. You keep a heap of ctids with an
epoch. You fill the heap then you return the first one. Whenever you
return one you read the next one and add it to the heap. If it falls
before the last returned value you insert it with the next epoch but
if it falls afterwards you can insert it into the heap in its correct
position.
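This replacement-selection scheme can be sketched concretely — a toy Python generator over comparable CTIDs, with assumed names, just to show the epoch mechanics:

```python
import heapq

def streaming_ctid_order(ctids, heap_size):
    """Replacement selection over CTIDs, as in Knuth's tapesort.

    Keeps a bounded heap of (epoch, ctid).  After emitting the heap
    minimum, the next incoming ctid is read immediately: if it sorts
    at or after the last emitted ctid it joins the current run
    (same epoch); otherwise it is tagged with the next epoch and held
    back until the current ascending run is exhausted.
    """
    it = iter(ctids)
    heap = []
    for _ in range(heap_size):
        try:
            heapq.heappush(heap, (0, next(it)))
        except StopIteration:
            break
    while heap:
        epoch, ctid = heapq.heappop(heap)
        yield ctid
        try:
            nxt = next(it)
        except StopIteration:
            continue
        heapq.heappush(heap, (epoch if nxt >= ctid else epoch + 1, nxt))
```

The output is ascending within each epoch, so the heap accesses stay mostly sequential without ever materializing the whole TID list — and with a heap at least as large as the input, the output is fully sorted.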




--
greg


Re: COUNT(*) and index-only scans

From
Robert Haas
Date:
On Wed, Dec 14, 2011 at 6:58 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Nov 21, 2011 at 6:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> In buffer fill mode, we scan the index and add matching tuples and
>> their CTIDs to the buffer.  When the buffer is full or the index AM
>> reports that there are no more tuples in the scan, we switch to buffer
>> drain mode.
>
> Instead you could do the two phases concurrently the way tuplesort
> implements the tapesort from Knuth. You keep a heap of ctids with an
> epoch. You fill the heap then you return the first one. Whenever you
> return one you read the next one and add it to the heap. If it falls
> before the last returned value you insert it with the next epoch but
> if it falls afterwards you can insert it into the heap in its correct
> position.

Yeah, that's pretty much what I was imagining, although my explanation
of it was slightly muddled.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company