Thread: the big picture for index-only scans
So, what do we need in order to find our way to index-only scans?

1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache. However, before we can rely on the visibility map for this purpose, we need to fix the problems that can leave bits set inappropriately in the face of an inconveniently-timed crash. I've been working on a patch for this on and off for a few months now; my latest version is in need of review[1].

2. Crash-safe visibility map vs. pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is; (2B) keep a piece of state somewhere indicating, for each relation, whether or not the visibility map can be trusted, set it to false only if pg_upgrade brings the relation over from an older version, and set it to true after a successful vacuum that skips no intervening pages; or (2C) advise the user to do a VACUUM FULL on each of their tables pre-upgrade, and if they don't, treat wrong answers as their own fault. (I doubt anyone will advocate for this option, but for the sake of completeness...)
(2A) seems like the simplest solution, especially because it also avoids the overhead of checking the "is the visibility map bit reliable?" flag every time we want to plan a query.

3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set. I believe it should be fairly straightforward to have ANALYZE collect this information, and I'm inclined to do that as a separate patch. It seems like it would also be nice to know what fraction of tuples are on pages that don't have the visibility map bit set but where, in fact, all tuples on the page are visible to all transactions, so it would be legal to set the bit. A large discrepancy between these two percentages might be a good reason to auto-vacuum the table (perhaps using a "really lazy vacuum"[2]). I'm not sure whether this can be added cheaply, though, and in any case, any change to the set of criteria that will trigger an auto-vacuum is probably a can of worms. Thoughts?

4. Planner and executor changes. In contrast to Heikki's original implementation, I'm inclined not to try to split the Index Scan node into an index scan and a heap fetch. Since there are many choices for where to put the heap fetch node (any level of the join tree between the index scan and the root), this seems likely to result in a combinatorial explosion of paths[3], and I'm not really sure that the payback would be adequate. Furthermore, the idea of allowing user code to see tuples that will only later be determined not to have been visible to that MVCC snapshot in the first place seems pretty scary from a security perspective, though certainly there are possible benefits[4]. Instead, I'm inclined to just have the planner evaluate whether the necessary columns can be extracted purely from the index.
If not, we proceed as now. If so, we can use the "index only" approach of using the visibility map to decide which heap fetches can be skipped. It's not clear to me whether we need to compare the cost of the standard approach with the cost of the "index only" approach: in theory, if there aren't any visibility map bits set anyway, the "index only" approach could be slower. But I'm not sure whether that problem is significant or common enough to be worth expending a lot of code on. Either way, the number of actual paths doesn't need to increase, because in this design, even if we apply a costing model, one approach will dominate the other. Heikki also suggested considering index scans in cases where we don't now[4, again], but I'm inclined to leave this, too, for a later optimization, again because balancing the increase in paths against the possible performance benefits of using indexes in more situations seems finicky. In short, for a first cut at this, I just want to look at this as a way to get cheaper index scans, and leave everything else to future work.

Any thoughts welcome. Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://archives.postgresql.org/pgsql-hackers/2011-05/msg00292.php
[2] http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php
[3] http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php
[4] http://archives.postgresql.org/pgsql-hackers/2009-07/msg00675.php
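[Editorial sketch] The planner-side test described in point 4 — consider an index-only path only when every column the query needs can be extracted purely from the index — can be put in a few lines. This is an illustrative Python sketch, not PostgreSQL code; all names are made up for the example:

```python
# Hypothetical sketch of the planner decision in point 4.
# Names (can_use_index_only, choose_scan) are illustrative, not PostgreSQL's.

def can_use_index_only(query_columns, index_columns):
    """True iff every column the query needs is stored in the index."""
    return set(query_columns) <= set(index_columns)

def choose_scan(query_columns, index_columns):
    if not can_use_index_only(query_columns, index_columns):
        return "index scan"       # proceed as now: heap fetch for every tuple
    # No new path is needed: with the columns covered, the "index only"
    # approach dominates, using the VM to skip heap fetches where possible.
    return "index-only scan"
```

Under this design the path count stays the same, which is the point Robert makes above: one approach dominates the other, so the planner only switches strategies rather than generating extra paths.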
On Mon, May 9, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, what do we need in order to find our way to index-only scans?
>
> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache.

hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested in whether you think any hint bit I/O mitigation strategies are worth pursuing.

> 2. Crash safe visibility map vs. pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is;

+1 on 2A.

> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set.

It would be helpful to know the performance benefit of index-only scans before knowing how much benefit to attribute here. Maybe a system-wide kludge would work for starters anyway, like assuming 60% of pages can be visibility-checked from the VM, or a single GUC. Then again, maybe not.

merlin
On Mon, May 9, 2011 at 10:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache.
>
> hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested if you think any hint bit i/o mitigation strategies are worth pursuing.

Well, I don't really want to let this thread on my project get hijacked to talk about your project (not that I haven't been guilty of that myself!) but, in brief, I think the main effect of index-only scans is that the performance difference between a vacuumed table and an unvacuumed table is going to increase. It's already the case that sequentially scanning a table which has been vacuumed (and, therefore, all the pages are marked all-visible) is noticeably faster than sequentially scanning a table which is not vacuumed (even if all the hint bits are set). Index-only scans are going to extend that by making index scans run faster on a table with lots of all-visible pages than on one where no pages are all-visible. So the importance of vacuuming an insert-only table occasionally (which autovacuum won't do, at present, until it's needed to prevent XID wraparound) is already more than zero, and it's going to go up. But the all-visible bits aren't quite the same as hint bits: I don't think there's any impact on hint bits per se.

>> 2. Crash safe visibility map vs.
>> pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is;
>
> +1 on 2A.

OK. Anybody else?

>> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set.
>
> It would be helpful to know the performance benefit of index-only scans before knowing how much benefit to attribute here. Maybe a system-wide kludge would work for starters anyway, like assuming 60% of pages can be visibility-checked from the VM, or a single GUC. Then again, maybe not.

Yeah, maybe I should try to beat the main patch into some kind of shape before working too much on the statistics stuff. Then we could actually benchmark it a bit, which would be good. I don't think that a system-wide kludge or GUC is going to work for prime time, but it's probably fine for initial performance testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
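[Editorial sketch] To make the visibility-map fast path under discussion concrete: an index-only scan consults the VM bit for the tuple's page and skips the heap fetch when the bit says the whole page is all-visible. This is a toy Python illustration with made-up names; it is not how the PostgreSQL code is structured:

```python
# Toy model of the index-only-scan fast path. ALL_VISIBLE stands in for
# the visibility map; all function names here are illustrative assumptions.

ALL_VISIBLE = set()  # page numbers whose visibility-map bit is set

def vm_set(page_no):
    """Set the VM bit for a page (e.g. after vacuum finds it all-visible)."""
    ALL_VISIBLE.add(page_no)

def index_only_fetch(tid, index_tuple, heap_fetch, snapshot_check):
    """tid = (page_no, offset). If the page's VM bit is set, every tuple on
    it is visible to every transaction, so the index tuple alone answers
    the query and the heap fetch is skipped entirely."""
    page_no, _offset = tid
    if page_no in ALL_VISIBLE:
        return index_tuple                  # no heap I/O at all
    heap_tuple = heap_fetch(tid)            # fall back to the heap
    return heap_tuple if snapshot_check(heap_tuple) else None
```

The whole benefit of the feature lives in the first branch; the crash-safety work in point 1 exists so that branch can be trusted.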
On Tue, May 10, 2011 at 8:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 9, 2011 at 10:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested if you think any hint bit i/o mitigation strategies are worth pursuing.
>
> Well, I don't really want to let this thread on my project get hijacked to talk about your project (not that I haven't been guilty of that myself!)

No, that wasn't my intent at all, except in the sense of wondering if a crash-safe visibility map provides a route to displacing a lot of hint bit I/O and, by extension, making alternative approaches to doing that, including mine, a lot less useful. That's a good thing.

Meaning: since the VM approach is going to be a fairly large win over the classic approach to checking visibility in so many scenarios, maybe the real long-term goal should be just being as aggressive as possible in terms of making sure it's set properly, and just give up and be a bit more brute-forcey when it's not set. It's a fair question. That's a pretty broad statement, but that's what I'm thinking about.

merlin
2011/5/10 Robert Haas <robertmhaas@gmail.com>:
> So, what do we need in order to find our way to index-only scans?
>
> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set. I believe it should be fairly straightforward to have ANALYZE collect this information; and I'm inclined to do that as a separate patch. It seems like it would also be nice to know what fraction of tuples are on pages that don't have the visibility map set but where, in fact, all tuples on the page are visible to all transactions, so it would be legal to set the bit. A large discrepancy between these two percentages might be a good reason to auto-vacuum the table (perhaps using a "really lazy vacuum"[2]). I'm not sure if this can be added cheaply, though, and in any case, any change to the set of criteria that will trigger an auto-vacuum is probably a can of worms. Thoughts?

ANALYZE can do the stats job for 'free' on the pages it collects anyway, so that looks like a good idea. I believe the really lazy vacuum is another topic; even if it will improve the performance of index-only scans to have tables already vacuumed, the stats should expose that, and the cost_index(_only?)() function should take care of it.

> 4. Planner and executor changes. In contrast to Heikki's original implementation, I'm inclined to not to try to split the Index Scan node into index scan and heap fetch. Since there are many choices for where to put the heap fetch node (any level of the join tree between the index scan and the root), this seems likely to result in a combinatorial explosion of paths[3], and I'm not real sure that the payback will be adequate.
> Furthermore, the idea of allowing user code to see tuples that will only later be determined not to have been visible to that MVCC snapshot in the first place seems pretty scary from a security perspective, though certainly there are possible benefits[4]. Instead, I'm inclined to just have the planner evaluate whether the necessary columns can be extracted purely from the index.

It is tempting to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (This is what I understood from Heikki's suggestions 3-4, and it makes sense.) It may be easier to implement both at once, but I didn't find the branch in Heikki's git repos (probably removed some time ago).

> If not, we proceed as now. If so, we can use the "index only" approach of using the visibility map to decide which heap fetches can be skipped. It's not clear to me whether we need to compare the cost of the standard approach with the cost of the "index only" approach: in theory, if there aren't any visibility map bits anyway, the "index only" approach could be slower. But I'm not sure whether that problem is significant or common enough to be worth expending a lot of code on. Either way, the number of actual paths doesn't need to increase, because in this design, even if we apply a costing model, one approach will dominate the other. Heikki also suggested considering index scans in cases where we don't now[4, again] but I'm inclined to leave this, too, for a later optimization, again because balancing the increase in paths against the possible performance benefits of using indexes in more situations seems finicky. In short, for a first cut at this, I just want to look at this as a way to get cheaper index scans, and leave everything else to future work.
Based on ANALYZE stats for the visibility, I believe cost_index and cost_index_only should be very similar functions (well, atm, I don't see the point of splitting it into two functions).

> Any thoughts welcome. Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.

Index-only scans are welcome! I believe I can help on 3 and 4, but (really) not sure for 1 and 2.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
On Tue, May 10, 2011 at 10:58 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> ANALYZE can do the stats job for 'free' on the pages it collects anyway. So that looks like a good idea. I believe the really lazy vacuum is another topic; even if it will improve the performance of the index only scan to have tables already vacuumed, the stats should expose that and the function cost_index(_only?)() taking care of that.

I basically agree. The connection is that, as we use the all-visible bits for more things, the performance penalty for failing to vacuum (say) an insert-only table will continue to grow. Still, as you say, clearly a separate topic.

> The temptation is high to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (this is what I understood from heikki suggestion 3-4. and it makes sense). It may be easier to implement both at once but I didn't find the branch in the Heikki's git repos. (probably removed since the long time)

I was thinking about this as well, at least if I understand you correctly. That would be similar to a bitmap index scan, and I think it would be a great thing to have, not only because it would allow us to get the advantages of index-only scans in situations that are well-suited to our current bitmap scans, but also because it could be batched. You could allocate a buffer of work_mem bytes and fill it up with TIDs; then, when it's full, you sort the buffer and start doing the necessary heap fetches in physical order. If you still need more rows, you can clear the buffer and go around for another pass.

> Based on ANALYZE stats for the visibility, I believe cost_index and cost_index_only should be very similar functions (well, atm, I don't see the point to split it in 2 functions).

Yeah, I would more imagine modifying the existing function.

>> Any thoughts welcome.
>> Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.
>
> Indexonly scans are welcome!
> I believe I can help on 3 and 4, but (really) not sure for 1 and 2.

Well, I have code for #1, and just need reviews, and #2 shouldn't be that hard, and with luck I'll twist Bruce's arm into doing it (*waves to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts on what/how you'd like to contribute there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
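[Editorial sketch] The batched-fetch idea Robert describes upthread — fill a work_mem-sized buffer with TIDs from the index, sort it, do the heap fetches in physical order, then go around for another pass — can be sketched as follows. This is an illustrative Python sketch; the names and the tiny buffer size are assumptions for the example:

```python
# Sketch of batched, physically-ordered heap fetches for an index scan.
# A TID is modeled as (page_no, offset); sorting TIDs therefore sorts
# fetches into physical page order, as with a bitmap heap scan.

def batched_heap_fetch(tid_iter, fetch_tuple, buffer_size=4):
    """Drain tid_iter, buffering at most buffer_size TIDs (the stand-in
    for work_mem); each time the buffer fills, sort it and fetch in
    physical order, then clear it and continue with the next pass."""
    buf, out = [], []

    def flush():
        for tid in sorted(buf):          # physical (page, offset) order
            out.append(fetch_tuple(tid))
        buf.clear()

    for tid in tid_iter:
        buf.append(tid)
        if len(buf) >= buffer_size:      # "work_mem" buffer is full
            flush()
    flush()                              # final, partial pass
    return out
```

One consequence visible even in the toy version: results come out in per-batch physical order, not index order, which is exactly the trade-off a bitmap-style scan makes against an ordinary index scan.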
2011/5/10 Robert Haas <robertmhaas@gmail.com>:
> On Tue, May 10, 2011 at 10:58 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> [...]
> Well, I have code for #1, and just need reviews, and #2 shouldn't be that hard, and with luck I'll twist Bruce's arm into doing it (*waves to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts on what/how you'd like to contribute there?

I can provide initial patches for cost and analyze, at least.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
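[Editorial sketch] The two statistics point 3 asks for — the fraction of tuples on pages whose VM bit is set, and the fraction on pages where the bit is clear but could legally be set — might be derived from an ANALYZE-style page sample roughly like this. Hypothetical sketch only; ANALYZE does not actually collect this today, and the page representation is invented for the example:

```python
# Each sampled page is modeled as a tuple:
#   (vm_bit_set, n_tuples, all_tuples_actually_visible)
# The third field is the ground truth a sampling pass could check cheaply
# while it already has the page in hand.

def visibility_fractions(sampled_pages):
    """Return (fraction of tuples on VM-set pages,
               fraction on pages where the bit is clear but settable)."""
    total = vm_hit = settable = 0
    for vm_bit, ntup, actually_all_visible in sampled_pages:
        total += ntup
        if vm_bit:
            vm_hit += ntup
        elif actually_all_visible:
            settable += ntup
    if total == 0:
        return 0.0, 0.0
    return vm_hit / total, settable / total
```

The first number feeds the cost model; a large second number is the "discrepancy" Robert mentions, i.e. a sign the table would repay a (really lazy) vacuum.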
On Tue, May 10, 2011 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, what do we need in order to find our way to index-only scans?
>
> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map.

This topic has been discussed many times, yet I have never seen an assessment that explains WHY we would want to do index-only scans.

This will be a complex addition to the codebase and one that could introduce bugs into MVCC. It seems reasonable to look at what the benefit of this would be, and what the use-case/benefit profile is, before we spend a long time adding this optimization. I asked for this previously on earlier threads also.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> The temptation is high to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (this is what I understood from heikki suggestion 3-4. and it makes sense). It may be easier to implement both at once but I didn't find the branch in the Heikki's git repos. (probably removed since the long time)
>
> I was thinking about this as well, at least if I understand you

yes.

> correctly. That would be similar to a bitmap index scan, and I think it would be a great thing to have, not only because it would allow us to get the advantages of index-only scans in situations that are well-suited to our current bitmap scans, but also because it could be batched. You could allocate a buffer of work_mem bytes and fill it up with TIDs; then, when it's full, you sort the buffer and start doing the necessary heap fetches in physical order. If you still need more rows, you can clear the buffer and go around for another pass.

The remaining issue here is that we don't have a 'safe' index-only scan, just an index scan with a probability on the 'only'.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
Simon Riggs <simon@2ndQuadrant.com> wrote:
> This topic has been discussed many times, yet I have never seen an assessment that explains WHY we would want to do index-only scans.

In databases with this feature, it's not too unusual for a query which uses just an index to run one or more orders of magnitude faster than a query which has to randomly access the heap for each index entry. That seems like enough evidence of its possible value in PostgreSQL to proceed to the point where benchmarks become possible. I'm assuming that, like all other features added as performance optimizations, it won't be committed until there are benchmarks showing the net benefit.

As a thought experiment, picture the relative costs of scanning a portion of an index in index sequence, and being done, versus scanning a portion of an index in index sequence and jumping to a random heap access for each index entry as you go.

-Kevin
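[Editorial sketch] Kevin's thought experiment can be put into back-of-envelope numbers. The cost units below are assumptions loosely inspired by PostgreSQL's random_page_cost, not measurements, and the table sizes are invented for illustration:

```python
# Back-of-envelope model of the thought experiment: an index scan that is
# "done" after reading the index, versus one that adds a random heap
# fetch per matching index entry. Cost units are assumed, not measured.

def plain_index_scan_cost(index_pages, matching_tuples, random_page_cost=4.0):
    # index pages read in sequence, plus one random heap fetch per match
    return index_pages + matching_tuples * random_page_cost

def index_only_scan_cost(index_pages, matching_tuples,
                         vm_bit_fraction, random_page_cost=4.0):
    # heap fetches remain only for tuples on pages whose VM bit is clear
    heap_fetches = matching_tuples * (1.0 - vm_bit_fraction)
    return index_pages + heap_fetches * random_page_cost
```

With, say, 100 index pages, 20,000 matching tuples, and 75% of heap pages all-visible, the estimate drops from 80,100 to 20,100 cost units, and it keeps falling as the all-visible fraction rises — which is the intuition behind both the "orders of magnitude" claim and the need for the visibility statistics in point 3.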
On Tue, May 10, 2011 at 11:27 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> I can provide initial patchs for cost and analyze, at least.

OK, cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> This topic has been discussed many times, yet I have never seen an >> assessment that explains WHY we would want to do index-only scans. > In databases with this feature, it's not too unusual for a query > which uses just an index to run one or more orders of magnitude > faster than a query which has to randomly access the heap for each > index entry. That seems like enough evidence of its possible value > in PostgreSQL to proceed to the point where benchmarks become > possible. I'm assuming that, like all other features added as > performance optimizations, it won't be committed until there are > benchmarks showing the net benefit. > As a thought experiment, picture the relative costs of scanning a > portion of an index in index sequence, and being done, versus > scanning a portion of an index in index sequence and jumping to a > random heap access for each index entry as you go. It's already the case that we'll flip over to a bitmap indexscan, and thus get rid of most/all of the "random" page accesses, in situations where this is likely to be a big win. Pointing to the performance difference in databases that don't do that is therefore not too convincing. I'm inclined to agree that index-only scans might be worth the amount of work that's involved ... but I share Simon's desire to see some proof before anything gets committed. regards, tom lane
On Tue, May 10, 2011 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's already the case that we'll flip over to a bitmap indexscan, and thus get rid of most/all of the "random" page accesses, in situations where this is likely to be a big win. Pointing to the performance difference in databases that don't do that is therefore not too convincing.
>
> I'm inclined to agree that index-only scans might be worth the amount of work that's involved ... but I share Simon's desire to see some proof before anything gets committed.

Well, we're not in the habit of committing performance patches without performance numbers, so I doubt we'll reverse that trend now, and certainly I had no intention of doing so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > >> This topic has been discussed many times, yet I have never seen an >> assessment that explains WHY we would want to do index-only scans. > > In databases with this feature, it's not too unusual for a query > which uses just an index to run one or more orders of magnitude > faster than a query which has to randomly access the heap for each > index entry. That seems like enough evidence of its possible value > in PostgreSQL to proceed to the point where benchmarks become > possible. I'm assuming that, like all other features added as > performance optimizations, it won't be committed until there are > benchmarks showing the net benefit. > > As a thought experiment, picture the relative costs of scanning a > portion of an index in index sequence, and being done, versus > scanning a portion of an index in index sequence and jumping to a > random heap access for each index entry as you go. I can picture that. Regrettably, I can also picture the accesses to the visibility map, the maintenance operations on the VM that are needed for this and the contention that both of those will cause. ISTM quite likely that we'll slow down writes to some extent in order to improve this use case. So I'm interested in knowing how broad the use case is and what the overheads are, rather than have an "aw crap!" moment in the future where we finish the code and only then realise its benefit footprint is not useful. Best to start out with a clear benefit analysis other than "other DBMS do it". For example, will this be an index-specific tuning option (manual/automatic), per table or an always-on feature? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Simon Riggs <simon@2ndQuadrant.com> wrote: >>> This topic has been discussed many times, yet I have never seen >>> an assessment that explains WHY we would want to do index-only >>> scans. > >> In databases with this feature, it's not too unusual for a query >> which uses just an index to run one or more orders of magnitude >> faster than a query which has to randomly access the heap for >> each index entry. That seems like enough evidence of its >> possible value in PostgreSQL to proceed to the point where >> benchmarks become possible. I'm assuming that, like all other >> features added as performance optimizations, it won't be >> committed until there are benchmarks showing the net benefit. > >> As a thought experiment, picture the relative costs of scanning a >> portion of an index in index sequence, and being done, versus >> scanning a portion of an index in index sequence and jumping to a >> random heap access for each index entry as you go. > > It's already the case that we'll flip over to a bitmap indexscan, > and thus get rid of most/all of the "random" page accesses, in > situations where this is likely to be a big win. Pointing to the > performance difference in databases that don't do that is > therefore not too convincing. Sure. Of course, if you're only accessing twenty thousand rows from a table containing fifty million rows, bitmap index scans could come out pretty close to random access times; but on the whole I agree that the scale of benefit in PostgreSQL won't tend to match what people see in other products. Note that my words were "enough evidence of its possible value in PostgreSQL to proceed to the point where benchmarks become possible." 
In particular, we might want to somehow try to make clear to people that the very wide indexes they are accustomed to creating to allow this optimization in other products might be inefficient compared to creating several one-column indexes which would enable bitmap logical operations. > I'm inclined to agree that index-only scans might be worth the > amount of work that's involved So we agree there. > ... but I share Simon's desire to see some proof before anything > gets committed. And we agree there. In fact, I can't think of anyone in the community who doesn't want to see that for *any* purported performance enhancement. My overall gut feel is that there will be some circumstances where the "covering index" optimization is much faster, and some where people expect it to be, but it isn't. The trickiest part of this might be developing a costing model which allows us to make the right choice most of the time. And even if we get it perfect, we can expect questions about why the covering index wasn't used, and requests for hints so they can force it. :-( -Kevin
On Tue, May 10, 2011 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's already the case that we'll flip over to a bitmap indexscan, > and thus get rid of most/all of the "random" page accesses, in > situations where this is likely to be a big win. Pointing to the > performance difference in databases that don't do that is therefore > not too convincing. The other major effect is row size. Many databases have very wide rows, perhaps on the order of 1kB. So the table with a million rows might be 8GB but the index on a few key columns might only be a few megabytes. Even if you have to read the entire index in random order it'll likely all be cached and scan faster than the table itself. One problem with hanging on benchmarks is that database schema design can actually change based on what performs well. People get in the habit of creating indexes in Oracle that are only logical when you realize they allow the database to do an index-only scan because they contain extra columns that aren't actually used in where clauses but are typically in the select list. -- greg
On Tue, May 10, 2011 at 6:25 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> ... but I share Simon's desire to see some proof before anything >> gets committed. > > And we agree there. In fact, I can't think of anyone in the > community who doesn't want to see that for *any* purported > performance enhancement. I'm not talking about eventual commit, I'm talking about the whole process of development. We should be focusing on improving a measurable performance issue, not on implementing one exact design that someone thought might help. How will we review the patch except by measuring it against the declared performance goal? Otherwise all the various options along the way will just be matters of opinion, instead of measurement. From what has been said so far, the use case for this is related to the practice of using "covered indexes", which makes me nervous because that is an expert-level tuning task on other DBMS, limiting the range of people who get benefit. The typical speed up for non-covered indexes will come when we access a very large table (not in cache) via an index scan that is smaller than a bitmap index scan. Will we be able to gauge selectivities sufficiently accurately to be able to pinpoint that during optimization? How will we know that the table is not in cache? Or is this an optimisation in the executor for a bitmap heap scan? I'm not being negative, I'm trying to avoid arguments, blind alleys and much wasted development if we focus on the wrong things or go to design too early. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > >>> ... but I share Simon's desire to see some proof before anything >>> gets committed. >> >> And we agree there. In fact, I can't think of anyone in the >> community who doesn't want to see that for *any* purported >> performance enhancement. > > I'm not talking about eventual commit, I'm talking about the whole > process of development. I'm confused -- you want to see proof that the concept works well in PostgreSQL before development effort on it begins? Or there is some alternative you would like to see pursued instead? Something else? > From what has been said so far, the use case for this is related > to the practice of using "covered indexes", which makes me nervous > because that is an expert-level tuning task on other DBMS What? On the versions of MS SQL Server and Sybase ASE I've used it costs covered index plans against all the other plans automatically, and picks this type of plan if the cost looks lower. Sure, DBAs sometimes add indexes, or add columns to indexes, in hopes that such a plan will be chosen -- but what's new and different there? > The typical speed up for non-covered indexes will come when we > access a very large table (not in cache) via an index scan that is > smaller than a bitmap index scan. Will we be able to gauge > selectivities sufficiently accurately to be able to pinpoint that > during optimization? How will we know that the table is not in > cache? Or is this an optimisation in the executor for a bitmap heap > scan? I would continue to object to using current cache contents for plan choice because of plan instability and the fact that an odd initial cache load could skew plans in a bad direction indefinitely. I do agree (and have already posted) that I think the hardest part of this might be developing a good cost model.
I doubt that's an insoluble problem, especially since it is something we can refine over time as we gain experience with the edge cases. -Kevin
On Tue, May 10, 2011 at 8:35 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> >>>> ... but I share Simon's desire to see some proof before anything >>>> gets committed. >>> >>> And we agree there. In fact, I can't think of anyone in the >>> community who doesn't want to see that for *any* purported >>> performance enhancement. >> >> I'm not talking about eventual commit, I'm talking about the whole >> process of development. > > I'm confused -- you want to see proof that the concept works well in > PostgreSQL before development effort on it begins? Or there is some > alternative you would like to see pursued instead? Something else? Well, I didn't ask for that and agree it would be foolish to demand proof ahead of development. I know this technique is effective in other DBMS, I just want to be certain it will be effective for us before too much work is done. We have the additional requirement for a crash-safe visibility map that needs to be consulted, with possible contention effects. Sybase etc. can simply avoid the logical I/O, which is always a win, in or out of cache. So the problem is quite different for us. What I suggested was an assessment and benefit case because we normally start with a problem and then work out how to solve it. Normally, others come forward with the why? when? questions and it feels like there's a bit of groupthink going on here. This looks to me like its being approached like it was a feature, but it looks to me like a possible optimisation, so suggest we treat it that way. Out of concern, I don't want you to waste time on work that *may* not be that useful in practice, and I don't want to miss improvements or alternatives either. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > Normally, others come forward with the why? when? questions and it > feels like there's a bit of groupthink going on here. This looks > to me like its being approached like it was a feature, but it > looks to me like a possible optimisation, so suggest we treat it > that way. This issue has come up a great many times over the years, and there has been much discussion around it. The Wiki page is here: http://wiki.postgresql.org/wiki/Index-only_scans This currently references 11 threads on the topic. I'd bet that by spending a couple hours at it I could quadruple that number of threads. (I'd really rather not, though.) The problem is that there are regular and fairly frequent complaints on the list about queries which run slower than people expect because the heap must be checked for visibility information when matching index entries are found. It has become enough of a conspicuous issue that a lot of people are interested in seeing if something can be done about it. After much discussion, people are trying to advance a plan to find an answer. I'm sure nobody involved would ignore any suggestion on how it might be done better; but at this point, I don't think it's fair to suggest that this is not being pursued in response to a real problem, or that no serious thought has been given to direction before people started moving. -Kevin
On Wed, May 11, 2011 at 12:14 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > The problem is that there are regular and fairly frequent complaints > on the list about queries which run slower than people expect > To be fair about 3/4 of them were actually complaining about the lack of some global materialized cache of the aggregate value. Covering index-only scans are only going to be a linear speedup; no matter how large the factor, it's not going to turn select count(*) into an O(1) operation. I support the idea of thinking of this as an optimization. But I don't think there's much question. If we can avoid doing the i/o on the heap that's an obvious and huge win. Sure the costs of maintaining the vm need to be measured against the gains but we don't know what those costs are yet and whoever works on it will be well aware of that balance. On a separate note though, Simon, I don't know what you mean by "we normally start with a problem". It's a free software project and people are free to work on whatever interests them whether that's because it solves a problem they have, helps a client who's paying them, or just because it's of academic interest to them. We don't always take their patches if they aren't of general interest but people propose all kinds of crazy experimental ideas all the time. -- greg
Robert Haas wrote: > >> Any thoughts welcome. Incidentally, if anyone else feels like working > >> on this, feel free to let me know and I'm happy to step away, from all > >> of it or from whatever part someone else wants to tackle. I'm mostly > >> working on this because it's something that I think we really need to > >> get done, more than having a burning desire to be the one who does it. > > > > Index-only scans are welcome! > > I believe I can help on 3 and 4, but (really) not sure for 1 and 2. > > Well, I have code for #1, and just need reviews, and #2 shouldn't be > that hard, and with luck I'll twist Bruce's arm into doing it (*waves > to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts > on what/how you'd like to contribute there? I am happy to have pg_upgrade skip upgrading visibility map files --- it already has code to conditionally process them because they only exist in >= 8.4: /* fsm/vm files added in PG 8.4 */ if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) { /* * Copy/link any fsm and vm files, if they exist */ Just give the word and it will be done. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark wrote: > On a separate note though, Simon, I don't know what you mean by "we > normally start with a problem". It's a free software project and > people are free to work on whatever interests them whether that's > because it solves a problem they have, helps a client who's paying > them, or just because it's of academic interest to them. We don't > always take their patches if they aren't of general interest but > people propose all kinds of crazy experimental ideas all the time. I am confused by Simon's questions too. Simon seems to regularly argue for adding features late in the development cycle and backpatch things no one else thinks should be backpatched, but he wants more research that index-only scans are going to improve things before it is implemented? The first is aggressive development, the second is very conservative development --- they don't match, so I now wonder what the motivation is since it isn't consistent. Isn't speeding up COUNT(*) a sufficient case because it will not have to touch the heap in many cases? No one is going to apply this patch until we fully understand the performance implications, just like every other patch. No one has suggested otherwise. It is helpful to have people critically review all our work, but disagreeing just for the sake of causing discussion isn't helpful, and I have seen a lot of these discussions lately. I am sensing a pattern. :-( -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > Isn't speeding up COUNT(*) a sufficient case because it will not have to > touch the heap in many cases? Putting aside the politics questions, count(*) is an interesting case -- it exposes some of the unanswered questions about index-only scans. The reason "select count(*)" might win would be because we could pick any index and do an index scan, relying on the visibility map to optimize away the heap reads. This is only going to be a win if a large fraction of the heap reads get optimized away. It's going to be pretty tricky to determine in the optimizer a) which index will be cheapest and b) what fraction of index tuples will point to pages where the heap reference can be optimized away. The penalty for guessing wrong if we use an index-only scan and it turns out to have many pages that aren't all-visible would be pretty high. -- greg
Greg Stark wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Isn't speeding up COUNT(*) a sufficient case because it will not have to > > touch the heap in many cases? > > Putting aside the politics questions, count(*) is an interesting case > -- it exposes some of the unanswered questions about index-only scans. > > The reason "select count(*)" might win would be because we could pick > any index and do an index scan, relying on the visibility map to > optimize away the heap reads. This is only going to be a win if a > large fraction of the heap reads get optimized away. > > It's going to be pretty tricky to determine in the optimizer a) which > index will be cheapest and b) what fraction of index tuples will point > to pages where the heap reference can be optimized away. The penalty > for guessing wrong if we use an index-only scan and it turns out to > have many pages that aren't all-visible would be pretty high. Yes, that is the tricky optimizer/analyze part. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > So, what do we need in order to find our way to index-only scans? > > 1. The visibility map needs to be crash-safe. The basic idea of > index-only scans is that, instead of checking the heap to find out > whether each tuple is visible, we first check the visibility map. If > the visibility map bit is set, then we know all tuples on the page are > visible to all transactions, and therefore the tuple of interest is > visible to our transaction. Assuming that a significant number of > visibility map bits are set, this should enable us to avoid a fair > amount of I/O, especially on large tables, because the visibility map > is roughly 8000 times smaller than the heap, and therefore far more > practical to keep in cache. However, before we can rely on the FYI, because the visibility map is only one _bit_ per page, it is 8000 * 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of heap pages. This is important because we rely on this compactness in hope that the WAL logging of this information will not be burdensome. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Isn't speeding up COUNT(*) a sufficient case because it will not have to > > touch the heap in many cases? > > Putting aside the politics questions, count(*) is an interesting case > -- it exposes some of the unanswered questions about index-only scans. > > The reason "select count(*)" might win would be because we could pick > any index and do an index scan, relying on the visibility map to > optimize away the heap reads. This is only going to be a win if a > large fraction of the heap reads get optimized away. > > It's going to be pretty tricky to determine in the optimizer a) which > index will be cheapest and b) what fraction of index tuples will point I assume the smallest non-partial index would be the cheapest index. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > Greg Stark wrote: >> Putting aside the politics questions, count(*) is an interesting case >> -- it exposes some of the unanswered questions about index-only scans. >> >> The reason "select count(*)" might win would be because we could pick >> any index and do an index scan, relying on the visibility map to >> optimize away the heap reads. This is only going to be a win if a >> large fraction of the heap reads get optimized away. >> >> It's going to be pretty tricky to determine in the optimizer a) which >> index will be cheapest and b) what fraction of index tuples will point > I assume the smallest non-partial index would be the cheapest index. That will be true only if you intentionally ignore the points Greg raised. If the table isn't entirely ALL_VISIBLE, then the choice of index will determine the ordering of the actual table probes that occur. There could be more or fewer page reads, in a more or less optimal order, depending on the index used. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Greg Stark wrote: > >> Putting aside the politics questions, count(*) is an interesting case > >> -- it exposes some of the unanswered questions about index-only scans. > >> > >> The reason "select count(*)" might win would be because we could pick > >> any index and do an index scan, relying on the visibility map to > >> optimize away the heap reads. This is only going to be a win if a > >> large fraction of the heap reads get optimized away. > >> > >> It's going to be pretty tricky to determine in the optimizer a) which > >> index will be cheapest and b) what fraction of index tuples will point > > > I assume the smallest non-partial index would be the cheapest index. > > That will be true only if you intentionally ignore the points Greg > raised. If the table isn't entirely ALL_VISIBLE, then the choice of > index will determine the ordering of the actual table probes that occur. > There could be more or fewer page reads, in a more or less optimal > order, depending on the index used. OK, would the clustering analyze stats (pg_stats.correlation) tell us that? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 2011-05-11 01:54, Greg Stark wrote: > To be fair about 3/4 of them were actually complaining about the lack of some global materialized cache of the aggregate value. Covering index-only scans are only going to be a linear speedup; no matter how large the factor, it's not going to turn select count(*) into an O(1) operation. Actually, if we could get count(*) into the situation of a very "thin row" today, so the cost of visibility-testing didn't depend hugely on the width of the row any more, then we'd be halfway there in terms of performance optimization. If rows typically were just tuple headers + a bit more, then it would be way harder to go down this road and claim good benefits. But currently the system needs to drag in almost one page per visibility test from disk on "random large tables". I tried to graph the differences of thin vs. wide rows here: http://shrek.krogh.cc/~jesper/visibillitytesting.pdf http://postgresql.1045698.n5.nabble.com/Visibillity-testing-some-numbers-on-current-performance-td4284836.html Basing visibility testing on the visibility map shifts it, on large tables, from being disk-based to memory-based. Jesper (It's not a goal, but it would most likely postpone some people's need to buy a FusionIO card or similar) -- Jesper
On Wed, May 11, 2011 at 12:54 AM, Greg Stark <gsstark@mit.edu> wrote: > On a separate note though, Simon, I don't know what you mean by "we > normally start with a problem". It's a free software project and > people are free to work on whatever interests them whether that's > because it solves a problem they have, helps a client who's paying > them, or just because it's of academic interest to them. We don't > always take their patches if they aren't of general interest but > people propose all kinds of crazy experimental ideas all the time. Completely agree, but why are you saying that to me? When Tom asks me why I suggest something, nobody tells him "it's a free software project etc...". What is the difference here? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > Greg Stark wrote: >> On a separate note though, Simon, I don't know what you mean by "we >> normally start with a problem". It's a free software project and >> people are free to work on whatever interests them whether that's >> because it solves a problem they have, helps a client who's paying >> them, or just because it's of academic interest to them. We don't >> always take their patches if they aren't of general interest but >> people propose all kinds of crazy experimental ideas all the time. > > I am confused by Simon's questions too. > > Simon seems to regularly argue for adding features late in the > development cycle and backpatch things no one else thinks should be > backpatched, but he wants more research that index-only scans are going > to improve things before it is implemented? The first is aggressive > development, the second is very conservative development --- they don't > match, so I now wonder what the motivation is since it isn't consistent. Not really sure why reasonable technical skepticism should become personal commentary. You don't question Tom's motives if he is skeptical of an idea of mine. Why would you question my motivation? What is *your* motive for acting like that? I'm not driven by one setting of "conservatism", but I am interested in adding fully usable features that bring credit to the project. If I see a feature that can have minor things added to it to improve it, then I raise that during beta. If I see things being worked out that sound dubious, I mention that in early development. I don't think this work will materially improve the speed of count(*) in the majority of cases. This introduces extra overhead into the code path and that can be a net loss. The only time it will help is when you have a large table that is not cached and also not recently updated. Is count(*) run very often against such tables?
Do we really care enough to optimise that use case with lots of special purpose code? The very fact that Kevin and yourself bring up different reasons for why we need this feature makes me nervous. The analysis has not been done yet, and all I have done is request that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 11, 2011 at 2:34 AM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> So, what do we need in order to find our way to index-only scans? >> >> 1. The visibility map needs to be crash-safe. The basic idea of >> index-only scans is that, instead of checking the heap to find out >> whether each tuple is visible, we first check the visibility map. If >> the visibility map bit is set, then we know all tuples on the page are >> visible to all transactions, and therefore the tuple of interest is >> visible to our transaction. Assuming that a significant number of >> visibility map bits are set, this should enable us to avoid a fair >> amount of I/O, especially on large tables, because the visibility map >> is roughly 8000 times smaller than the heap, and therefore far more >> practical to keep in cache. However, before we can rely on the > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. This is important because we rely on this compactness in > hope that the WAL logging of this information will not be burdensome. We would need to issue one WAL record per bit, not per page. I'm concerned about the path length added by VM visits and the potential contention that concentration of information will bring. Those aren't things to be dismissed without calculation and analysis. There might not be an issue there, but its worth checking. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 10.05.2011 20:15, Simon Riggs wrote: > On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Simon Riggs<simon@2ndQuadrant.com> wrote: >> >>> This topic has been discussed many times, yet I have never seen an >>> assessment that explains WHY we would want to do index-only scans. >> >> In databases with this feature, it's not too unusual for a query >> which uses just an index to run one or more orders of magnitude >> faster than a query which has to randomly access the heap for each >> index entry. That seems like enough evidence of its possible value >> in PostgreSQL to proceed to the point where benchmarks become >> possible. I'm assuming that, like all other features added as >> performance optimizations, it won't be committed until there are >> benchmarks showing the net benefit. >> >> As a thought experiment, picture the relative costs of scanning a >> portion of an index in index sequence, and being done, versus >> scanning a portion of an index in index sequence and jumping to a >> random heap access for each index entry as you go. > > I can picture that. Regrettably, I can also picture the accesses to > the visibility map, the maintenance operations on the VM that are > needed for this and the contention that both of those will cause. Note that we already have the visibility map, and the accesses needed to update it are already there. Granted, we'll have to change the logic slightly to make it crash safe, but I don't expect that to add any meaningful overhead - the changes are going to be where the bits are set, ie. vacuum, not when the bits are cleared. Granted, we might also want to set the bits more aggressively once they're used by index-only-scans. But done correctly, just taking advantage of the VM that's already there shouldn't add overhead to other operations. I agree that we need to do tests to demonstrate that there's a gain from the patch, once we have a patch to test. 
I would be very surprised if there isn't, but that just means the testing is going to be easy. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2011/5/11, Bruce Momjian <bruce@momjian.us>: > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. Actually, that would be "one 8kB block covers 512MB of heap": 1 block of visibility map (8kB) = 64k visibility bits = covers 64k blocks = covers 512MB of heap. The cost of keeping the visibility map in cache is therefore totally negligible; only the cost of WAL logging changes to it is of interest. > This is important because we rely on this compactness in hope that > the WAL logging of this information will not be burdensome. The size of one entry in the map (1 bit) is not closely related to the WAL overhead required per change of such a bit (i.e., the log record for a 1-bit change will certainly be far more than 1 bit). Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad?
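As a sanity check of the arithmetic above (a sketch assuming the standard 8kB PostgreSQL block size and one visibility bit per heap page, as described in the message):

```python
# Sanity check of the visibility map coverage arithmetic.
# Assumes the standard 8kB block size and one VM bit per heap page.
BLOCK_SIZE = 8 * 1024  # bytes per block

bits_per_vm_block = BLOCK_SIZE * 8                 # one bit covers one heap page
heap_bytes_covered = bits_per_vm_block * BLOCK_SIZE

print(bits_per_vm_block)                  # 65536, i.e. 64k heap pages
print(heap_bytes_covered // (1024 ** 2))  # 512, i.e. 512MB of heap per 8kB VM block
```

So the map is 64k times smaller than the heap it describes, which is why keeping it cached is cheap; the per-bit WAL logging cost is a separate question, as the message notes.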
2011/5/11 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > On 10.05.2011 20:15, Simon Riggs wrote: >> >> On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner >> <Kevin.Grittner@wicourts.gov> wrote: >>> >>> Simon Riggs<simon@2ndQuadrant.com> wrote: >>> >>>> This topic has been discussed many times, yet I have never seen an >>>> assessment that explains WHY we would want to do index-only scans. >>> >>> In databases with this feature, it's not too unusual for a query >>> which uses just an index to run one or more orders of magnitude >>> faster than a query which has to randomly access the heap for each >>> index entry. That seems like enough evidence of its possible value >>> in PostgreSQL to proceed to the point where benchmarks become >>> possible. I'm assuming that, like all other features added as >>> performance optimizations, it won't be committed until there are >>> benchmarks showing the net benefit. >>> >>> As a thought experiment, picture the relative costs of scanning a >>> portion of an index in index sequence, and being done, versus >>> scanning a portion of an index in index sequence and jumping to a >>> random heap access for each index entry as you go. >> >> I can picture that. Regrettably, I can also picture the accesses to >> the visibility map, the maintenance operations on the VM that are >> needed for this and the contention that both of those will cause. > > Note that we already have the visibility map, and the accesses needed to > update it are already there. Granted, we'll have to change the logic > slightly to make it crash safe, but I don't expect that to add any > meaningful overhead - the changes are going to be where the bits are set, > ie. vacuum, not when the bits are cleared. Granted, we might also want to > set the bits more aggressively once they're used by index-only-scans. But > done correctly, just taking advantage of the VM that's already there > shouldn't add overhead to other operations. 
We won't be able to do a pure index-only scan; we can do an index scan with some probability of not accessing the heap, perhaps (hopefully) avoiding it completely in some cases. If the visibility map is enough to remove the need to access the heap (performance-wise and safely), then, for the cost part: currently, cost_index charges a random_page_cost for each heap page accessed. I believe we should be able to change that to subtract the estimated number of heap pages we don't need to access (which could be 100% of them, or 0.1%). And, as Simon suggested, there may also be a path to improving the bitmap heap scan: bitmap heap scan already has some workarounds to make sure an index scan looks cheaper in some cases; just keep those and apply the same logic as for cost_index. This keeps the same rule PostgreSQL has: let the planner decide the best solution instead of allowing special index declarations (that hasn't been proposed yet, I think, but, well, just in case it pops into someone's mind). > > I agree that we need to do tests to demonstrate that there's a gain from the > patch, once we have a patch to test. I would be very surprised if there > isn't, but that just means the testing is going to be easy. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
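The cost adjustment sketched in the message might look roughly like this. This is an illustration, not the actual cost_index code; the function shape, the independence of the all-visible fraction, and the default random_page_cost value are assumptions made for the sketch:

```python
def heap_fetch_cost(heap_pages, all_visible_frac, random_page_cost=4.0):
    """Illustrative only: charge random_page_cost per heap page, as the
    current model does, but discount the pages whose visibility map bit
    would let an index-only scan skip the heap fetch entirely."""
    pages_still_visited = heap_pages * (1.0 - all_visible_frac)
    return pages_still_visited * random_page_cost

# Mostly-static table: much of the heap component disappears.
print(heap_fetch_cost(1000, 0.75))  # 1000.0
# Heavily-updated table (no bits set): same estimate as today.
print(heap_fetch_cost(1000, 0.0))   # 4000.0
```

The appeal of this shape is that it degrades gracefully: with no all-visible pages the estimate matches the current one, so the planner only prefers the index path when the map actually promises savings.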
2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> The typical speed up for non-covered indexes will come when we >> access a very large table (not in cache) via an index scan that is >> smaller than a bitmapindex scan. Will we be able to gauge >> selectivities sufficiently accurately to be able to pinpoint that >> during optimization? How will we know that the table is not in >> cache? Or is this an optimisation in the executor for a bitmapheap >> scan? > > I would continue to object to using current cache contents for plan > choice because of plan instability and the fact that an odd initial > cache load could skew plans in a bad direction indefinitely. I do > agree (and have already posted) that I think the hardest part of > this might be developing a good cost model. I doubt that's an > insoluble problem, especially since it is something we can refine > over time as we gain experience with the edge cases. You will have the same possible instability in planning with the index(-only?) scan, because we may need to access the heap anyway and that need is based on estimation; or am I missing something? I understood the idea was just to bypass the heap access *if* we can for *this* heap page. In reality, I am not really scared by plan instability due to a possible PG/OS cache estimation; the percentages remain stable in my observations, though I don't know yet how it will go for the visibility map. And we already have plan instability currently, which is *good*: at some point a seq scan is better than a bitmap heap scan, because the relation size changes and because ANALYZE re-estimates the distribution of the data. I would be very happy to issue ANALYZE CACHE, as I have to ANALYZE temp tables for large queries, if it allowed the planner to provide me the best plan in some scenarios... but this is another topic, sorry for the digression. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Tue, May 10, 2011 at 9:34 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> So, what do we need in order to find our way to index-only scans? >> >> 1. The visibility map needs to be crash-safe. The basic idea of >> index-only scans is that, instead of checking the heap to find out >> whether each tuple is visible, we first check the visibility map. If >> the visibility map bit is set, then we know all tuples on the page are >> visible to all transactions, and therefore the tuple of interest is >> visible to our transaction. Assuming that a significant number of >> visibility map bits are set, this should enable us to avoid a fair >> amount of I/O, especially on large tables, because the visibility map >> is roughly 8000 times smaller than the heap, and therefore far more >> practical to keep in cache. However, before we can rely on the > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. This is important because we rely on this compactness in > hope that the WAL logging of this information will not be burdensome. I accuse you of bad math. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 10:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That will be true only if you intentionally ignore the points Greg > raised. If the table isn't entirely ALL_VISIBLE, then the choice of > index will determine the ordering of the actual table probes that occur. > There could be more or fewer page reads, in a more or less optimal > order, depending on the index used. However, note that this wasn't one of the cases I said I was going to try to optimize in the first go-around anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 11, 2011 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Completely agree, but why are you saying that to me? > > When Tom asks me why I suggest something, nobody tells him "its a free > software project etc...". > > What is the difference here? We're now 40 emails in this thread, and there seems to be far more heat than light here. Here's an attempt at a summary: - Simon wants proof that the performance benefit of this feature is worth any downsides it may have, which is standard procedure, and isn't certain the feature will have a significant performance benefit. - Numerous other people think Simon's doubts about the feature are poorly justified (and some of them also think he's being a pain in the neck). - Various peripherally related topics, such as optimizing count(*), which is not part of the vision for the first go-round that I sketched out in my OP, and plan stability, which is another issue entirely, have been discussed. - Meanwhile, only one person has done any review of the actual code that's been written, which is posted on the crash-safe visibility map thread, which may be why multiple people seem confused about what it does. - And no one has done any benchmarking of that code. I think it would be really helpful if some more people would review the crash-safe visibility map patch, and if at least one person could benchmark it. It would be useful to know (a) whether that noticeably slows down the system during inserts, updates, and deletes, especially at very high concurrency; and (b) how much of an impact the additional WAL-logging has on VACUUM. On the other hand, arguing about whether index-only scans are going to result in a significant performance benefit is not useful. I am going to be both surprised and disappointed if they don't, but there's only one way to find out, and a theoretical argument isn't it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Nicolas Barbier wrote: > 2011/5/11, Bruce Momjian <bruce@momjian.us>: > > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > > heap pages. > > Actually, that would be "one 8kB block covers 512MB of heap": 1 block > of visibility map (8kB) = 64k visibility bits = covers 64k blocks = > covers 512MB of heap. The cost of keeping the visibility map in cache > is therefore totally negligible, only the cost of WAL logging changes > to it is of interest. Ah, yes, thanks, even better. > > This is important because we rely on this compactness in hope that > > the WAL logging of this information will not be burdensome. > > The size of on entry in the map (1 bit) is not very related to the WAL > overhead required per change of such a bit (i.e., the log record for a > 1 bit change will certainly be way more than 1 bit). True. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Simon Riggs wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Greg Stark wrote: > >> On a separate note though, Simon, I don't know what you mean by "we > >> normally start with a problem". It's an free software project and > >> people are free to work on whatever interests them whether that's > >> because it solves a problem they have, helps a client who's paying > >> them, or just because it's of academic interest to them. We don't > >> always take their patches if they aren't of general interest but > >> people propose all kinds of crazy experimental ideas all the time. > > > > I am confused by Simon's questions too. > > > > Simon seems to regularly argue for adding features late in the > > development cycle and backpatch things no one else thinks should be > > backpatched, but he wants more research that index-only scans are going > > to improve things before it is implemented? The first is aggressive > > development, the second is very conservative development --- they don't > > match, so I now wonder what the motivation is since it isn't consistent. > > Not really sure why reasonable technical skepticism should become > personal commentary. > > You don't question Tom's motives if he is skeptical of an idea of > mine. Why would you question my motivation? What is *your* motive for > acting like that? Tom is consistent in his level of aggressive/conservative development suggestions. What I am seeing are many cases where you are consistently pushing for something even though you get almost-overwhelming rejection, and you keep going. And if it was consistent in one direction, I could understand because maybe you feel we are too conservative, but if it isn't consistent, I have no idea how to learn or adjust to your approach. We clearly have some people on one side of the conservative/aggressive spectrum, and some on the other side.
Now, I am willing to admit I might be totally wrong, but it has risen to a level that I felt I should say something in case it is helpful. > I'm not driven by one setting of "conservatism", but I am interested > in adding fully usable features that bring credit to the project. If I > see a feature that can have minor things added to it to improve them, > then I raise that during beta. If I see things being worked out that > sounds dubious, I mention that in early development. Yes, that seems fine to me, as stated. > I don't think this work will materially improve the speed of count(*) > in majority of cases. This introduces extra overhead into the code I think this is the only hope we have of improving count(*) in an active MVCC system. It might not work, but it has been our only hope for improving count(*) for a while. > path and that can be a net loss. The only time it will help is when > you have a large table that is not cached and also not recently > updated. Is count(*) run very often against such tables? Do we really > care enough to optimise that use case with lots of special purpose > code? The very fact that Kevin and yourself bring up different reasons > for why we need this feature makes me nervous. Yes, no question. For count(*), you don't care about the indexed values, only the count, while for Kevin's case you are reading values from the index. I assume (or hope) that one or both will be a win for this feature. > The analysis has not been done yet, and all I have done is request that. I think we are going to have to write the code and see the performance hit and where it is a win. Ideally we could figure this out beforehand, but I don't think that is possible in this case. If you look at the research in reducing the load of updating the hint bits, again, it is so complex that only working code and testing will show whether improvement is possible there.
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Cédric Villemain wrote: > 2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: > > Simon Riggs <simon@2ndQuadrant.com> wrote: > >> The typical speed up for non-covered indexes will come when we > >> access a very large table (not in cache) via an index scan that is > >> smaller than a bitmapindex scan. Will we be able to gauge > >> selectivities sufficiently accurately to be able to pinpoint that > >> during optimization? How will we know that the table is not in > >> cache? Or is this an optimisation in the executor for a bitmapheap > >> scan? > > > > I would continue to object to using current cache contents for plan > > choice because of plan instability and the fact that an odd initial > > cache load could skew plans in a bad direction indefinitely. I do > > agree (and have already posted) that I think the hardest part of > > this might be developing a good cost model. I doubt that's an > > insoluble problem, especially since it is something we can refine > > over time as we gain experience with the edge cases. > > you will have the same possible instability in planning with the > index(-only?) scan because we may need to access heap anyway and this > needs is based on estimation, or I miss something ? I understood the > idea was just to bypass the heap access *if* we can for *this* > heap-page. > > In reality, I am not really scared by plan instability because of a > possible PG/OS cache estimation. The percentages remain stable in my > observations ... I don't know yet how it will go for vis map. > > And, we already have plan instability currently, which is *good* : at > some point a seq scan is better than an bitmap heap scan. Because the > relation size change and because ANALYZE re-estimate the distribution > of the data. I will be very happy to issue ANALYZE CACHE as I have to > ANALYZE temp table for large query if it allows the planner to provide > me the best plan in some scenario...but this is another topic, sorry > for the digression..
Good point --- we would be making plan decisions based on the visibility map coverage. The big question is whether visibility map changes are more dynamic than the values we already plan against, like rows in the table, table size, and value distributions. I don't know the answer. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> wrote: >> The very fact that Kevin and yourself bring up different reasons >> for why we need this feature makes me nervous. > > Yes, no question. For count(*), you don't care about the indexed > values, only the count, while for Kevin's case you are reading > values from the index. [sigh] I'm reluctant to draw out this digression further, but there is a possibly-useful point to be made here: these are not two different things. A covering index can be considered whenever the set of columns referenced in the query is contained inside the set of columns in the index. The fact that the set of columns needed by count(*) is the empty set merely means that it is covered by any index, since the empty set is contained in every set. Now, this special case may make for an easy initial target in implementation, or allow early benchmarking. If so, all the better to go there first. I'm not sure why anyone would stop there, though; if it pays off for that simple case it is likely to pay off for the more general case, too. -Kevin
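Kevin's containment rule can be written down directly in terms of column sets (a sketch; the column names are invented for illustration):

```python
def index_covers(query_columns, index_columns):
    """An index is a candidate for an index-only scan whenever the set of
    columns the query references is contained in the set of indexed columns."""
    return set(query_columns) <= set(index_columns)

# An index on (a, b) covers a query referencing only a and b:
print(index_covers({"a", "b"}, {"a", "b"}))  # True
# count(*) references no columns; the empty set is contained in every
# set, so count(*) is "covered" by any index at all:
print(index_covers(set(), {"x"}))            # True
# But a query that also references c is not covered by an index on (a, b):
print(index_covers({"a", "c"}, {"a", "b"}))  # False
```

This makes Kevin's point concrete: count(*) is not a separate feature, just the degenerate case of the same containment test.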
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 10.05.2011 20:15, Simon Riggs wrote: >> I can picture that. Regrettably, I can also picture the accesses to >> the visibility map, the maintenance operations on the VM that are >> needed for this and the contention that both of those will cause. > I agree that we need to do tests to demonstrate that there's a gain from > the patch, once we have a patch to test. I would be very surprised if > there isn't, but that just means the testing is going to be easy. I think Simon's point is that showing a gain on specific test cases isn't a sufficient argument. What we need to know about this sort of change is what is the distributed overhead that is going to be paid by *everybody*, whether their queries benefit from the optimization or not. And what fraction of real-world queries really do benefit, and to what extent. Isolated test cases (undoubtedly chosen to show off the optimization) are not adequate to form a picture of the overall cost and benefit. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > On 10.05.2011 20:15, Simon Riggs wrote: > >> I can picture that. Regrettably, I can also picture the accesses to > >> the visibility map, the maintenance operations on the VM that are > >> needed for this and the contention that both of those will cause. > > > I agree that we need to do tests to demonstrate that there's a gain from > > the patch, once we have a patch to test. I would be very surprised if > > there isn't, but that just means the testing is going to be easy. > > I think Simon's point is that showing a gain on specific test cases > isn't a sufficient argument. What we need to know about this sort of > change is what is the distributed overhead that is going to be paid by > *everybody*, whether their queries benefit from the optimization or not. > And what fraction of real-world queries really do benefit, and to what > extent. Isolated test cases (undoubtedly chosen to show off the > optimization) are not adequate to form a picture of the overall cost and > benefit. Yes, I assume we are going to need the same kind of tests we did for other invasive patches like serializable isolation level and hot standby. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think Simon's point is that showing a gain on specific test > cases isn't a sufficient argument. Ah, if that's what he's been trying to get at, I'm curious who disagrees with that. I wouldn't have thought anyone on this list would. > What we need to know about this sort of change is what is the > distributed overhead that is going to be paid by *everybody*, > whether their queries benefit from the optimization or not. Certainly we need to test whether Heikki is right in the previously non-quoted part of his post on this thread: >> Note that we already have the visibility map, and the accesses >> needed to update it are already there. Granted, we'll have to >> change the logic slightly to make it crash safe, but I don't >> expect that to add any meaningful overhead - the changes are >> going to be where the bits are set, ie. vacuum, not when the bits >> are cleared. Granted, we might also want to set the bits more >> aggressively once they're used by index-only-scans. But done >> correctly, just taking advantage of the VM that's already there >> shouldn't add overhead to other operations. > Isolated test cases (undoubtedly chosen to show off the > optimization) are not adequate to form a picture of the overall > cost and benefit. Well, first, that hardly seems fair. I have many times seen people make an effort to synthesize *worst* case benchmarks. Certainly any regular on this list would know it is pointless to show only a best case benchmark. Second, we really need to make development of a performance testing farm a priority at PGCon next week. The need for it just keeps coming up over and over. Third, Dan Ports has been working a great deal with DBT-2 running PostgreSQL for the SSI patch, both as a stress tool to flush out bugs and to get benchmarks numbers conforming to the published requirements of that benchmark. 
I know from off-list emails that it took a fair amount of work to get it running correctly with PostgreSQL in his environment. We should probably try to draw on that experience. (Of course that shouldn't be the *only* test in a performance testing farm, but it's a good one to include.) -Kevin
2011/5/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, May 11, 2011 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Completely agree, but why are you saying that to me? >> >> When Tom asks me why I suggest something, nobody tells him "its a free >> software project etc...". >> >> What is the difference here? > > We're now 40 emails in this thread, and there seems to be far more > heat than light here. Here's an attempt at a summary: > > - Simon wants proof that the performance benefit of this feature is > worth any downsides it may have, which is standard procedure, and > isn't certain the feature will have a significant performance benefit. > - Numerous other people think Simon's doubts about the feature are > poorly justified (and some of them also think he's being a pain in the > neck). > - Various peripherally related topics, such as optimizing count(*), > which is not part of the vision for the first go-round that I sketched > out in my OP, and plan stability, which is another issue entirely, > have been discussed. > - Meanwhile, only one person has done any review of the actual code > that's been written, which is posted on the crash-safe visibility map > thread, which may be why multiple people seem confused about what it > does. > - And no one has done any benchmarking of that code. Will you be able to do some? Or can you propose a simple process for benchmarking the patch efficiently? If reviewers agree it is safe and benchmarks show that no basic performance issues are raised, then let's see whether the next items have blockers or can be done. > > I think it would be really helpful if some more people would review > the crash-safe visibility map patch, and if at least one person could > benchmark it. It would be useful to know (a) whether that noticeably > slows down the system during inserts, updates, and deletes, especially > at very high concurrency; and (b) how much of an impact the additional > WAL-logging has on VACUUM.
> On the other hand, arguing about whether index-only scans are going to result in a significant performance benefit is not useful. I am going to be both surprised and disappointed if they don't, but there's only one way to find out, and a theoretical argument isn't it. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Fri, May 13, 2011 at 6:34 PM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > Will you be able to do some ? or can you propose a simple process to > do efficient benchmark of the patch ? I will probably do some benchmarking at some point, unless someone else goes nuts and makes it moot before I get to that point. I think the main thing is to just apply the patch and beat up the server, and see if it's any slower than it was before. > If reviewers agree it is safe and benchmarks make evidences that no > basic performance issue are raised, then let's see if next items have > blockers or can be done. Sounds right to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/14 Robert Haas <robertmhaas@gmail.com>: > On Fri, May 13, 2011 at 6:34 PM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >> If reviewers agree it is safe and benchmarks make evidences that no >> basic performance issue are raised, then let's see if next items have >> blockers or can be done. > > Sounds right to me. and recent stuff on VM will allow us to move forward it seems ! -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
2011/5/11 Bruce Momjian <bruce@momjian.us>: > Cédric Villemain wrote: >> 2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: >> > Simon Riggs <simon@2ndQuadrant.com> wrote: >> >> The typical speed up for non-covered indexes will come when we >> >> access a very large table (not in cache) via an index scan that is >> >> smaller than a bitmapindex scan. Will we be able to gauge >> >> selectivities sufficiently accurately to be able to pinpoint that >> >> during optimization? How will we know that the table is not in >> >> cache? Or is this an optimisation in the executor for a bitmapheap >> >> scan? >> > >> > I would continue to object to using current cache contents for plan >> > choice because of plan instability and the fact that an odd initial >> > cache load could skew plans in a bad direction indefinitely. I do >> > agree (and have already posted) that I think the hardest part of >> > this might be developing a good cost model. I doubt that's an >> > insoluble problem, especially since it is something we can refine >> > over time as we gain experience with the edge cases. >> >> you will have the same possible instability in planning with the >> index(-only?) scan because we may need to access heap anyway and this >> needs is based on estimation, or I miss something ? I understood the >> idea was just to bypass the heap access *if* we can for *this* >> heap-page. >> >> In reality, I am not really scared by plan instability because of a >> possible PG/OS cache estimation. The percentages remain stable in my >> observations ... I don't know yet how it will go for vis map. >> >> And, we already have plan instability currently, which is *good* : at >> some point a seq scan is better than an bitmap heap scan. Because the >> relation size change and because ANALYZE re-estimate the distribution >> of the data.
>> I will be very happy to issue ANALYZE CACHE as I have to >> ANALYZE temp table for large query if it allows the planner to provide >> me the best plan in some scenario...but this is another topic, sorry >> for the digression.. > > Good point --- we would be making plan decisions based on the visibility > map coverage. The big question is whether visibility map changes are > more dynamic than the values we already plan against, like rows in the > table, table size, and value distributions. I don't know the answer. > Robert, I thought of a covered index as just a usage of the visibility map: don't fetch the heap block if it isn't needed. This looks easier to do and better in the long term (because a read-only DB may quickly turn into a no-heap-access DB, for example); thus it is not a real covered index. Did you want to implement real covered indexes, and do you have ideas on how to do that? Or just an optimization of the current planner/executor on index usage? ___ I don't know the VM internals: * do we have a counter of ALL_VISIBLE flags set on a relation? (this should be very good for the planner) * do we need a pg_class.rel_vmvisible?! (I have my hands up, don't shoot pleeaase) * is it OK to parse the VM for planning (I believe it is not)? Ideas? -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Sun, Jun 19, 2011 at 10:44 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > and recent stuff on VM will allow us to move forward it seems ! Yep, looks promising. The heap_hot_search_buffer refactoring patch is related as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 19, 2011 at 11:12 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >> Good point --- we would be making plan decisions based on the visibility >> map coverage. The big question is whether visibility map changes are >> more dynamic than the values we already plan against, like rows in the >> table, table size, and value distributions. I don't know the answer. > > Robert, I though of Covered-Index as just a usage of the vis map: > don't take the heap block if not needed. This look easier to do and > better in the long term (because read-only DB may quickly turn into a > no-heap access DB for example). Thus this is not real covered-index. > Did you want to implement real covered-index and did you have ideas on > how to do that ? Or just an optimization of the current > planner/executor on index usage ? If by a "real" covered index you mean one that includes visibility info in the index - I have no plans to work on anything like that. If we were to do that, the index would become much larger and less efficient, whereas the approach of just optimizing the way our existing indexes are used doesn't have that disadvantage. It also sounds like a lot of work. Now, if someone else wants to demonstrate that it has advantages that are worth the costs and go do it, more power to said person, but I'm unexcited about it. > I don't know VM internals: > > * do we have a counter of ALL_VISIBLE flag set on a relation ? (this > should be very good for planner) > * do we need a pg_class.rel_vmvisible ?! (I have hands up, don't > shoot pleeaase) Evidently I'm developing a more frightening reputation than I would hope. :-( Anyway, yes, I do believe we need a table-level statistic for the percentage of the visibility map bits that are believed to be set. Having said that I think we need it, let me also say that I'm a bit skeptical about how well it will work. There are two problems: 1. Consider a query like "SELECT a, b FROM foo WHERE a = 1". 
To accurately estimate the cost of executing this query via an index-only scan (on an index over foo (a, b)), we need to know (i) the percentage of rows in the table for which a = 1 and (ii) the percentage *of those rows* which are on an all-visible page. We can assume that if 80% of the rows in the table are on all-visible pages, then 80% of the rows returned by this query will be on all-visible pages also, but that might be wildly wrong. This is similar to the problem of costing "SELECT * FROM foo WHERE a = 1 AND b = 1" - we know the fraction of rows where a = 1 and the fraction where b = 1, but there's no certainty that multiplying those values will produce an accurate estimate for the conjunction of those conditions. The problem here is not as bad as the general multi-column statistics problem because a mistake will only bollix the cost, not the row count estimate, but it's still not very nice. 2. Since VACUUM and ANALYZE often run together, we will be estimating the percentage of rows on all-visible pages just at the time when that percentage is highest. This is not exactly wonderful, either... I have a fair amount of hope that even with these problems we can come up with some adjustment to the planner that is better than just ignoring the problem, but I am not sure how difficult it will be. > * is it ok to parse VM for planning (I believe it is not) ? It doesn't seem like a good idea to me, but I just work here. I'm not sure what that would buy us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
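The independence assumption Robert describes reduces to a simple multiplication. A minimal sketch (illustrative Python, not planner code; the function name and the numbers are invented):

```python
# Illustrative sketch of the independence assumption described above
# (invented function, not PostgreSQL planner code): estimate the fraction
# of rows an index-only scan can serve without a heap fetch by multiplying
# the predicate selectivity by the table-wide all-visible fraction.

def estimate_index_only_fraction(selectivity, all_visible_frac):
    # Assumes matching rows are spread uniformly across the table's pages.
    return selectivity * all_visible_frac

# 5% of rows have a = 1, and 80% of rows sit on all-visible pages, so we
# plan as if 80% of the matching rows can skip the heap fetch. If a = 1
# instead marks recently inserted rows, the true figure could be near 0%.
print(estimate_index_only_fraction(0.05, 0.80))  # ~0.04 of all rows
```

The same multiplication is exactly what goes "wildly wrong" when the predicate correlates with row age, as in the a = 1 example above.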
2011/6/19 Robert Haas <robertmhaas@gmail.com>: > On Sun, Jun 19, 2011 at 11:12 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >>> Good point --- we would be making plan decisions based on the visibility >>> map coverage. The big question is whether visibility map changes are >>> more dynamic than the values we already plan against, like rows in the >>> table, table size, and value distributions. I don't know the answer. >> >> Robert, I though of Covered-Index as just a usage of the vis map: >> don't take the heap block if not needed. This look easier to do and >> better in the long term (because read-only DB may quickly turn into a >> no-heap access DB for example). Thus this is not real covered-index. >> Did you want to implement real covered-index and did you have ideas on >> how to do that ? Or just an optimization of the current >> planner/executor on index usage ? > > If by a "real" covered index you mean one that includes visibility > info in the index - I have no plans to work on anything like that. If > we were to do that, the index would become much larger and less > efficient, whereas the approach of just optimizing the way our > existing indexes are used doesn't have that disadvantage. It also > sounds like a lot of work. Now, if someone else wants to demonstrate > that it has advantages that are worth the costs and go do it, more > power to said person, but I'm unexcited about it. Yes, I was thinking of that, and I agree with you. > >> I don't know VM internals: >> >> * do we have a counter of ALL_VISIBLE flag set on a relation ? (this >> should be very good for planner) >> * do we need a pg_class.rel_vmvisible ?! (I have hands up, don't >> shoot pleeaase) > > Evidently I'm developing a more frightening reputation than I would hope. :-( Nah, I was joking :) don't worry! Probably because I have already proposed one new GUC and at least one new column for pg_class recently
(and we're not used to changing things that frequently). > > Anyway, yes, I do believe we need a table-level statistic for the > percentage of the visibility map bits that are believed to be set. > Having said that I think we need it, let me also say that I'm a bit > skeptical about how well it will work. There are two problems: > > 1. Consider a query like "SELECT a, b FROM foo WHERE a = 1". To > accurately estimate the cost of executing this query via an index-only > scan (on an index over foo (a, b)), we need to know (i) the percentage > of rows in the table for which a = 1 and (ii) the percentage *of those > rows* which are on an all-visible page. We can assume that if 80% of > the rows in the table are on all-visible pages, then 80% of the rows > returned by this query will be on all-visible pages also, but that > might be wildly wrong. This is similar to the problem of costing > "SELECT * FROM foo WHERE a = 1 AND b = 1" - we know the fraction of > rows where a = 1 and the fraction where b = 1, but there's no > certainty that multiplying those values will produce an accurate > estimate for the conjunction of those conditions. The problem here is > not as bad as the general multi-column statistics problem because a > mistake will only bollix the cost, not the row count estimate, but > it's still not very nice. > > 2. Since VACUUM and ANALYZE often run together, we will be estimating > the percentage of rows on all-visible pages just at the time when that > percentage is highest. This is not exactly wonderful, either... > > I have a fair amount of hope that even with these problems we can come > up with some adjustment to the planner that is better than just > ignoring the problem, but I am not sure how difficult it will be. > >> * is it ok to parse VM for planning (I believe it is not) ? > > It doesn't seem like a good idea to me, but I just work here. I'm not > sure what that would buy us.
All true, and I won't be unhappy to have the feature as a bonus, not expected by the planner (for the cost part) but handled by the executor. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Jun19, 2011, at 20:40 , Robert Haas wrote: > 2. Since VACUUM and ANALYZE often run together, we will be estimating > the percentage of rows on all-visible pages just at the time when that > percentage is highest. This is not exactly wonderful, either... Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than VACUUM by default? best regards, Florian Pflug
On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun19, 2011, at 20:40 , Robert Haas wrote: >> 2. Since VACUUM and ANALYZE often run together, we will be estimating >> the percentage of rows on all-visible pages just at the time when that >> percentage is highest. This is not exactly wonderful, either... > > Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than > VACUUM by default? The autoanalyze threshold is, by default, 10%; and the autovacuum threshold, 20%. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
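For reference, those percentages come from the autovacuum trigger formulas. A sketch of the arithmetic with the default GUC values (`autovacuum_vacuum_scale_factor` = 0.2, `autovacuum_analyze_scale_factor` = 0.1, both base thresholds 50); this is illustrative Python, not the actual autovacuum code:

```python
# Sketch of the autovacuum trigger formulas behind the 10%/20% figures
# above, using the default GUC values (autovacuum_*_threshold = 50,
# vacuum scale factor 0.2, analyze scale factor 0.1). Illustrative
# Python, not the actual autovacuum code.

VACUUM_THRESHOLD, VACUUM_SCALE = 50, 0.2
ANALYZE_THRESHOLD, ANALYZE_SCALE = 50, 0.1

def needs_autovacuum(dead_tuples, reltuples):
    return dead_tuples > VACUUM_THRESHOLD + VACUUM_SCALE * reltuples

def needs_autoanalyze(changed_tuples, reltuples):
    return changed_tuples > ANALYZE_THRESHOLD + ANALYZE_SCALE * reltuples

# On a 1M-row table, 150,000 changed tuples trigger ANALYZE but not
# VACUUM, so ANALYZE-only runs do happen between vacuums.
print(needs_autoanalyze(150_000, 1_000_000))  # True
print(needs_autovacuum(150_000, 1_000_000))   # False
```

So roughly every other ANALYZE runs without an accompanying VACUUM, which is what makes Florian's suggestion workable.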
On Jun19, 2011, at 23:16 , Robert Haas wrote: > On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: >> On Jun19, 2011, at 20:40 , Robert Haas wrote: >>> 2. Since VACUUM and ANALYZE often run together, we will be estimating >>> the percentage of rows on all-visible pages just at the time when that >>> percentage is highest. This is not exactly wonderful, either... >> >> Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than >> VACUUM by default? > > The autoanalyze threshold is, by default, 10%; and the autovacuum > threshold, 20%. Hm, so you could ignore (or rather dampen) the results of VACUUM+ANALYZE and rely on the ANALYZE-only runs to keep the estimate correct. Still doesn't sound that bad... best regards, Florian Pflug
On Sun, Jun 19, 2011 at 7:59 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun19, 2011, at 23:16 , Robert Haas wrote: >> On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: >>> On Jun19, 2011, at 20:40 , Robert Haas wrote: >>>> 2. Since VACUUM and ANALYZE often run together, we will be estimating >>>> the percentage of rows on all-visible pages just at the time when that >>>> percentage is highest. This is not exactly wonderful, either... >>> >>> Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than >>> VACUUM by default? >> >> The autoanalyze threshold is, by default, 10%; and the autovacuum >> threshold, 20%. > > Hm, so you could ignore (or rather dampen) the results of > VACUUM+ANALYZE and rely on the ANALYZE-only runs to keep > the estimate correct. Still doesn't sound that bad... Yeah, there are lots of possible approaches. You could try to keep a count of how many visibility map bits had been cleared since the last run... and either adjust the estimate directly or use it to trigger an ANALYZE (or some limited ANALYZE that only looks at visibility map bits). You could gather statistics on how often the queries that are actually running are finding the relevant visibility map bits set, and use that to plan future queries. You could do what you're suggesting... and there are probably other options as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
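The first of those approaches - decaying the ANALYZE-time estimate by the number of bits cleared since - might look something like this. Everything here (names, formula) is hypothetical; nothing like it exists in PostgreSQL:

```python
# Hypothetical sketch of the first approach above: remember the
# all-visible fraction measured at the last ANALYZE, and decay it by the
# number of visibility map bits cleared since then. All names and the
# formula are invented for illustration; no such code exists in PostgreSQL.

def current_all_visible_estimate(frac_at_analyze, relpages, bits_cleared_since):
    # Each cleared bit turns one more page non-all-visible.
    if relpages == 0:
        return 0.0
    return max(0.0, frac_at_analyze - bits_cleared_since / relpages)

# ANALYZE saw 90% of 10,000 pages all-visible; 3,000 bits were cleared
# since, so plan as if only ~60% of pages are still all-visible.
print(current_all_visible_estimate(0.9, 10_000, 3_000))  # ~0.6
```

A decayed estimate like this also addresses problem 2 above: the stored fraction is no longer frozen at its post-VACUUM peak.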
On Tue, May 10, 2011 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> >> Any thoughts welcome. Incidentally, if anyone else feels like working >> >> on this, feel free to let me know and I'm happy to step away, from all >> >> of it or from whatever part someone else wants to tackle. I'm mostly >> >> working on this because it's something that I think we really need to >> >> get done, more than having a burning desire to be the one who does it. >> > >> > Indexonly scans are welcome! >> > I believe I can help on 3 and 4, but (really) not sure for 1 and 2. >> >> Well, I have code for #1, and just need reviews, and #2 shouldn't be >> that hard, and with luck I'll twist Bruce's arm into doing it (*waves >> to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts >> on what/how you'd like to contribute there? > > I am happy to have pg_upgrade skip upgrading visibility map files --- it > already has code to conditionally process them because they only exist > in >= 8.4: > > /* fsm/vm files added in PG 8.4 */ > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) > { > /* > * Copy/link any fsm and vm files, if they exist > */ > > Just give the word and it will be done. I hereby give the word. :-) Specifically, we need to skip copying vm files (only) if coming from a version prior to this commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Note that we already have the visibility map, and the accesses needed to update it are already there. Granted, we'll have to change the logic slightly to make it crash safe, but I don't expect that to add any meaningful overhead - the changes are going to be where the bits are set, i.e. vacuum, not where the bits are cleared. Granted, we might also want to set the bits more aggressively once they're used by index-only scans. But done correctly, just taking advantage of the VM that's already there shouldn't add overhead to other operations.
I agree that we need to do tests to demonstrate that there's a gain from the patch, once we have a patch to test. I would be very surprised if there isn't, but that just means the testing is going to be easy.
--
Heikki Linnakangas
I could see some arguments supporting this feature, citing covering indexes as an example. But I just want to highlight that they are not the same. The visibility map approach is not related to the covering-index approach at all, except for the shared intention of avoiding the heap fetch. Because of the map's small size, we will be having more contention (people who have worked with Oracle can take the example of a bitmap index on an OLTP database). I previously made the suggestion to make these crash-safe visibility maps optional per table, so that the overhead which comes with them can be avoided for those tables whose queries don't benefit from index-only scans. The fact that the proposal is for the crash-safe visibility map to become a default part of any PostgreSQL table will definitely have wide-ranging implications on OLTP performance.
Gokul.
On Fri, Aug 19, 2011 at 9:19 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > The fact that the > proposal is for crash safe visibility map, to become a default package of > any Postgresql table will definitely have wide ranging implications on OLTP > performance. Well, that would certainly be alarming if true, but I don't think it is. As far as I can see, the overhead of making the visibility map crash-safe is just (1) a very small percentage increase in the work being done by VACUUM and (2) a slight possibility of extra work done by a foreground process if the visibility map bit changes at almost exactly the same time the process was about to insert, update, or delete a tuple. If someone comes up with a test where this overhead is enough to measure, then we might need to rethink our whole approach. Maybe we would make it an optional feature, or maybe we would just rip it out and start over with some sort of redesign, or maybe we would look for other optimizations to counterbalance the additional overhead. I don't know. But as far as I can see you're hypothesizing without evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > > I am happy to have pg_upgrade skip upgrading visibility map files --- it > > already has code to conditionally process them because they only exist > > in >= 8.4: > > > > /* fsm/vm files added in PG 8.4 */ > > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) > > { > > /* > > * Copy/link any fsm and vm files, if they exist > > */ > > > > Just give the word and it will be done. > > I hereby give the word. :-) > > Specifically, we need to skip copying vm files (only) if coming from a > version prior to this commit: > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c Done with the attached, applied patch. There was no cat-version bump from that commit (because the format didn't change, just the crash-safeness) so I picked the first cat-version change after this commit. This is only a pg_upgrade 9.2+ issue. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index 6def748..a19b3df
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 64,69 ****
--- 64,75 ----
  #define TABLE_SPACE_SUBDIRS_CAT_VER 201001111

  /* postmaster/postgres -b (binary_upgrade) flag added during PG 9.1 development */
  #define BINARY_UPGRADE_SERVER_FLAG_CAT_VER 201104251

+ /*
+  * Visibility map changed with this 9.2 commit,
+  * 8f9fe6edce358f7904e0db119416b4d1080a83aa; pick later catalog version.
+  */
+ #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
+
  /*
   * Each relation is represented by a relinfo structure.

diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index d4a420f..df752c5
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 120,128 ****
      int         numFiles = 0;
      int         mapnum;
      int         fileno;
!     old_dir[0] = '\0';

      for (mapnum = 0; mapnum < size; mapnum++)
      {
          char        old_file[MAXPGPATH];
--- 120,134 ----
      int         numFiles = 0;
      int         mapnum;
      int         fileno;
!     bool        vm_crashsafe_change = false;
!     old_dir[0] = '\0';

+     /* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
+     if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
+         new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
+         vm_crashsafe_change = true;
+
      for (mapnum = 0; mapnum < size; mapnum++)
      {
          char        old_file[MAXPGPATH];
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 168,175 ****
      for (fileno = 0; fileno < numFiles; fileno++)
      {
          if (strncmp(namelist[fileno]->d_name, scandir_file_pattern,
!                     strlen(scandir_file_pattern)) == 0)
          {
              snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
                       namelist[fileno]->d_name);
--- 174,189 ----
      for (fileno = 0; fileno < numFiles; fileno++)
      {
+         char       *vm_offset = strstr(namelist[fileno]->d_name, "_vm");
+         bool        is_vm_file = false;
+
+         /* Is a visibility map file? (name ends with _vm) */
+         if (vm_offset && strlen(vm_offset) == strlen("_vm"))
+             is_vm_file = true;
+
          if (strncmp(namelist[fileno]->d_name, scandir_file_pattern,
!                     strlen(scandir_file_pattern)) == 0 &&
!             (!is_vm_file || !vm_crashsafe_change))
          {
              snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
                       namelist[fileno]->d_name);
On Fri, Aug 19, 2011 at 11:22 AM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> > I am happy to have pg_upgrade skip upgrading visibility map files --- it >> > already has code to conditionally process them because they only exist >> > in >= 8.4: >> > >> > /* fsm/vm files added in PG 8.4 */ >> > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) >> > { >> > /* >> > * Copy/link any fsm and vm files, if they exist >> > */ >> > >> > Just give the word and it will be done. >> >> I hereby give the word. :-) >> >> Specifically, we need to skip copying vm files (only) if coming from a >> version prior to this commit: >> >> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c > > Done with the attached, applied patch. There was no cat-version bump > from that commit (because the format didn't change, just the > crash-safeness) so I picked the first cat-version change after this > commit. This is only a pg_upgrade 9.2+ issue. Thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Well, that would certainly be alarming if true, but I don't think it
> is. As far as I can see, the overhead of making the visibility map
> crash-safe is just (1) a very small percentage increase in the work
> being done by VACUUM and (2) a slight possibility of extra work done
> by a foreground process if the visibility map bit changes at almost
> exactly the same time the process was about to insert, update, or
> delete a tuple.
Let's forget the overhead posed by vacuum. Can you please point me to the design that describes the second overhead in detail?
Thanks.
If you are following the same design that Heikki put forward, then there is a problem with it in keeping the bits in the page and the bits in the visibility map in sync, which we have already discussed.
On 19.08.2011 21:06, Gokulakannan Somasundaram wrote: > If you are following the same design that Heikki put forward, then there is > a problem with it in maintaining the bits in page and the bits in visibility > map in sync, which we have already discussed. Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:
>> If you are following the same design that Heikki put forward, then there is
>> a problem with it in maintaining the bits in page and the bits in visibility
>> map in sync, which we have already discussed.
>
> Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record.
In that case, say 100 sessions are trying to update records which fall under the 8000*4 heap pages (I assume 2 bits per heap page in the visibility map, so 8 * 1024 * 4 exactly) covered by one page of the visibility map; won't it make the 99 sessions wait for that visibility map page while holding the exclusive locks on the 99 heap pages?
Are we going to say that these kinds of situations occur very rarely? Or that the loss of scalability in these situations is worth the performance during read-heavy workloads?
In any case, making a database go through all this extra overhead when it doesn't even use any index-only scans definitely should be given some thought.
Gokul.
On Sat, Aug 20, 2011 at 2:51 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record.
>
> In that case, say 100 sessions are trying to update records which fall under the 8000*4 heap pages (I assume 2 bits per heap page in the visibility map, so 8 * 1024 * 4 exactly) covered by one page of the visibility map; won't it make the 99 sessions wait for that visibility map page while holding the exclusive locks on the 99 heap pages?
>
> Are we going to say that these kinds of situations occur very rarely? Or that the loss of scalability in these situations is worth the performance during read-heavy workloads?
>
> In any case, making a database go through all this extra overhead when it doesn't even use any index-only scans definitely should be given some thought.
>
> Gokul.
Please consider that the update I mentioned above occurs and changes the visibility bit of the page.
On Fri, Aug 19, 2011 at 2:51 PM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> >> On 19.08.2011 21:06, Gokulakannan Somasundaram wrote: >>> >>> If you are following the same design that Heikki put forward, then there >>> is >>> a problem with it in maintaining the bits in page and the bits in >>> visibility >>> map in sync, which we have already discussed. >> >> Are you referring to this: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I >> believe Robert's changes to make the visibility map crash-safe covers that. >> Clearing the bit in the visibility map now happens within the same critical >> section as clearing the flag on the heap page and writing th WAL record. >> > In that case, say a 100 sessions are trying to update records which fall > under the 8000*4 heap pages( i assume 2 bits per visibility map - 8 * 1024 * > 4 exact) covered by one page of visibility map, There are about 8000 visibility map bytes per page, so about 64000 bits, each covering one page. So a visibility map page covers about 512MB of heap. > won't it make the 99 > sessions wait for that visibility map while holding the exclusive lock on > the 99 heap pages? Hmm, you have a point. If 100 backends simultaneously write to 100 different pages, and all of those pages are all-visible, then it's possible that they could end up fighting over the buffer content lock on the visibility map page. But why would you expect that to matter? In a heavily updated table, the proportion of visibility map bits that are set figures to be quite low, since they're only set during VACUUM. To have 100 backends simultaneously pick different pages to write, each of which is all-visible, seems really unlucky. Even if it does happen from time to time, I suspect the effects would be largely masked by WALInsertLock contention.
The visibility map content lock is only taken very briefly, whereas the operations protected by WALInsertLock are much more complex. This does, however, remind me of two other points: 1. Heikki's idea of trying to set visibility map bits more aggressively is probably a good one, but it would be possible to overdo it, because setting visibility map bits is not free. It has an immediate cost - in that we have to write xlog - and a deferred cost - in that it will impose overhead when those pages are re-dirtied. At the moment, I think we're probably too far in the opposite direction - i.e. we leave the visibility map bits unset for too long, leading to a massive amount of deferred work that gets done all at once when VACUUM finally runs. But we shouldn't overcorrect. 2. While we're tinkering with the visibility map, we should think about whether it makes sense to carve out some more bits for such purposes as we may in the future require. Even if we allowed each heap page a byte in the visibility map instead of a single bit, the visibility map would still be roughly 1000 times smaller than the heap; and if there are any situations where the page-level locks become choke points, this would mitigate that effect. There might also be some advantage in that bytes can be atomically set, while bits can't, although I can't immediately think how we'd leverage that. Alternatively, we could widen the field to something less than a full byte, like 2 or 4 bits, if that seems better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
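Robert's coverage figures are easy to check. A back-of-the-envelope sketch assuming the standard 8192-byte block size and a 24-byte page header (and ignoring any VM-specific reserved space):

```python
# Back-of-the-envelope check of the coverage figures above (a sketch;
# assumes the standard 8192-byte block size and a 24-byte page header,
# and ignores any VM-specific reserved space).

BLCKSZ = 8192
PAGE_HEADER = 24

usable_bytes = BLCKSZ - PAGE_HEADER          # VM payload bytes per VM page
bits_per_vm_page = usable_bytes * 8          # one bit per heap page
heap_bytes_per_vm_page = bits_per_vm_page * BLCKSZ

print(bits_per_vm_page)                         # 65344 ("about 64000")
print(heap_bytes_per_vm_page // (1024 * 1024))  # 510 MB ("about 512MB")

# Widening to one byte per heap page shrinks one VM page's coverage 8x,
# but the map stays roughly three orders of magnitude smaller than the heap:
print(usable_bytes * BLCKSZ // (1024 * 1024))   # 63 MB of heap per VM page
```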
On Fri, Aug 19, 2011 at 4:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, you have a point. If 100 backends simultaneously write to 100 > different pages, and all of those pages are all-visible, then it's > possible that they could end up fighting over the buffer content lock > on the visibility map page. But why would you expect that to matter? > In a heavily updated table, the proportion of visibility map bits that > are set figures to be quite low, since they're only set during VACUUM. > To have 100 backends simultaneously pick different pages to write > each of which is all-visible seems really unlucky. Even if it does > happen from time to time, I suspect the effects would be largely > masked by WALInsertLock contention. The visibility map content lock > is only taken very briefly, whereas the operations protected by > WALInsertLock are much more complex. Oh, snap. I see another possible problem here. At the time visibilitymap_clear() is called, we're already (and necessarily) holding a content lock on the buffer. And then we go get a content lock on the visibility map page, whose buffer number might be higher or lower than that of the heap page, possibly leading us to violate the rule that buffer content locks must be taken in increasing buffer number order. Maybe that's OK, because I can't see that we'd ever acquire any other buffer content lock while already holding a lock on the visibility map buffer. But given this logic, if we did do such a thing, it could result in an undetected deadlock. Hmm.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19.08.2011 23:02, Robert Haas wrote: > On Fri, Aug 19, 2011 at 2:51 PM, Gokulakannan Somasundaram > <gokul007@gmail.com> wrote: >> won't it make the 99 >> sessions wait for that visibility map while holding the exclusive lock on >> the 99 heap pages? > > Hmm, you have a point. If 100 backends simultaneously write to 100 > different pages, and all of those pages are all-visible, then it's > possible that they could end up fighting over the buffer content lock > on the visibility map page. But why would you expect that to matter? > In a heavily updated table, the proportion of visibility map bits that > are set figures to be quite low, since they're only set during VACUUM. > To have 100 backends simultaneously pick different pages to write > each of which is all-visible seems really unlucky. Even if it does > happen from time to time, I suspect the effects would be largely > masked by WALInsertLock contention. The visibility map content lock > is only taken very briefly, whereas the operations protected by > WALInsertLock are much more complex. The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 19.08.2011 23:17, Robert Haas wrote: > On Fri, Aug 19, 2011 at 4:02 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> Hmm, you have a point. If 100 backends simultaneously write to 100 >> different pages, and all of those pages are all-visible, then it's >> possible that they could end up fighting over the buffer content lock >> on the visibility map page. But why would you expect that to matter? >> In a heavily updated table, the proportion of visibility map bits that >> are set figures to be quite low, since they're only set during VACUUM. >> To have 100 backends simultaneously pick different pages to write >> each of which is all-visible seems really unlucky. Even if it does >> happen from time to time, I suspect the effects would be largely >> masked by WALInsertLock contention. The visibility map content lock >> is only taken very briefly, whereas the operations protected by >> WALInsertLock are much more complex. > > Oh, snap. I see another possible problem here. > > At the time visibilitymap_clear() is called, we're already (and > necessarily) holding a content lock on the buffer. And then we go get > a content lock on the visibility map page, whose buffer number might > be higher or lower than that of the heap page, possibly leading us to > violate the rule the buffer content locks must be taken increasing > buffer number order. Huh? The rule is that you have to acquire locks on heap pages in increasing page number order. That doesn't apply to the order between the heap and the visibility map. The rule we've established for that is that you have to acquire the lock on the heap page first, before locking the corresponding vm page. It would be good to add a comment about that to the header comment of RelationGetBufferForTuple(), there doesn't seem to be anything about the visibility map buffer arguments there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
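The rule Heikki describes - always lock the heap page before the corresponding visibility map page - is the standard fixed-order locking discipline for deadlock avoidance. A minimal generic illustration (plain Python threads, not PostgreSQL buffer manager code):

```python
import threading

# Minimal generic illustration of the fixed-order locking rule described
# above: every backend locks the heap page first and the visibility map
# page second, so two backends touching the same pair of pages can never
# deadlock. This is textbook lock ordering, not PostgreSQL code.

heap_page_lock = threading.Lock()
vm_page_lock = threading.Lock()
results = []

def clear_vm_bit(backend_id):
    with heap_page_lock:       # rule: heap page first...
        with vm_page_lock:     # ...then the visibility map page
            results.append(backend_id)

threads = [threading.Thread(target=clear_vm_bit, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3] -- every backend finished; no deadlock
```

Robert's concern above is precisely the case where some code path would take the locks in the opposite order; as long as no such path exists, the discipline holds.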
> The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers.
>
> --
> Heikki Linnakangas
a) First of all, it (the visibility map) should definitely have affected the scalability of Postgres in scenarios wherein updates occur during a batch time window. Maybe the increase in the speed of vacuums negates that effect.
b) Second, currently the index scans don't touch the visibility map, and in the future they are going to acquire a share lock on it. This should increase the contention.
c) Your statement: "The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not."
d) In addition, currently there is no WAL Logging, while the bit is cleared, which would not be the case in future and hence the exclusive lock held on the visibility map is going to be held for a longer time.
You might well see some performance improvement if you are testing this on anything with fewer than 4 cores. Bump up the core count and processor count and check whether this affects the load-throughput curve.
Thanks,
Gokul.
Hmm, you have a point. If 100 backends simultaneously write to 100
different pages, and all of those pages are all-visible, then it's
possible that they could end up fighting over the buffer content lock
on the visibility map page. But why would you expect that to matter?
In a heavily updated table, the proportion of visibility map bits that
are set figures to be quite low, since they're only set during VACUUM.
To have 100 backends simultaneously pick different pages to write
each of which is all-visible seems really unlucky. Even if it does
happen from time to time, I suspect the effects would be largely
masked by WALInsertLock contention. The visibility map content lock
is only taken very briefly, whereas the operations protected by
WALInsertLock are much more complex.
By your argument, if WALInsertLock is held for 't' seconds, you would have to be holding the visibility map lock for more than that time frame 't'. So the index scans have to wait to acquire the lock on the visibility map to check visibility. What we are trading off here is synchronization vs. I/O. Should we lose scalability for performance?
Gokul.
On Sat, Aug 20, 2011 at 4:48 PM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote:
> The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers.

I am talking about the contention time frame of the heap page. It will be increased. Consider my statement in conjunction with scenario 2.
By your argument, we can say that no one will create an index with a function like random(), time(), date(), broken operators, etc. It's hard to imagine an index into which a user will only want to insert and never select from. Even C++ provides the std::map infrastructure for objects, where the user owns the responsibility of writing operator< correctly. Other databases do the same. Going one step further, we already go back to such indexes if there are unique constraints or referential integrity constraints. But with all such caveats, it was decided we should not allow covering indexes, as they require going back to the index for updates/deletes.
On Sat, Aug 20, 2011 at 4:48 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > a) First of all, it(Visibility Map) should have definitely affected the > scalability of postgres in scenarios where in updates occur during a time > batch window. May be the increase in speed of vacuums negate that effect. I think that you have switched gears here and are in this paragraph talking about the 8.4-era visibility map changes rather than my recent work. There is zero evidence that those changes caused any performance problem. I've spent a large chunk of the last four months looking for scalability problems and I haven't come across any evidence that this is an issue. If you think it is, let's have a test case. > b) Second, currently the index scans don't touch the visibility map and in > future they are going to acquire share lock on that. This should increase > the contention. Maybe. Of course, we're only talking about cases where the index-only scan optimization is in use (which is not all of them). And even then, if bringing in the heap pages would have caused buffer evictions and the index-only scan avoids that, then you're trading contention for exclusive locks on the BufFreelistLock and buf mapping locks for shared-lock contention on the visibility map page. Furthermore, the latter will (except in very large relations) only need to be shared-locked and pinned once per scan, whereas you might easily have a case where each index probe forces replacement of a heap page. What I am slightly worried about is that our machinery for taking and releasing buffer pins is going to become a scalability bottleneck at some point, and certainly very hot pages like root index pages and visibility map pages could become hot-spots for such contention. But the experiments I've done so far suggest that there are other more serious bottlenecks that have to be addressed first. 
If it does rise to the list, we'll find a way to fix it, but I think skipping the index-only scan optimization is going to be a cure worse than the disease. > c) Your statement : "The contention on the VM buffer would be just as bad > whether you hold the heap page lock at the same time or not." > I am talking about the contention time frame of the heap page. It will be > increased Consider my statement in conjunction with the scenario 2. Sure, but here again, what is your evidence that this actually matters? It's not increasing by very much. > d) In addition, currently there is no WAL Logging, while the bit is cleared, > which would not be the case in future and hence the exclusive lock held on > the visibility map is going to be held for a longer time. This is false and has been false since the visibility map was first implemented. > You might definitely see some performance improvement, if you are testing > this in anything less than 4 cores. Bump up the core count and processor > count and check whether this affects the load-throughput curve. I'm fairly certain I already did that experiment, on a 32-core machine, but since the patch you're worried about went in two months ago, I'm a little fuzzy on the details. Maybe you should test it out yourself.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > by your argument, if WALInserLock is held for 't' seconds, you should > definitely be holding visibility map lock for more than time frame 't'. Nope, that's not how it works. Perhaps you should read the code. See, e.g., heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 20, 2011 at 5:06 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > By your argument, we can say that no-one will create a index with a function > like random(), time(), date(), broken operators etc. Its hard to imagine a > index in which a a user will only want to insert and never select on it. The whole point of this optimization is to make index access cheaper, not more expensive. You seem convinced it's done the opposite, but you haven't offered much evidence (or any test results) to support that proposition. > Even C++ provides std::map infrastructure for objects, where the user owns > the responsibility of writing the operator< correctly. Other databases do > the same. Even going one step ahead, we are already going back to such > indexes, if there is unique constraint/ referential integrity constraints. > But with all such caveats, it was decided we should not allow covering > indexes, as they require going back to the index for updates/deletes. What we decided NOT to do is put xmin/xmax/cid into the index tuple, for precisely the reasons you mention. That would be catastrophic both for write operations to the table, and for the size of the index. This approach is appealing precisely because a single visibility map page can cover a very large chunk of the heap. Is there a potential problem buried somewhere in there? Maybe. But if there is, I'm pretty sure it's something far less obvious than what you seem to be worried about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I think that you have switched gears here and are in this paragraph
talking about the 8.4-era visibility map changes rather than my recent
work. There is zero evidence that those changes caused any
performance problem. I've spent a large chunk of the last four months
looking for scalability problems and I haven't come across any
evidence that this is an issue. If you think it is, let's have a test
case.
Consider the TPC-C benchmark. Currently, on a four-core box, Oracle does 290,000 transactions per minute. Say we are doing around half of that, so a page should get filled quickly. If a vacuum runs while the tuples are not being touched by the MakePayment transaction, the page will be marked as all-visible. When the MakePayment transaction then updates, it has to clear that bit. If we test it out with too few warehouses, we may not see the effect. Of course, the PGPROC infrastructure for generating transaction ids might be the bigger culprit, but if you ignore that, this should be visible.
Maybe. Of course, we're only talking about cases where the index-only scan optimization is in use (which is not all of them).
But are you saying that the optimization of looking at the visibility map will happen only for index-only scans, and that normal index scans will not use the visibility map infrastructure? That's definitely a good idea and an improvement.
>> d) In addition, currently there is no WAL Logging, while the bit is cleared,
>> which would not be the case in future and hence the exclusive lock held on
>> the visibility map is going to be held for a longer time.
> This is false and has been false since the visibility map was first implemented.
I can't understand this. If you are not doing this, then it would cause consistency issues. Are you saying we have a crash-safe visibility map, but you don't follow the "log the change before changing the contents" WAL principle? If so, please explain in detail. If you are doing it the normal way, then you should be logging the changes before making the changes to the buffer, and during that time frame you should be holding the lock on the buffer. Heikki specifically pointed out that you have brought the WAL logging of the visibility map within the critical section.
Heikki's comments, I am quoting:
"I believe Robert's changes to make the visibility map crash-safe covers that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record."
I would be able to respond to your other statements, once we are clear here.
Thanks,
Gokul.
OK. I took a look at the patch you have supplied in http://postgresql.1045698.n5.nabble.com/crash-safe-visibility-map-take-five-td4377235.html
There is a code like this.
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ &vmbuffer);
}
Here, you are not making an entry into the WAL; there is an assumption that the two bits will stay in sync without any WAL entry. There is a chance that the visibility map might be affected by partial page writes, where the clearing of a particular bit might not have made it to disk. And I am thinking of a lot of such issues. Can you explain the background logic behind departing from the principle of WAL logging? What are the implemented principles that protect the visibility map pages?
Thanks,
Gokul.
> By your argument, we can say that no-one will create a index with a function
> like random(), time(), date(), broken operators etc. Its hard to imagine a
> index in which a a user will only want to insert and never select on it.

The whole point of this optimization is to make index access cheaper, not more expensive. You seem convinced it's done the opposite, but you haven't offered much evidence (or any test results) to support that proposition.
I hope you are referring to thick indexes / covering indexes / indexes with snapshot. Why do you think it's done the opposite? In fact, the other databases like Oracle, SQL Server, and Sybase all implement indexes with snapshot (that's the only kind they support). It makes the index tuple larger by 8 bytes, but avoids the heap fetch. I think I ran a couple of benchmarks when I submitted the patch and showed the improvement. The trade-off in that case was simple: size of the index vs. avoiding a disk I/O. The user still had the choice of creating indexes without snapshot (it was provided as an optional kind of index).
What we decided NOT to do is put xmin/xmax/cid into the index tuple,
for precisely the reasons you mention. That would be catastrophic
both for write operations to the table, and for the size of the index.
Why would it be catastrophic for write operations on the table? Please explain.
The trade-off in that case was simple. Size of the index Vs avoiding a disk I/O. There was no catastrophic damage on the size of the index, as far as i can see.
I made this point because Heikki pointed out that since no one is complaining about some performance problem, we can assume it doesn't exist. But the thick index proposal was shot down on the grounds that someone might create an index on a function like random(), time(), date(), or with broken operators, effectively meaning that you can insert into the index but cannot select back from it. We are already doing unique checks and referential integrity checks on that kind of index (which would all be wrong), but it was still decided we should not work in that area, to help the user not make the mistake of creating such indexes. So we should apply the same principle for decision-making here also.
Gokul.
On 21.08.2011 07:10, Gokulakannan Somasundaram wrote: >>> d) In addition, currently there is no WAL Logging, while the bit is > cleared, >>> which would not be the case in future and hence the exclusive lock held > on >>> the visibility map is going to be held for a longer time. > >> This is false and has been false since the visibility map was first > implemented. > > I can't understand this. If you are not doing this, then it would cause > consistency issues. Are you saying, we have a crash safe visibility map, but > you don't follow "log the change before changing the contents"/ WAL > principle. If so, please explain in detail. If you are doing it in the > normal way, then you should be logging the changes before making the changes > to the buffer and during that timeframe, you should be holding the lock on > the buffer. Heikki specifically pointed out, that you have brought in the > WAL Logging of visibility map, within the critical section. I think you two are talking slightly past each other. There is no extra WAL record written when a bit is cleared in the visibility map, there is just a flag in the WAL record of the heap insert/update/delete. That is what Robert was trying to explain, that part hasn't changed since 8.4. What *did* change, however, in master, when the visibility map was made crash-safe, is the duration the lock on the visibility map page is held. Before that, the visibility map page was locked only briefly *after* the changes to the heap page were already applied and WAL record written. Now, the VM page lock is acquired and released at the same time as the lock on the heap page. It's held while the heap page changes are made and WAL record is written. I believe that's what Gokulakannan was trying to point out, and is worried that you might get contention on the VM page lock now because it's held for a much longer duration. 
Gokulakannan, if you could come up with a test case that demonstrates that contention (or the lack thereof), that would be good. Otherwise we're just speculating. If it's an issue, perhaps we could release the VM page lock early. We're not updating the LSN on it, so we don't need to wait for the WAL record to be written, I think. It's a bit out of the ordinary, though, so I wouldn't like to do that without an actual test case that shows it's an issue. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 21.08.2011 07:41, Gokulakannan Somasundaram wrote: > On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram > >> <gokul007@gmail.com> wrote: >>> by your argument, if WALInserLock is held for 't' seconds, you should >>> definitely be holding visibility map lock for more than time frame 't'. >> >> Nope, that's not how it works. Perhaps you should read the code. >> See, e.g., heap_update(). >> >> -- >> > OK. I took a look at the patch you have supplied in > http://postgresql.1045698.n5.nabble.com/crash-safe-visibility-map-take-five-td4377235.html Here's the patch as it was committed: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=503c7305a1e379f95649eef1a694d0c1dbdc674a > There is a code like this. > > { > all_visible_cleared = true; > PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, > + ItemPointerGetBlockNumber(&(heaptup->t_self)), > +&vmbuffer); > } > > Here, you are not making an entry into the WAL. then there is an assumption > that the two bits will be in sync without any WAL entry. There is a chance > that the visibility map might be affected by partial page writes, where > clearing of a particular bit might not have been changed. And i am thinking > a lot of such issues. Can you just explain the background logic behind > ignoring the principle of WAL logging? What are the implemented principles, > that protect the Visibility map pages?? The all_visible_cleared flag is included in the WAL record of the insert (or update or delete). Partial page writes are not a problem, because we always fetch the VM page and clear the bit, regardless of the LSN on the VM page. PS. Robert, the LOCKING section in the header comment of visibilitymap.c is out-of-date: it claims that the VM bit is cleared after releasing the lock on the heap page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
>> The all_visible_cleared flag is included in the WAL record of the insert (or update or delete). Partial page writes are not a problem, because we
>> always fetch the VM page and clear the bit, regardless of the LSN on the VM page.

Two things:
a) First, my understanding of a checkpoint is that it flushes all the pages that got changed below a particular LSN. If we don't have an LSN in the visibility map, how will we be sure whether a visibility map page is getting flushed or not? With the following approach, I can see only one issue: if the heap page gets written and checkpointed but the visibility map doesn't get synced during the checkpoint, then there is a problem. Can you please explain how we get that assurance?
b) Even if we have a contention issue, the visibility map is a solution for a considerable number of database scenarios. But it should not become part of the default package. A table with no chance of index-only scans should not incur the extra overhead of crash-safe visibility maps. Those tables should be spared this extra overhead, as they don't have index-only scans.

Gokul.
a) First, my understanding of checkpoint is that it flushes all the pages, that got changed below a particular LSN. If we are not having a LSN in the visibility map, how we will be sure, that a visibility map page is getting flushed/not?
Please ignore this statement. I confused the checkpoints implemented in Postgres with those of some other database.
Gokul.
On Sun, Aug 21, 2011 at 12:10 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > Consider the TPC-C benchmark. Currently on a four core box, Oracle does > 290000 transactions per minute. Say we are doing something around half of > this. So a page should get quickly filled. If a vacuum runs and the > transactions are not touched by the MakePayment transaction, then it will be > marked as AllVisible. When the MakePayment transaction updates, it should > clear that bit. If we test it out, with too little warehouses, we may not > see the effect. Of Course the PGPROC infrastructure for generating > transaction ids might be the biggest culprit, but if you ignore that this > should be visible. Actual testing does not appear to show a bottleneck in either of these areas. >> Maybe. Of course, we're only talking about cases where the index-only >> scan optimization is in use (which is not all of them). > > But are you saying that, the optimization of looking at visibility map will > happen only for Index-only scans and will not use the visibility map > infrastructure for the normal index scans? That's definitely a good idea and > improvement. Well, yes. And not only that, but the index-only scans patch has been posted on this very mailing list, along with detailed submission notes. So you could go read, or try it, before jumping to the conclusion that it doesn't work. >>> d) In addition, currently there is no WAL Logging, while the bit is >>> cleared, >>> which would not be the case in future and hence the exclusive lock held >>> on >>> the visibility map is going to be held for a longer time. > >> This is false and has been false since the visibility map was first >> implemented. > > I can't understand this. If you are not doing this, then it would cause > consistency issues. Are you saying, we have a crash safe visibility map, but > you don't follow "log the change before changing the contents"/ WAL > principle. If so, please explain in detail. 
If you are doing it in the > normal way, then you should be logging the changes before making the changes > to the buffer and during that timeframe, you should be holding the lock on > the buffer. Heikki specifically pointed out, that you have brought in the > WAL Logging of visibility map, within the critical section. > > Heikki's comments, i am quoting: > "I believe Robert's changes to make the visibility map crash-safe covers > that. Clearing the bit in the visibility map now happens within the same > critical section as clearing the flag on the heap page and writing the WAL > record." > > I would be able to respond to your other statements, once we are clear here. There are extensive comments on this in visibilitymap.c and, in heapam.c, heap_xlog_visible(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
There are extensive comments on this in visibilitymap.c and, in
heapam.c, heap_xlog_visible().
I went through the design again and again. I am convinced that this should not have any functional bugs and should not cause many performance issues.
Nice thinking on bypassing the WAL logging.
Gokul.
On Sun, Aug 21, 2011 at 3:13 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > PS. Robert, the LOCKING section in the header comment of visibilitymap.c is > out-of-date: it claims that the VM bit is cleared after releasing the lock > on the heap page. Fixed, along with your other observation a couple of emails upthread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company