Thread: Index-only scan performance regression

Index-only scan performance regression

From
Dean Rasheed
Date:
Given a freshly created table (not vacuumed), and a query that uses an
index-only scan, for example:

CREATE TABLE foo(a int PRIMARY KEY);
INSERT INTO foo SELECT * FROM generate_series(1,1000000);
ANALYSE foo;

EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
time=23.646..23.646 rows=1 loops=1)
   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42
rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
         Index Cond: (a <= 10000)
         Heap Fetches: 10000
 Total runtime: 23.673 ms
(5 rows)


the actual performance is much worse than the equivalent index scan,
as used in 9.1 and earlier:

SET enable_indexonlyscan = off;
EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
time=3.219..3.219 rows=1 loops=1)
   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975
width=0) (actual time=0.014..2.302 rows=10000 loops=1)
         Index Cond: (a <= 10000)
 Total runtime: 3.240 ms
(4 rows)


Obviously this is the worst-case for an index-only scan, since there
is no visibility map, and it has to fetch each tuple from the heap,
but ideally this should perform around the same as an ordinary index
scan, since it's doing pretty much the same work.

Digging around, it looks like the additional cost is coming from
visibilitymap_test(), which is calling smgrexists() for each tuple, to
see if the visibility map file has been created. So it's doing a file
access check for each row, while the visibility map doesn't exist.

I'm not really familiar with this code, but a possible fix seems to be
to send an invalidation message in vm_extend() when it creates or
extends the visibility map, so that vm_readbuf() only has to re-check
the visibility map file if smgr_vm_nblocks has been invalidated. With
this change (attached) the worst-case index-only scan time becomes
around the same as the index scan time.

Regards,
Dean

Attachment

Re: Index-only scan performance regression

From
Robert Haas
Date:
On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Given a freshly created table (not vacuumed), and a query that uses an
> index-only scan, for example:
>
> CREATE TABLE foo(a int PRIMARY KEY);
> INSERT INTO foo SELECT * FROM generate_series(1,1000000);
> ANALYSE foo;
>
> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>
>                                                            QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
> time=23.646..23.646 rows=1 loops=1)
>   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42
> rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
>         Index Cond: (a <= 10000)
>         Heap Fetches: 10000
>  Total runtime: 23.673 ms
> (5 rows)
>
>
> the actual performance is much worse than the equivalent index scan,
> as used in 9.1 and earlier:
>
> SET enable_indexonlyscan = off;
> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>
>                                                         QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
> time=3.219..3.219 rows=1 loops=1)
>   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975
> width=0) (actual time=0.014..2.302 rows=10000 loops=1)
>         Index Cond: (a <= 10000)
>  Total runtime: 3.240 ms
> (4 rows)
>
>
> Obviously this is the worst-case for an index-only scan, since there
> is no visibility map, and it has to fetch each tuple from the heap,
> but ideally this should perform around the same as an ordinary index
> scan, since it's doing pretty much the same work.

Yeah.

> Digging around, it looks like the additional cost is coming from
> visibilitymap_test(), which is calling smgrexists() for each tuple, to
> see if the visibility map file has been created. So it's doing a file
> access check for each row, while the visibility map doesn't exist.
>
> I'm not really familiar with this code, but a possible fix seems to be
> to send an invalidation message in vm_extend() when it creates or
> extends the visibility map, so that vm_readbuf() only has to re-check
> the visibility map file if smgr_vm_nblocks has been invalidated. With
> this change (attached) the worst-case index-only scan time becomes
> around the same as the index scan time.

I think that the additional sinval message might not do much good,
because it won't necessarily be seen until the next time any given
other backend does AcceptInvalidationMessages(), which might be a
while.  That's not an issue for truncation because it requires
AccessExclusiveLock, so we know that any subsequent access to the
table will involve taking a heavyweight lock, which will
AcceptInvalidationMessages().  Making vismap extension require
AccessExclusiveLock would be a cure worse than the disease.

In the case of an index-only scan, it wouldn't be the end of the world
if we failed to notice that the relation had been extended.  The worst
case would be that we wouldn't read the visibility map buffer and, as
you say, that shouldn't be any worse than a normal index scan.  The
only time we had better not fail to notice that the visibility map has
been extended is when we're asked to clear a bit, because failure to
notice that the new page is there - with a set bit that must now be
cleared - could result in queries returning wrong answers.  So one
possibility would be to tweak the relcache stuff to have a flag
indicating whether the value is cached, and make that separate from
the cached value.  Then, if the value is cached, and we're doing
anything other than clearing a bit, we just say, eh, sorry, page isn't
there.  If we're clearing a bit then we go to the trouble of
revalidating.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Index-only scan performance regression

From
Dean Rasheed
Date:
On 31 January 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> Given a freshly created table (not vacuumed), and a query that uses an
>> index-only scan, for example:
>>
>> CREATE TABLE foo(a int PRIMARY KEY);
>> INSERT INTO foo SELECT * FROM generate_series(1,1000000);
>> ANALYSE foo;
>>
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                            QUERY PLAN
>>
-----------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
>> time=23.646..23.646 rows=1 loops=1)
>>   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42
>> rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
>>         Index Cond: (a <= 10000)
>>         Heap Fetches: 10000
>>  Total runtime: 23.673 ms
>> (5 rows)
>>
>>
>> the actual performance is much worse than the equivalent index scan,
>> as used in 9.1 and earlier:
>>
>> SET enable_indexonlyscan = off;
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                         QUERY PLAN
>>
-----------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
>> time=3.219..3.219 rows=1 loops=1)
>>   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975
>> width=0) (actual time=0.014..2.302 rows=10000 loops=1)
>>         Index Cond: (a <= 10000)
>>  Total runtime: 3.240 ms
>> (4 rows)
>>
>>
>> Obviously this is the worst-case for an index-only scan, since there
>> is no visibility map, and it has to fetch each tuple from the heap,
>> but ideally this should perform around the same as an ordinary index
>> scan, since it's doing pretty much the same work.
>
> Yeah.
>
>> Digging around, it looks like the additional cost is coming from
>> visibilitymap_test(), which is calling smgrexists() for each tuple, to
>> see if the visibility map file has been created. So it's doing a file
>> access check for each row, while the visibility map doesn't exist.
>>
>> I'm not really familiar with this code, but a possible fix seems to be
>> to send an invalidation message in vm_extend() when it creates or
>> extends the visibility map, so that vm_readbuf() only has to re-check
>> the visibility map file if smgr_vm_nblocks has been invalidated. With
>> this change (attached) the worst-case index-only scan time becomes
>> around the same as the index scan time.
>
> I think that the additional sinval message might not do much good,
> because it won't necessarily be seen until the next time any given
> other backend does AcceptInvalidationMessages(), which might be a
> while.  That's not an issue for truncation because it requires
> AccessExclusiveLock, so we know that any subsequent access to the
> table will involve taking a heavyweight lock, which will
> AcceptInvalidationMessages().  Making vismap extension require
> AccessExclusiveLock would be a cure worse than the disease.
>
> In the case of an index-only scan, it wouldn't be the end of the world
> if we failed to notice that the relation had been extended.  The worst
> case would be that we wouldn't read the visibility map buffer and, as
> you say, that shouldn't be any worse than a normal index scan.  The
> only time we had better not fail to notice that the visibility map has
> been extended is when we're asked to clear a bit, because failure to
> notice that the new page is there - with a set bit that must now be
> cleared - could result in queries returning wrong answers.  So one
> possibility would be to tweak the relcache stuff to have a flag
> indicating whether the value is cached, and make that separate from
> the cached value.  Then, if the value is cached, and we're doing
> anything other than clearing a bit, we just say, eh, sorry, page isn't
> there.  If we're clearing a bit then we go to the trouble of
> revalidating.
>
> Thoughts?
>

In the case when we're asked to clear a bit, it would first try to pin
the relevant page, which would go through vm_readbuf() with extend set
to true. Then vm_extend() would notice that the visibility map had
already been extended, and it would read in the new page with the set
bit. So this case would continue to work, wouldn't it?

Regards,
Dean


Re: Index-only scan performance regression

From
Robert Haas
Date:
On Tue, Jan 31, 2012 at 2:15 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> In the case when we're asked to clear a bit, it would first try to pin
> the relevant page, which would go through vm_readbuf() with extend set
> to true. Then vm_extend() would notice that the visibility map had
> already been extended, and it would read in the new page with the set
> bit. So this case would continue to work, wouldn't it?

Ah, maybe.  I haven't tried to code it up.  Do you want to take a crack at it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Index-only scan performance regression

From
Dean Rasheed
Date:
On 31 January 2012 19:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 31, 2012 at 2:15 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> In the case when we're asked to clear a bit, it would first try to pin
>> the relevant page, which would go through vm_readbuf() with extend set
>> to true. Then vm_extend() would notice that the visibility map had
>> already been extended, and it would read in the new page with the set
>> bit. So this case would continue to work, wouldn't it?
>
> Ah, maybe.  I haven't tried to code it up.  Do you want to take a crack at it?
>

Well the patch upthread works when it comes to reducing the cost of
testing bits in the visibility map by not bothering to look for pages
which appear to be past the end of the file. And when pinning pages
for setting or clearing bits, it still checks for new pages created by
other backends. The thing I'm unsure about is whether sending sinval
messages when the visibility map is extended is a good idea. It could
work without that, but then a backend would never notice that the
visibility map had been created or extended by another process, so
it's index-only scans would continue to perform like normal index
scans. The sinval messages would cure that (eventually) but I'm not
sure what other side-effects that might have.

Regards,
Dean


Re: Index-only scan performance regression

From
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> The thing I'm unsure about is whether sending sinval
> messages when the visibility map is extended is a good idea.

Seems perfectly reasonable to me.  They'd occur so seldom as to be
more than repaid if we can scrape some cost out of the mainline paths.

The real objection to this probably is that if it only saves anything
for tables that don't have a VM yet, it's dubious whether it's worth
doing.  But if we can avoid wasted checks for VM extension as well,
then I think it's probably a no-brainer.
        regards, tom lane


Re: Index-only scan performance regression

From
Dean Rasheed
Date:
On 31 January 2012 23:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> The thing I'm unsure about is whether sending sinval
>> messages when the visibility map is extended is a good idea.
>
> Seems perfectly reasonable to me.  They'd occur so seldom as to be
> more than repaid if we can scrape some cost out of the mainline paths.
>

OK, thanks. That's good.


> The real objection to this probably is that if it only saves anything
> for tables that don't have a VM yet, it's dubious whether it's worth
> doing.  But if we can avoid wasted checks for VM extension as well,
> then I think it's probably a no-brainer.
>
>                        regards, tom lane

Yes it applies in the same way to VM extension - if the table has
grown and the VM has not yet been extended, but I don't see why that
is any worse than the case of not having a VM yet.

Actually I think that this is not such an uncommon case - for a table
which has only had data inserted - no deletes or updates - it is
tempting to think that ANALYSE is sufficient, and that there is no
need to VACUUM. If it were simply the case that this caused an
index-only scan to have no real benefit, you might be willing to live
with normal index scan performance. But actually it causes a very
significant performance regression beyond that, to well below 9.1
performance.

Regards,
Dean


Re: Index-only scan performance regression

From
Robert Haas
Date:
On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> The real objection to this probably is that if it only saves anything
>> for tables that don't have a VM yet, it's dubious whether it's worth
>> doing.  But if we can avoid wasted checks for VM extension as well,
>> then I think it's probably a no-brainer.
>
> Yes it applies in the same way to VM extension - if the table has
> grown and the VM has not yet been extended, but I don't see why that
> is any worse than the case of not having a VM yet.
>
> Actually I think that this is not such an uncommon case - for a table
> which has only had data inserted - no deletes or updates - it is
> tempting to think that ANALYSE is sufficient, and that there is no
> need to VACUUM. If it were simply the case that this caused an
> index-only scan to have no real benefit, you might be willing to live
> with normal index scan performance. But actually it causes a very
> significant performance regression beyond that, to well below 9.1
> performance.

So, I guess the trade-off here is that, since sinval messages aren't
processed immediately, we often won't notice the VM extension until
the next statement starts, whereas with the current implementation, we
notice it right away.  On the other hand, noticing it right away is
costing us a system call or two per tuple.  So on further thought, I
think we should do this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Index-only scan performance regression

From
Robert Haas
Date:
On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>> The real objection to this probably is that if it only saves anything
>>> for tables that don't have a VM yet, it's dubious whether it's worth
>>> doing.  But if we can avoid wasted checks for VM extension as well,
>>> then I think it's probably a no-brainer.
>>
>> Yes it applies in the same way to VM extension - if the table has
>> grown and the VM has not yet been extended, but I don't see why that
>> is any worse than the case of not having a VM yet.
>>
>> Actually I think that this is not such an uncommon case - for a table
>> which has only had data inserted - no deletes or updates - it is
>> tempting to think that ANALYSE is sufficient, and that there is no
>> need to VACUUM. If it were simply the case that this caused an
>> index-only scan to have no real benefit, you might be willing to live
>> with normal index scan performance. But actually it causes a very
>> significant performance regression beyond that, to well below 9.1
>> performance.
>
> So, I guess the trade-off here is that, since sinval messages aren't
> processed immediately, we often won't notice the VM extension until
> the next statement starts, whereas with the current implementation, we
> notice it right away.  On the other hand, noticing it right away is
> costing us a system call or two per tuple.  So on further thought, I
> think we should do this.

Patch committed.  I moved the smgr inval to after the actual extension
is done, which seems superior, and adjusted the comments slightly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Index-only scan performance regression

From
Dean Rasheed
Date:
On 2 February 2012 01:40, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So, I guess the trade-off here is that, since sinval messages aren't
>> processed immediately, we often won't notice the VM extension until
>> the next statement starts, whereas with the current implementation, we
>> notice it right away.  On the other hand, noticing it right away is
>> costing us a system call or two per tuple.  So on further thought, I
>> think we should do this.
>

Yes, that's a nice summary.

> Patch committed.  I moved the smgr inval to after the actual extension
> is done, which seems superior, and adjusted the comments slightly.
>

Thanks.

Regards,
Dean