Thread: Index-only scan performance regression
Given a freshly created table (not vacuumed), and a query that uses an
index-only scan, for example:

CREATE TABLE foo(a int PRIMARY KEY);
INSERT INTO foo SELECT * FROM generate_series(1,1000000);
ANALYSE foo;

EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=23.646..23.646 rows=1 loops=1)
   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
         Index Cond: (a <= 10000)
         Heap Fetches: 10000
 Total runtime: 23.673 ms
(5 rows)

the actual performance is much worse than the equivalent index scan,
as used in 9.1 and earlier:

SET enable_indexonlyscan = off;
EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=3.219..3.219 rows=1 loops=1)
   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.014..2.302 rows=10000 loops=1)
         Index Cond: (a <= 10000)
 Total runtime: 3.240 ms
(4 rows)

Obviously this is the worst case for an index-only scan, since there
is no visibility map and it has to fetch each tuple from the heap, but
ideally this should perform around the same as an ordinary index scan,
since it's doing pretty much the same work.

Digging around, it looks like the additional cost is coming from
visibilitymap_test(), which is calling smgrexists() for each tuple, to
see if the visibility map file has been created. So it's doing a file
access check for each row, while the visibility map doesn't exist.
I'm not really familiar with this code, but a possible fix seems to be
to send an invalidation message in vm_extend() when it creates or
extends the visibility map, so that vm_readbuf() only has to re-check
the visibility map file if smgr_vm_nblocks has been invalidated. With
this change (attached) the worst-case index-only scan time becomes
around the same as the index scan time.

Regards,
Dean
On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> Given a freshly created table (not vacuumed), and a query that uses an
> index-only scan, for example:
>
> CREATE TABLE foo(a int PRIMARY KEY);
> INSERT INTO foo SELECT * FROM generate_series(1,1000000);
> ANALYSE foo;
>
> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>
>                                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=23.646..23.646 rows=1 loops=1)
>    ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
>          Index Cond: (a <= 10000)
>          Heap Fetches: 10000
>  Total runtime: 23.673 ms
> (5 rows)
>
> the actual performance is much worse than the equivalent index scan,
> as used in 9.1 and earlier:
>
> SET enable_indexonlyscan = off;
> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>
>                                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=3.219..3.219 rows=1 loops=1)
>    ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.014..2.302 rows=10000 loops=1)
>          Index Cond: (a <= 10000)
>  Total runtime: 3.240 ms
> (4 rows)
>
> Obviously this is the worst-case for an index-only scan, since there
> is no visibility map, and it has to fetch each tuple from the heap,
> but ideally this should perform around the same as an ordinary index
> scan, since it's doing pretty much the same work.

Yeah.

> Digging around, it looks like the additional cost is coming from
> visibilitymap_test(), which is calling smgrexists() for each tuple, to
> see if the visibility map file has been created. So it's doing a file
> access check for each row, while the visibility map doesn't exist.
>
> I'm not really familiar with this code, but a possible fix seems to be
> to send an invalidation message in vm_extend() when it creates or
> extends the visibility map, so that vm_readbuf() only has to re-check
> the visibility map file if smgr_vm_nblocks has been invalidated. With
> this change (attached) the worst-case index-only scan time becomes
> around the same as the index scan time.

I think that the additional sinval message might not do much good,
because it won't necessarily be seen until the next time any given
other backend does AcceptInvalidationMessages(), which might be a
while. That's not an issue for truncation, because truncation requires
AccessExclusiveLock, so we know that any subsequent access to the
table will involve taking a heavyweight lock, which will
AcceptInvalidationMessages(). Making vismap extension require
AccessExclusiveLock would be a cure worse than the disease.

In the case of an index-only scan, it wouldn't be the end of the world
if we failed to notice that the relation had been extended. The worst
case would be that we wouldn't read the visibility map buffer and, as
you say, that shouldn't be any worse than a normal index scan. The
only time we had better not fail to notice that the visibility map has
been extended is when we're asked to clear a bit, because failure to
notice that the new page is there - with a set bit that must now be
cleared - could result in queries returning wrong answers. So one
possibility would be to tweak the relcache stuff to have a flag
indicating whether the value is cached, and make that separate from
the cached value. Then, if the value is cached, and we're doing
anything other than clearing a bit, we just say, eh, sorry, page isn't
there. If we're clearing a bit then we go to the trouble of
revalidating.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31 January 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> Given a freshly created table (not vacuumed), and a query that uses an
>> index-only scan, for example:
>>
>> CREATE TABLE foo(a int PRIMARY KEY);
>> INSERT INTO foo SELECT * FROM generate_series(1,1000000);
>> ANALYSE foo;
>>
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                              QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=23.646..23.646 rows=1 loops=1)
>>    ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
>>          Index Cond: (a <= 10000)
>>          Heap Fetches: 10000
>>  Total runtime: 23.673 ms
>> (5 rows)
>>
>> the actual performance is much worse than the equivalent index scan,
>> as used in 9.1 and earlier:
>>
>> SET enable_indexonlyscan = off;
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                           QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=3.219..3.219 rows=1 loops=1)
>>    ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.014..2.302 rows=10000 loops=1)
>>          Index Cond: (a <= 10000)
>>  Total runtime: 3.240 ms
>> (4 rows)
>>
>> Obviously this is the worst-case for an index-only scan, since there
>> is no visibility map, and it has to fetch each tuple from the heap,
>> but ideally this should perform around the same as an ordinary index
>> scan, since it's doing pretty much the same work.
>
> Yeah.
>
>> Digging around, it looks like the additional cost is coming from
>> visibilitymap_test(), which is calling smgrexists() for each tuple, to
>> see if the visibility map file has been created. So it's doing a file
>> access check for each row, while the visibility map doesn't exist.
>>
>> I'm not really familiar with this code, but a possible fix seems to be
>> to send an invalidation message in vm_extend() when it creates or
>> extends the visibility map, so that vm_readbuf() only has to re-check
>> the visibility map file if smgr_vm_nblocks has been invalidated. With
>> this change (attached) the worst-case index-only scan time becomes
>> around the same as the index scan time.
>
> I think that the additional sinval message might not do much good,
> because it won't necessarily be seen until the next time any given
> other backend does AcceptInvalidationMessages(), which might be a
> while. That's not an issue for truncation because it requires
> AccessExclusiveLock, so we know that any subsequent access to the
> table will involve taking a heavyweight lock, which will
> AcceptInvalidationMessages(). Making vismap extension require
> AccessExclusiveLock would be a cure worse than the disease.
>
> In the case of an index-only scan, it wouldn't be the end of the world
> if we failed to notice that the relation had been extended. The worst
> case would be that we wouldn't read the visibility map buffer and, as
> you say, that shouldn't be any worse than a normal index scan. The
> only time we had better not fail to notice that the visibility map has
> been extended is when we're asked to clear a bit, because failure to
> notice that the new page is there - with a set bit that must now be
> cleared - could result in queries returning wrong answers. So one
> possibility would be to tweak the relcache stuff to have a flag
> indicating whether the value is cached, and make that separate from
> the cached value. Then, if the value is cached, and we're doing
> anything other than clearing a bit, we just say, eh, sorry, page isn't
> there. If we're clearing a bit then we go to the trouble of
> revalidating.
>
> Thoughts?

In the case when we're asked to clear a bit, it would first try to pin
the relevant page, which would go through vm_readbuf() with extend set
to true. Then vm_extend() would notice that the visibility map had
already been extended, and it would read in the new page with the set
bit. So this case would continue to work, wouldn't it?

Regards,
Dean
On Tue, Jan 31, 2012 at 2:15 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> In the case when we're asked to clear a bit, it would first try to pin
> the relevant page, which would go through vm_readbuf() with extend set
> to true. Then vm_extend() would notice that the visibility map had
> already been extended, and it would read in the new page with the set
> bit. So this case would continue to work, wouldn't it?

Ah, maybe. I haven't tried to code it up. Do you want to take a crack at it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31 January 2012 19:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 31, 2012 at 2:15 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> In the case when we're asked to clear a bit, it would first try to pin
>> the relevant page, which would go through vm_readbuf() with extend set
>> to true. Then vm_extend() would notice that the visibility map had
>> already been extended, and it would read in the new page with the set
>> bit. So this case would continue to work, wouldn't it?
>
> Ah, maybe. I haven't tried to code it up. Do you want to take a crack at it?

Well, the patch upthread works when it comes to reducing the cost of
testing bits in the visibility map, by not bothering to look for pages
which appear to be past the end of the file. And when pinning pages
for setting or clearing bits, it still checks for new pages created by
other backends.

The thing I'm unsure about is whether sending sinval messages when the
visibility map is extended is a good idea. It could work without that,
but then a backend would never notice that the visibility map had been
created or extended by another process, so its index-only scans would
continue to perform like normal index scans. The sinval messages would
cure that (eventually), but I'm not sure what other side-effects they
might have.

Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> The thing I'm unsure about is whether sending sinval messages when the
> visibility map is extended is a good idea.

Seems perfectly reasonable to me. They'd occur so seldom as to be more
than repaid if we can scrape some cost out of the mainline paths.

The real objection to this probably is that if it only saves anything
for tables that don't have a VM yet, it's dubious whether it's worth
doing. But if we can avoid wasted checks for VM extension as well,
then I think it's probably a no-brainer.

			regards, tom lane
On 31 January 2012 23:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> The thing I'm unsure about is whether sending sinval messages when the
>> visibility map is extended is a good idea.
>
> Seems perfectly reasonable to me. They'd occur so seldom as to be more
> than repaid if we can scrape some cost out of the mainline paths.

OK, thanks. That's good.

> The real objection to this probably is that if it only saves anything
> for tables that don't have a VM yet, it's dubious whether it's worth
> doing. But if we can avoid wasted checks for VM extension as well,
> then I think it's probably a no-brainer.

Yes, it applies in the same way to VM extension - if the table has
grown and the VM has not yet been extended - but I don't see why that
is any worse than the case of not having a VM yet.

Actually, I think this is not such an uncommon case: for a table which
has only had data inserted - no deletes or updates - it is tempting to
think that ANALYSE is sufficient, and that there is no need to VACUUM.
If it were simply the case that this caused an index-only scan to have
no real benefit, you might be willing to live with normal index scan
performance. But actually it causes a very significant performance
regression beyond that, to well below 9.1 performance.

Regards,
Dean
On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> The real objection to this probably is that if it only saves anything
>> for tables that don't have a VM yet, it's dubious whether it's worth
>> doing. But if we can avoid wasted checks for VM extension as well,
>> then I think it's probably a no-brainer.
>
> Yes it applies in the same way to VM extension - if the table has
> grown and the VM has not yet been extended, but I don't see why that
> is any worse than the case of not having a VM yet.
>
> Actually I think that this is not such an uncommon case - for a table
> which has only had data inserted - no deletes or updates - it is
> tempting to think that ANALYSE is sufficient, and that there is no
> need to VACUUM. If it were simply the case that this caused an
> index-only scan to have no real benefit, you might be willing to live
> with normal index scan performance. But actually it causes a very
> significant performance regression beyond that, to well below 9.1
> performance.

So, I guess the trade-off here is that, since sinval messages aren't
processed immediately, we often won't notice the VM extension until
the next statement starts, whereas with the current implementation we
notice it right away. On the other hand, noticing it right away is
costing us a system call or two per tuple. So on further thought, I
think we should do this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>> The real objection to this probably is that if it only saves anything
>>> for tables that don't have a VM yet, it's dubious whether it's worth
>>> doing. But if we can avoid wasted checks for VM extension as well,
>>> then I think it's probably a no-brainer.
>>
>> Yes it applies in the same way to VM extension - if the table has
>> grown and the VM has not yet been extended, but I don't see why that
>> is any worse than the case of not having a VM yet.
>>
>> Actually I think that this is not such an uncommon case - for a table
>> which has only had data inserted - no deletes or updates - it is
>> tempting to think that ANALYSE is sufficient, and that there is no
>> need to VACUUM. If it were simply the case that this caused an
>> index-only scan to have no real benefit, you might be willing to live
>> with normal index scan performance. But actually it causes a very
>> significant performance regression beyond that, to well below 9.1
>> performance.
>
> So, I guess the trade-off here is that, since sinval messages aren't
> processed immediately, we often won't notice the VM extension until
> the next statement starts, whereas with the current implementation, we
> notice it right away. On the other hand, noticing it right away is
> costing us a system call or two per tuple. So on further thought, I
> think we should do this.

Patch committed. I moved the smgr inval to after the actual extension
is done, which seems superior, and adjusted the comments slightly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2 February 2012 01:40, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So, I guess the trade-off here is that, since sinval messages aren't
>> processed immediately, we often won't notice the VM extension until
>> the next statement starts, whereas with the current implementation, we
>> notice it right away. On the other hand, noticing it right away is
>> costing us a system call or two per tuple. So on further thought, I
>> think we should do this.

Yes, that's a nice summary.

> Patch committed. I moved the smgr inval to after the actual extension
> is done, which seems superior, and adjusted the comments slightly.

Thanks.

Regards,
Dean