Thread: the big picture for index-only scans
So, what do we need in order to find our way to index-only scans?

1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache. However, before we can rely on the visibility map for this purpose, we need to fix the problems that can leave bits set inappropriately in the face of an inconveniently-timed crash. I've been working on a patch for this on and off for a few months now; my latest version is in need of review[1].

2. Crash-safe visibility map vs. pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is; (2B) keep a piece of state somewhere indicating, for each relation, whether or not the visibility map can be trusted, set it to false only if pg_upgrade brings the relation over from an older version, and set it to true after a successful vacuum that skips no intervening pages; or (2C) advise the user to do a VACUUM FULL on each of their tables pre-upgrade, and if they don't, treat wrong answers as their own fault. (I doubt anyone will advocate for this option, but for the sake of completeness...)
(2A) seems like the simplest solution, especially because it also avoids the overhead of checking the "is the visibility map bit reliable?" flag every time we want to plan a query.

3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set. I believe it should be fairly straightforward to have ANALYZE collect this information, and I'm inclined to do that as a separate patch. It seems like it would also be nice to know what fraction of tuples are on pages that don't have the visibility map bit set but where, in fact, all tuples on the page are visible to all transactions, so it would be legal to set the bit. A large discrepancy between these two percentages might be a good reason to auto-vacuum the table (perhaps using a "really lazy vacuum"[2]). I'm not sure whether this can be added cheaply, though, and in any case, any change to the set of criteria that will trigger an auto-vacuum is probably a can of worms. Thoughts?

4. Planner and executor changes. In contrast to Heikki's original implementation, I'm inclined not to try to split the Index Scan node into an index scan and a heap fetch. Since there are many choices for where to put the heap fetch node (any level of the join tree between the index scan and the root), this seems likely to result in a combinatorial explosion of paths[3], and I'm not really sure that the payback would be adequate. Furthermore, the idea of allowing user code to see tuples that will only later be determined not to have been visible to that MVCC snapshot in the first place seems pretty scary from a security perspective, though certainly there are possible benefits[4]. Instead, I'm inclined to just have the planner evaluate whether the necessary columns can be extracted purely from the index.
If not, we proceed as now. If so, we can use the "index only" approach of using the visibility map to decide which heap fetches can be skipped. It's not clear to me whether we need to compare the cost of the standard approach with the cost of the "index only" approach: in theory, if there aren't any visibility map bits set anyway, the "index only" approach could be slower. But I'm not sure whether that problem is significant or common enough to be worth expending a lot of code on. Either way, the number of actual paths doesn't need to increase, because in this design, even if we apply a costing model, one approach will dominate the other. Heikki also suggested considering index scans in cases where we don't now[4, again], but I'm inclined to leave this, too, for a later optimization, again because balancing the increase in paths against the possible performance benefits of using indexes in more situations seems finicky. In short, for a first cut at this, I just want to look at this as a way to get cheaper index scans, and leave everything else to future work.

Any thoughts welcome. Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://archives.postgresql.org/pgsql-hackers/2011-05/msg00292.php
[2] http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php
[3] http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php
[4] http://archives.postgresql.org/pgsql-hackers/2009-07/msg00675.php
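[Editorial sketch] The planner-side test described in point 4 — consider an index-only path only when every column the query needs can be extracted purely from the index — can be put in a few lines. This is an illustrative Python sketch, not PostgreSQL code; all names are made up for the example:

```python
# Hypothetical sketch of the planner decision in point 4.
# Names (can_use_index_only, choose_scan) are illustrative, not PostgreSQL's.

def can_use_index_only(query_columns, index_columns):
    """True iff every column the query needs is stored in the index."""
    return set(query_columns) <= set(index_columns)

def choose_scan(query_columns, index_columns):
    if not can_use_index_only(query_columns, index_columns):
        return "index scan"       # proceed as now: heap fetch for every tuple
    # No new path is needed: with the columns covered, the "index only"
    # approach dominates, using the VM to skip heap fetches where possible.
    return "index-only scan"
```

Under this design the path count stays the same, which is the point Robert makes above: one approach dominates the other, so the planner only switches strategies rather than generating extra paths.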
On Mon, May 9, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, what do we need in order to find our way to index-only scans?
>
> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache.

hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested in whether you think any hint bit I/O mitigation strategies are worth pursuing.

> 2. Crash safe visibility map vs. pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is;

+1 on 2A.

> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set.

It would be helpful to know the performance benefit of index-only scans before knowing how much benefit to attribute here. Maybe a system-wide kludge would work for starters anyway, like assuming 60% of pages can be visibility-checked from the VM, or a single GUC. Then again, maybe not.

merlin
On Mon, May 9, 2011 at 10:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache.
>
> hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested if you think any hint bit i/o mitigation strategies are worth pursuing.

Well, I don't really want to let this thread on my project get hijacked to talk about your project (not that I haven't been guilty of that myself!) but, in brief, I think the main effect of index-only scans is that the performance difference between a vacuumed table and an unvacuumed table is going to increase. It's already the case that sequentially scanning a table which has been vacuumed (and, therefore, all the pages are marked all-visible) is noticeably faster than sequentially scanning a table which is not vacuumed (even if all the hint bits are set). Index-only scans are going to extend that by making index scans run faster on a table with lots of all-visible pages than on one where no pages are all-visible. So the importance of vacuuming an insert-only table occasionally (which autovacuum won't do, at present, until it's needed to prevent XID wraparound) is already more than zero, and it's going to go up. But the all-visible bits aren't quite the same as hint bits: I don't think there's any impact on hint bits per se.

>> 2. Crash safe visibility map vs.
>> pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is;
>
> +1 on 2A.

OK. Anybody else?

>> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set.
>
> It would be helpful to know the performance benefit of index-only scans before knowing how much benefit to attribute here. Maybe a system-wide kludge would work for starters anyway, like assuming 60% of pages can be visibility-checked from the VM, or a single GUC. Then again, maybe not.

Yeah, maybe I should try to beat the main patch into some kind of shape before working too much on the statistics stuff. Then we could actually benchmark it a bit, which would be good. I don't think that a system-wide kludge or GUC is going to work for prime time, but it's probably fine for initial performance testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
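[Editorial sketch] To make the visibility-map fast path under discussion concrete: an index-only scan consults the VM bit for the tuple's page and skips the heap fetch when the bit says the whole page is all-visible. This is a toy Python illustration with made-up names; it is not how the PostgreSQL code is structured:

```python
# Toy model of the index-only-scan fast path. ALL_VISIBLE stands in for
# the visibility map; all function names here are illustrative assumptions.

ALL_VISIBLE = set()  # page numbers whose visibility-map bit is set

def vm_set(page_no):
    """Set the VM bit for a page (e.g. after vacuum finds it all-visible)."""
    ALL_VISIBLE.add(page_no)

def index_only_fetch(tid, index_tuple, heap_fetch, snapshot_check):
    """tid = (page_no, offset). If the page's VM bit is set, every tuple on
    it is visible to every transaction, so the index tuple alone answers
    the query and the heap fetch is skipped entirely."""
    page_no, _offset = tid
    if page_no in ALL_VISIBLE:
        return index_tuple                  # no heap I/O at all
    heap_tuple = heap_fetch(tid)            # fall back to the heap
    return heap_tuple if snapshot_check(heap_tuple) else None
```

The whole benefit of the feature lives in the first branch; the crash-safety work in point 1 exists so that branch can be trusted.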
On Tue, May 10, 2011 at 8:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 9, 2011 at 10:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> hm, what are the implications for tuple hint bits, short and long term? I'm particularly interested if you think any hint bit i/o mitigation strategies are worth pursuing.
>
> Well, I don't really want to let this thread on my project get hijacked to talk about your project (not that I haven't been guilty of that myself!)

No, that wasn't my intent at all, except in the sense of wondering if a crash-safe visibility map provides a route to displacing a lot of hint bit I/O and, by extension, making alternative approaches to doing that, including mine, a lot less useful. That's a good thing.

Meaning: since the VM approach is going to be a fairly large win over the classic approach to checking visibility in so many scenarios, maybe the real long-term goal should be just being as aggressive as possible in terms of making sure it's set properly, and just give up and be a bit more brute-forcey when it's not set. It's a fair question. That's a pretty broad statement, but that's what I'm thinking about.

merlin
2011/5/10 Robert Haas <robertmhaas@gmail.com>:
> So, what do we need in order to find our way to index-only scans?
>
> 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set. I believe it should be fairly straightforward to have ANALYZE collect this information; and I'm inclined to do that as a separate patch. It seems like it would also be nice to know what fraction of tuples are on pages that don't have the visibility map set but where, in fact, all tuples on the page are visible to all transactions, so it would be legal to set the bit. A large discrepancy between these two percentages might be a good reason to auto-vacuum the table (perhaps using a "really lazy vacuum"[2]). I'm not sure if this can be added cheaply, though, and in any case, any change to the set of criteria that will trigger an auto-vacuum is probably a can of worms. Thoughts?

ANALYZE can do the stats job for 'free' on the pages it collects anyway, so that looks like a good idea. I believe the really lazy vacuum is another topic; even if it will improve the performance of index-only scans to have tables already vacuumed, the stats should expose that, and the cost_index(_only?)() function should take care of it.

> 4. Planner and executor changes. In contrast to Heikki's original implementation, I'm inclined to not to try to split the Index Scan node into index scan and heap fetch. Since there are many choices for where to put the heap fetch node (any level of the join tree between the index scan and the root), this seems likely to result in a combinatorial explosion of paths[3], and I'm not real sure that the payback will be adequate.
> Furthermore, the idea of allowing user code to see tuples that will only later be determined not to have been visible to that MVCC snapshot in the first place seems pretty scary from a security perspective, though certainly there are possible benefits[4]. Instead, I'm inclined to just have the planner evaluate whether the necessary columns can be extracted purely from the index.

It is tempting to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (This is what I understood from Heikki's suggestions 3-4, and it makes sense.) It may be easier to implement both at once, but I didn't find the branch in Heikki's git repos (probably removed some time ago).

> If not, we proceed as now. If so, we can use the "index only" approach of using the visibility map to decide which heap fetches can be skipped. It's not clear to me whether we need to compare the cost of the standard approach with the cost of the "index only" approach: in theory, if there aren't any visibility map bits anyway, the "index only" approach could be slower. But I'm not sure whether that problem is significant or common enough to be worth expending a lot of code on. Either way, the number of actual paths doesn't need to increase, because in this design, even if we apply a costing model, one approach will dominate the other. Heikki also suggested considering index scans in cases where we don't now[4, again] but I'm inclined to leave this, too, for a later optimization, again because balancing the increase in paths against the possible performance benefits of using indexes in more situations seems finicky. In short, for a first cut at this, I just want to look at this as a way to get cheaper index scans, and leave everything else to future work.
Based on ANALYZE stats for the visibility, I believe cost_index and cost_index_only should be very similar functions (well, atm, I don't see the point of splitting it into two functions).

> Any thoughts welcome. Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.

Index-only scans are welcome! I believe I can help on 3 and 4, but (really) not sure for 1 and 2.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
On Tue, May 10, 2011 at 10:58 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> ANALYZE can do the stats job for 'free' on the pages it collects anyway. So that looks like a good idea. I believe the really lazy vacuum is another topic; even if it will improve the performance of the index only scan to have tables already vacuumed, the stats should expose that and the function cost_index(_only?)() taking care of that.

I basically agree. The connection is that, as we use the all-visible bits for more things, the performance penalty for failing to vacuum (say) an insert-only table will continue to grow. Still, as you say, clearly a separate topic.

> The temptation is high to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (this is what I understood from heikki suggestion 3-4. and it makes sense). It may be easier to implement both at once but I didn't find the branch in the Heikki's git repos. (probably removed since the long time)

I was thinking about this as well, at least if I understand you correctly. That would be similar to a bitmap index scan, and I think it would be a great thing to have, not only because it would allow us to get the advantages of index-only scans in situations that are well-suited to our current bitmap scans, but also because it could be batched. You could allocate a buffer of work_mem bytes and fill it up with TIDs; then, when it's full, you sort the buffer and start doing the necessary heap fetches in physical order. If you still need more rows, you can clear the buffer and go around for another pass.

> Based on ANALYZE stats for the visibility, I believe cost_index and cost_index_only should be very similar functions (well, atm, I don't see the point to split it in 2 functions).

Yeah, I would more imagine modifying the existing function.

>> Any thoughts welcome.
>> Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it.
>
> Indexonly scans are welcome!
> I believe I can help on 3 and 4, but (really) not sure for 1 and 2.

Well, I have code for #1, and just need reviews, and #2 shouldn't be that hard, and with luck I'll twist Bruce's arm into doing it (*waves to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts on what/how you'd like to contribute there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
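[Editorial sketch] The batched-fetch idea Robert describes upthread — fill a work_mem-sized buffer with TIDs from the index, sort it, do the heap fetches in physical order, then go around for another pass — can be sketched as follows. This is an illustrative Python sketch; the names and the tiny buffer size are assumptions for the example:

```python
# Sketch of batched, physically-ordered heap fetches for an index scan.
# A TID is modeled as (page_no, offset); sorting TIDs therefore sorts
# fetches into physical page order, as with a bitmap heap scan.

def batched_heap_fetch(tid_iter, fetch_tuple, buffer_size=4):
    """Drain tid_iter, buffering at most buffer_size TIDs (the stand-in
    for work_mem); each time the buffer fills, sort it and fetch in
    physical order, then clear it and continue with the next pass."""
    buf, out = [], []

    def flush():
        for tid in sorted(buf):          # physical (page, offset) order
            out.append(fetch_tuple(tid))
        buf.clear()

    for tid in tid_iter:
        buf.append(tid)
        if len(buf) >= buffer_size:      # "work_mem" buffer is full
            flush()
    flush()                              # final, partial pass
    return out
```

One consequence visible even in the toy version: results come out in per-batch physical order, not index order, which is exactly the trade-off a bitmap-style scan makes against an ordinary index scan.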
2011/5/10 Robert Haas <robertmhaas@gmail.com>:
> On Tue, May 10, 2011 at 10:58 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> [...]
> Well, I have code for #1, and just need reviews, and #2 shouldn't be that hard, and with luck I'll twist Bruce's arm into doing it (*waves to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts on what/how you'd like to contribute there?

I can provide initial patches for cost and analyze, at least.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
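[Editorial sketch] The two statistics point 3 asks for — the fraction of tuples on pages whose VM bit is set, and the fraction on pages where the bit is clear but could legally be set — might be derived from an ANALYZE-style page sample roughly like this. Hypothetical sketch only; ANALYZE does not actually collect this today, and the page representation is invented for the example:

```python
# Each sampled page is modeled as a tuple:
#   (vm_bit_set, n_tuples, all_tuples_actually_visible)
# The third field is the ground truth a sampling pass could check cheaply
# while it already has the page in hand.

def visibility_fractions(sampled_pages):
    """Return (fraction of tuples on VM-set pages,
               fraction on pages where the bit is clear but settable)."""
    total = vm_hit = settable = 0
    for vm_bit, ntup, actually_all_visible in sampled_pages:
        total += ntup
        if vm_bit:
            vm_hit += ntup
        elif actually_all_visible:
            settable += ntup
    if total == 0:
        return 0.0, 0.0
    return vm_hit / total, settable / total
```

The first number feeds the cost model; a large second number is the "discrepancy" Robert mentions, i.e. a sign the table would repay a (really lazy) vacuum.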
On Tue, May 10, 2011 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, what do we need in order to find our way to index-only scans?
>
> 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map.

This topic has been discussed many times, yet I have never seen an assessment that explains WHY we would want to do index-only scans.

This will be a complex addition to the codebase and one that could introduce bugs into MVCC. It seems reasonable to look at what the benefit of this would be, and what the use-case/benefit profile is, before we spend a long time adding this optimization. I asked for this previously on earlier threads also.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> The temptation is high to estimate the cost of an "index_scan(only) + ordered(by ctid) table pages fetch if heap required". (this is what I understood from heikki suggestion 3-4. and it makes sense). It may be easier to implement both at once but I didn't find the branch in the Heikki's git repos. (probably removed since the long time)
>
> I was thinking about this as well, at least if I understand you

yes.

> correctly. That would be similar to a bitmap index scan, and I think it would be a great thing to have, not only because it would allow us to get the advantages of index-only scans in situations that are well-suited to our current bitmap scans, but also because it could be batched. You could allocate a buffer of work_mem bytes and fill it up with TIDs; then, when it's full, you sort the buffer and start doing the necessary heap fetches in physical order. If you still need more rows, you can clear the buffer and go around for another pass.

The remaining issue here is that we don't have a 'safe' index-only scan, just an index scan with a probability on the 'only'.

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/
PostgreSQL : Expertise, Formation et Support
Simon Riggs <simon@2ndQuadrant.com> wrote:
> This topic has been discussed many times, yet I have never seen an assessment that explains WHY we would want to do index-only scans.

In databases with this feature, it's not too unusual for a query which uses just an index to run one or more orders of magnitude faster than a query which has to randomly access the heap for each index entry. That seems like enough evidence of its possible value in PostgreSQL to proceed to the point where benchmarks become possible. I'm assuming that, like all other features added as performance optimizations, it won't be committed until there are benchmarks showing the net benefit.

As a thought experiment, picture the relative costs of scanning a portion of an index in index sequence, and being done, versus scanning a portion of an index in index sequence and jumping to a random heap access for each index entry as you go.

-Kevin
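[Editorial sketch] Kevin's thought experiment can be put into back-of-envelope numbers. The cost units below are assumptions loosely inspired by PostgreSQL's random_page_cost, not measurements, and the table sizes are invented for illustration:

```python
# Back-of-envelope model of the thought experiment: an index scan that is
# "done" after reading the index, versus one that adds a random heap
# fetch per matching index entry. Cost units are assumed, not measured.

def plain_index_scan_cost(index_pages, matching_tuples, random_page_cost=4.0):
    # index pages read in sequence, plus one random heap fetch per match
    return index_pages + matching_tuples * random_page_cost

def index_only_scan_cost(index_pages, matching_tuples,
                         vm_bit_fraction, random_page_cost=4.0):
    # heap fetches remain only for tuples on pages whose VM bit is clear
    heap_fetches = matching_tuples * (1.0 - vm_bit_fraction)
    return index_pages + heap_fetches * random_page_cost
```

With, say, 100 index pages, 20,000 matching tuples, and 75% of heap pages all-visible, the estimate drops from 80,100 to 20,100 cost units, and it keeps falling as the all-visible fraction rises — which is the intuition behind both the "orders of magnitude" claim and the need for the visibility statistics in point 3.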
On Tue, May 10, 2011 at 11:27 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> I can provide initial patchs for cost and analyze, at least.

OK, cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> This topic has been discussed many times, yet I have never seen an >> assessment that explains WHY we would want to do index-only scans. > In databases with this feature, it's not too unusual for a query > which uses just an index to run one or more orders of magnitude > faster than a query which has to randomly access the heap for each > index entry. That seems like enough evidence of its possible value > in PostgreSQL to proceed to the point where benchmarks become > possible. I'm assuming that, like all other features added as > performance optimizations, it won't be committed until there are > benchmarks showing the net benefit. > As a thought experiment, picture the relative costs of scanning a > portion of an index in index sequence, and being done, versus > scanning a portion of an index in index sequence and jumping to a > random heap access for each index entry as you go. It's already the case that we'll flip over to a bitmap indexscan, and thus get rid of most/all of the "random" page accesses, in situations where this is likely to be a big win. Pointing to the performance difference in databases that don't do that is therefore not too convincing. I'm inclined to agree that index-only scans might be worth the amount of work that's involved ... but I share Simon's desire to see some proof before anything gets committed. regards, tom lane
On Tue, May 10, 2011 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's already the case that we'll flip over to a bitmap indexscan, and thus get rid of most/all of the "random" page accesses, in situations where this is likely to be a big win. Pointing to the performance difference in databases that don't do that is therefore not too convincing.
>
> I'm inclined to agree that index-only scans might be worth the amount of work that's involved ... but I share Simon's desire to see some proof before anything gets committed.

Well, we're not in the habit of committing performance patches without performance numbers, so I doubt we'll reverse that trend now, and certainly I had no intention of doing so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: > >> This topic has been discussed many times, yet I have never seen an >> assessment that explains WHY we would want to do index-only scans. > > In databases with this feature, it's not too unusual for a query > which uses just an index to run one or more orders of magnitude > faster than a query which has to randomly access the heap for each > index entry. That seems like enough evidence of its possible value > in PostgreSQL to proceed to the point where benchmarks become > possible. I'm assuming that, like all other features added as > performance optimizations, it won't be committed until there are > benchmarks showing the net benefit. > > As a thought experiment, picture the relative costs of scanning a > portion of an index in index sequence, and being done, versus > scanning a portion of an index in index sequence and jumping to a > random heap access for each index entry as you go. I can picture that. Regrettably, I can also picture the accesses to the visibility map, the maintenance operations on the VM that are needed for this and the contention that both of those will cause. ISTM quite likely that we'll slow down writes to some extent in order to improve this use case. So I'm interested in knowing how broad the use case is and what the overheads are, rather than have an "aw crap!" moment in the future where we finish the code and only then realise its benefit footprint is not useful. Best to start out with a clear benefit analysis other than "other DBMS do it". For example, will this be an index-specific tuning option (manual/automatic), per table or an always-on feature? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Simon Riggs <simon@2ndQuadrant.com> wrote: >>> This topic has been discussed many times, yet I have never seen >>> an assessment that explains WHY we would want to do index-only >>> scans. > >> In databases with this feature, it's not too unusual for a query >> which uses just an index to run one or more orders of magnitude >> faster than a query which has to randomly access the heap for >> each index entry. That seems like enough evidence of its >> possible value in PostgreSQL to proceed to the point where >> benchmarks become possible. I'm assuming that, like all other >> features added as performance optimizations, it won't be >> committed until there are benchmarks showing the net benefit. > >> As a thought experiment, picture the relative costs of scanning a >> portion of an index in index sequence, and being done, versus >> scanning a portion of an index in index sequence and jumping to a >> random heap access for each index entry as you go. > > It's already the case that we'll flip over to a bitmap indexscan, > and thus get rid of most/all of the "random" page accesses, in > situations where this is likely to be a big win. Pointing to the > performance difference in databases that don't do that is > therefore not too convincing. Sure. Of course, if you're only accessing twenty thousand rows from a table containing fifty million rows, bitmap index scans could come out pretty close to random access times; but on the whole I agree that the scale of benefit in PostgreSQL won't tend to match what people see in other products. Note that my words were "enough evidence of its possible value in PostgreSQL to proceed to the point where benchmarks become possible." 
In particular, we might want to somehow try to make clear to people that the very wide indexes they are accustomed to creating to allow this optimization in other products might be inefficient compared to creating several one-column indexes which would enable bitmap logical operations. > I'm inclined to agree that index-only scans might be worth the > amount of work that's involved So we agree there. > ... but I share Simon's desire to see some proof before anything > gets committed. And we agree there. In fact, I can't think of anyone in the community who doesn't want to see that for *any* purported performance enhancement. My overall gut feel is that there will be some circumstances where the "covering index" optimization is much faster, and some where people expect it to be, but it isn't. The trickiest part of this might be developing a costing model which allows us to make the right choice most of the time. And even if we get it perfect, we can expect questions about why the covering index wasn't used, and requests for hints so they can force it. :-( -Kevin
On Tue, May 10, 2011 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's already the case that we'll flip over to a bitmap indexscan, > and thus get rid of most/all of the "random" page accesses, in > situations where this is likely to be a big win. Pointing to the > performance difference in databases that don't do that is therefore > not too convincing. The other major effect is row size. Many databases have very wide rows, perhaps on the order of 1kB. So the table with a million rows might be 8GB but the index on a few key columns might only be a few megabytes. Even if you have to read the entire index in random order it'll likely all be cached and scan faster than the table itself. One problem with hanging on benchmarks is that database schema design can actually change based on what performs well. People get in the habit of creating indexes in Oracle that are only logical when you realize they allow the database to do an index-only scan because they contain extra columns that aren't actually used in where clauses but are typically in the select list. -- greg
On Tue, May 10, 2011 at 6:25 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> ... but I share Simon's desire to see some proof before anything >> gets committed. > > And we agree there. In fact, I can't think of anyone in the > community who doesn't want to see that for *any* purported > performance enhancement. I'm not talking about eventual commit, I'm talking about the whole process of development. We should be focusing on improving a measurable performance issue, not on implementing one exact design that someone thought might help. How will we review the patch except by measuring it against the declared performance goal? Otherwise all the various options along the way will just be matters of opinion, instead of measurement. From what has been said so far, the use case for this is related to the practice of using "covered indexes", which makes me nervous because that is an expert-level tuning task on other DBMS, limiting the range of people who get benefit. The typical speed up for non-covered indexes will come when we access a very large table (not in cache) via an index scan that is smaller than a bitmap index scan. Will we be able to gauge selectivities sufficiently accurately to be able to pinpoint that during optimization? How will we know that the table is not in cache? Or is this an optimisation in the executor for a bitmap heap scan? I'm not being negative, I'm trying to avoid arguments, blind alleys and much wasted development if we focus on the wrong things or go to design too early. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > >>> ... but I share Simon's desire to see some proof before anything >>> gets committed. >> >> And we agree there. In fact, I can't think of anyone in the >> community who doesn't want to see that for *any* purported >> performance enhancement. > > I'm not talking about eventual commit, I'm talking about the whole > process of development. I'm confused -- you want to see proof that the concept works well in PostgreSQL before development effort on it begins? Or there is some alternative you would like to see pursued instead? Something else? > From what has been said so far, the use case for this is related > to the practice of using "covered indexes", which makes me nervous > because that is an expert-level tuning task on other DBMS What? On the versions of MS SQL Server and Sybase ASE I've used it costs covered index plans against all the other plans automatically, and picks this type of plan if the cost looks lower. Sure, DBAs sometimes add indexes, or add columns to indexes, in hopes that such a plan will be chosen -- but what's new and different there? > The typical speed up for non-covered indexes will come when we > access a very large table (not in cache) via an index scan that is > smaller than a bitmap index scan. Will we be able to gauge > selectivities sufficiently accurately to be able to pinpoint that > during optimization? How will we know that the table is not in > cache? Or is this an optimisation in the executor for a bitmap heap > scan? I would continue to object to using current cache contents for plan choice because of plan instability and the fact that an odd initial cache load could skew plans in a bad direction indefinitely. I do agree (and have already posted) that I think the hardest part of this might be developing a good cost model.
I doubt that's an insoluble problem, especially since it is something we can refine over time as we gain experience with the edge cases. -Kevin
On Tue, May 10, 2011 at 8:35 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> >>>> ... but I share Simon's desire to see some proof before anything >>>> gets committed. >>> >>> And we agree there. In fact, I can't think of anyone in the >>> community who doesn't want to see that for *any* purported >>> performance enhancement. >> >> I'm not talking about eventual commit, I'm talking about the whole >> process of development. > > I'm confused -- you want to see proof that the concept works well in > PostgreSQL before development effort on it begins? Or there is some > alternative you would like to see pursued instead? Something else? Well, I didn't ask for that and agree it would be foolish to demand proof ahead of development. I know this technique is effective in other DBMS, I just want to be certain it will be effective for us before too much work is done. We have the additional requirement for a crash-safe visibility map that needs to be consulted, with possible contention effects. Sybase etc. can simply avoid the logical I/O, which is always a win, in or out of cache. So the problem is quite different for us. What I suggested was an assessment and benefit case because we normally start with a problem and then work out how to solve it. Normally, others come forward with the why? when? questions and it feels like there's a bit of groupthink going on here. This looks to me like its being approached like it was a feature, but it looks to me like a possible optimisation, so suggest we treat it that way. Out of concern, I don't want you to waste time on work that *may* not be that useful in practice, and I don't want to miss improvements or alternatives either. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > Normally, others come forward with the why? when? questions and it > feels like there's a bit of groupthink going on here. This looks > to me like its being approached like it was a feature, but it > looks to me like a possible optimisation, so suggest we treat it > that way. This issue has come up a great many times over the years, and there has been much discussion around it. The Wiki page is here: http://wiki.postgresql.org/wiki/Index-only_scans This currently references 11 threads on the topic. I'd bet that by spending a couple hours at it I could quadruple that number of threads. (I'd really rather not, though.) The problem is that there are regular and fairly frequent complaints on the list about queries which run slower than people expect because the heap must be checked for visibility information when matching index entries are found. It has become enough of a conspicuous issue that a lot of people are interested in seeing if something can be done about it. After much discussion, people are trying to advance a plan to find an answer. I'm sure nobody involved would ignore any suggestion on how it might be done better; but at this point, I don't think it's fair to suggest that this is not being pursued in response to a real problem, or that no serious thought has been given to direction before people started moving. -Kevin
On Wed, May 11, 2011 at 12:14 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > The problem is that there are regular and fairly frequent complaints > on the list about queries which run slower than people expect > To be fair about 3/4 of them were actually complaining about the lack of some global materialized cache of the aggregate value. Covering index-only scans are only going to be a linear speedup; no matter how large the factor, it's not going to turn select count(*) into an O(1) operation. I support the idea of thinking of this as an optimization. But I don't think there's much question. If we can avoid doing the i/o on the heap that's an obvious and huge win. Sure the costs of maintaining the vm need to be measured against the gains but we don't know what those costs are yet and whoever works on it will be well aware of that balance. On a separate note though, Simon, I don't know what you mean by "we normally start with a problem". It's a free software project and people are free to work on whatever interests them whether that's because it solves a problem they have, helps a client who's paying them, or just because it's of academic interest to them. We don't always take their patches if they aren't of general interest but people propose all kinds of crazy experimental ideas all the time. -- greg
Robert Haas wrote: > >> Any thoughts welcome. Incidentally, if anyone else feels like working > >> on this, feel free to let me know and I'm happy to step away, from all > >> of it or from whatever part someone else wants to tackle. I'm mostly > >> working on this because it's something that I think we really need to > >> get done, more than having a burning desire to be the one who does it. > > > > Index-only scans are welcome! > > I believe I can help on 3 and 4, but (really) not sure for 1 and 2. > > Well, I have code for #1, and just need reviews, and #2 shouldn't be > that hard, and with luck I'll twist Bruce's arm into doing it (*waves > to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts > on what/how you'd like to contribute there? I am happy to have pg_upgrade skip upgrading visibility map files --- it already has code to conditionally process them because they only exist in >= 8.4: /* fsm/vm files added in PG 8.4 */ if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) { /* * Copy/link any fsm and vm files, if they exist */ Just give the word and it will be done. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark wrote: > On a separate note though, Simon, I don't know what you mean by "we > normally start with a problem". It's a free software project and > people are free to work on whatever interests them whether that's > because it solves a problem they have, helps a client who's paying > them, or just because it's of academic interest to them. We don't > always take their patches if they aren't of general interest but > people propose all kinds of crazy experimental ideas all the time. I am confused by Simon's questions too. Simon seems to regularly argue for adding features late in the development cycle and backpatch things no one else thinks should be backpatched, but he wants more research that index-only scans are going to improve things before it is implemented? The first is aggressive development, the second is very conservative development --- they don't match, so I now wonder what the motivation is since it isn't consistent. Isn't speeding up COUNT(*) a sufficient case because it will not have to touch the heap in many cases? No one is going to apply this patch until we fully understand the performance implications, just like every other patch. No one has suggested otherwise. It is helpful to have people critically review all our work, but disagreeing just for the sake of causing discussion isn't helpful, and I have seen a lot of these discussions lately. I am sensing a pattern. :-( -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > Isn't speeding up COUNT(*) a sufficient case because it will not have to > touch the heap in many cases? Putting aside the politics questions, count(*) is an interesting case -- it exposes some of the unanswered questions about index-only scans. The reason "select count(*)" might win would be because we could pick any index and do an index scan, relying on the visibility map to optimize away the heap reads. This is only going to be a win if a large fraction of the heap reads get optimized away. It's going to be pretty tricky to determine in the optimizer a) which index will be cheapest and b) what fraction of index tuples will point to pages where the heap reference can be optimized away. The penalty for guessing wrong if we use an index-only scan and it turns out to have many pages that aren't all-visible would be pretty high. -- greg
Greg Stark wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Isn't speeding up COUNT(*) a sufficient case because it will not have to > > touch the heap in many cases? > > Putting aside the politics questions, count(*) is an interesting case > -- it exposes some of the unanswered questions about index-only scans. > > The reason "select count(*)" might win would be because we could pick > any index and do an index scan, relying on the visibility map to > optimize away the heap reads. This is only going to be a win if a > large fraction of the heap reads get optimized away. > > It's going to be pretty tricky to determine in the optimizer a) which > index will be cheapest and b) what fraction of index tuples will point > to pages where the heap reference can be optimized away. The penalty > for guessing wrong if we use an index-only scan and it turns out to > have many pages that aren't all-visible would be pretty high. Yes, that is the tricky optimizer/analyze part. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > So, what do we need in order to find our way to index-only scans? > > 1. The visibility map needs to be crash-safe. The basic idea of > index-only scans is that, instead of checking the heap to find out > whether each tuple is visible, we first check the visibility map. If > the visibility map bit is set, then we know all tuples on the page are > visible to all transactions, and therefore the tuple of interest is > visible to our transaction. Assuming that a significant number of > visibility map bits are set, this should enable us to avoid a fair > amount of I/O, especially on large tables, because the visibility map > is roughly 8000 times smaller than the heap, and therefore far more > practical to keep in cache. However, before we can rely on the FYI, because the visibility map is only one _bit_ per page, it is 8000 * 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of heap pages. This is important because we rely on this compactness in hope that the WAL logging of this information will not be burdensome. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Stark wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Isn't speeding up COUNT(*) a sufficient case because it will not have to > > touch the heap in many cases? > > Putting aside the politics questions, count(*) is an interesting case > -- it exposes some of the unanswered questions about index-only scans. > > The reason "select count(*)" might win would be because we could pick > any index and do an index scan, relying on the visibility map to > optimize away the heap reads. This is only going to be a win if a > large fraction of the heap reads get optimized away. > > It's going to be pretty tricky to determine in the optimizer a) which > index will be cheapest and b) what fraction of index tuples will point I assume the smallest non-partial index would be the cheapest index. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > Greg Stark wrote: >> Putting aside the politics questions, count(*) is an interesting case >> -- it exposes some of the unanswered questions about index-only scans. >> >> The reason "select count(*)" might win would be because we could pick >> any index and do an index scan, relying on the visibility map to >> optimize away the heap reads. This is only going to be a win if a >> large fraction of the heap reads get optimized away. >> >> It's going to be pretty tricky to determine in the optimizer a) which >> index will be cheapest and b) what fraction of index tuples will point > I assume the smallest non-partial index would be the cheapest index. That will be true only if you intentionally ignore the points Greg raised. If the table isn't entirely ALL_VISIBLE, then the choice of index will determine the ordering of the actual table probes that occur. There could be more or fewer page reads, in a more or less optimal order, depending on the index used. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Greg Stark wrote: > >> Putting aside the politics questions, count(*) is an interesting case > >> -- it exposes some of the unanswered questions about index-only scans. > >> > >> The reason "select count(*)" might win would be because we could pick > >> any index and do an index scan, relying on the visibility map to > >> optimize away the heap reads. This is only going to be a win if a > >> large fraction of the heap reads get optimized away. > >> > >> It's going to be pretty tricky to determine in the optimizer a) which > >> index will be cheapest and b) what fraction of index tuples will point > > > I assume the smallest non-partial index would be the cheapest index. > > That will be true only if you intentionally ignore the points Greg > raised. If the table isn't entirely ALL_VISIBLE, then the choice of > index will determine the ordering of the actual table probes that occur. > There could be more or fewer page reads, in a more or less optimal > order, depending on the index used. OK, would the clustering analyze stats (pg_stats.correlation) tell us that? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 2011-05-11 01:54, Greg Stark wrote: > To be fair about 3/4 of them were actually complaining about the lack of some global materialized cache of the aggregate value. Covering index-only scans are only going to be a linear speedup; no matter how large the factor, it's not going to turn select count(*) into an O(1) operation. Actually, if we could get count(*) into the situation of a very "thin row" today, so the cost of visibility-testing didn't depend hugely on the width of the row any more, then we'd be halfway there in terms of performance optimization. If rows typically were just tuple headers + a bit more, then it would be way harder to go down this road and claim good benefits. But currently the system needs to drag in almost one page per visibility test from disk on "random large tables". I tried to graph the differences of thin vs. wide rows here: http://shrek.krogh.cc/~jesper/visibillitytesting.pdf http://postgresql.1045698.n5.nabble.com/Visibillity-testing-some-numbers-on-current-performance-td4284836.html Basing visibility testing on the visibility map shifts it, on large tables, from being disk-based to memory-based. Jesper (It's not a goal, but it would most likely postpone some people's need to buy a FusionIO card or similar) -- Jesper
On Wed, May 11, 2011 at 12:54 AM, Greg Stark <gsstark@mit.edu> wrote: > On a separate note though, Simon, I don't know what you mean by "we > normally start with a problem". It's a free software project and > people are free to work on whatever interests them whether that's > because it solves a problem they have, helps a client who's paying > them, or just because it's of academic interest to them. We don't > always take their patches if they aren't of general interest but > people propose all kinds of crazy experimental ideas all the time. Completely agree, but why are you saying that to me? When Tom asks me why I suggest something, nobody tells him "it's a free software project etc...". What is the difference here? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > Greg Stark wrote: >> On a separate note though, Simon, I don't know what you mean by "we >> normally start with a problem". It's a free software project and >> people are free to work on whatever interests them whether that's >> because it solves a problem they have, helps a client who's paying >> them, or just because it's of academic interest to them. We don't >> always take their patches if they aren't of general interest but >> people propose all kinds of crazy experimental ideas all the time. > > I am confused by Simon's questions too. > > Simon seems to regularly argue for adding features late in the > development cycle and backpatch things no one else thinks should be > backpatched, but he wants more research that index-only scans are going > to improve things before it is implemented? The first is aggressive > development, the second is very conservative development --- they don't > match, so I now wonder what the motivation is since it isn't consistent. Not really sure why reasonable technical skepticism should become personal commentary. You don't question Tom's motives if he is skeptical of an idea of mine. Why would you question my motivation? What is *your* motive for acting like that? I'm not driven by one setting of "conservatism", but I am interested in adding fully usable features that bring credit to the project. If I see a feature that can have minor things added to it to improve it, then I raise that during beta. If I see things being worked out that sound dubious, I mention that in early development. I don't think this work will materially improve the speed of count(*) in the majority of cases. This introduces extra overhead into the code path and that can be a net loss. The only time it will help is when you have a large table that is not cached and also not recently updated. Is count(*) run very often against such tables?
Do we really care enough to optimise that use case with lots of special purpose code? The very fact that Kevin and yourself bring up different reasons for why we need this feature makes me nervous. The analysis has not been done yet, and all I have done is request that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 11, 2011 at 2:34 AM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> So, what do we need in order to find our way to index-only scans? >> >> 1. The visibility map needs to be crash-safe. The basic idea of >> index-only scans is that, instead of checking the heap to find out >> whether each tuple is visible, we first check the visibility map. If >> the visibility map bit is set, then we know all tuples on the page are >> visible to all transactions, and therefore the tuple of interest is >> visible to our transaction. Assuming that a significant number of >> visibility map bits are set, this should enable us to avoid a fair >> amount of I/O, especially on large tables, because the visibility map >> is roughly 8000 times smaller than the heap, and therefore far more >> practical to keep in cache. However, before we can rely on the > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. This is important because we rely on this compactness in > hope that the WAL logging of this information will not be burdensome. We would need to issue one WAL record per bit, not per page. I'm concerned about the path length added by VM visits and the potential contention that concentration of information will bring. Those aren't things to be dismissed without calculation and analysis. There might not be an issue there, but its worth checking. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 10.05.2011 20:15, Simon Riggs wrote: > On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Simon Riggs<simon@2ndQuadrant.com> wrote: >> >>> This topic has been discussed many times, yet I have never seen an >>> assessment that explains WHY we would want to do index-only scans. >> >> In databases with this feature, it's not too unusual for a query >> which uses just an index to run one or more orders of magnitude >> faster than a query which has to randomly access the heap for each >> index entry. That seems like enough evidence of its possible value >> in PostgreSQL to proceed to the point where benchmarks become >> possible. I'm assuming that, like all other features added as >> performance optimizations, it won't be committed until there are >> benchmarks showing the net benefit. >> >> As a thought experiment, picture the relative costs of scanning a >> portion of an index in index sequence, and being done, versus >> scanning a portion of an index in index sequence and jumping to a >> random heap access for each index entry as you go. > > I can picture that. Regrettably, I can also picture the accesses to > the visibility map, the maintenance operations on the VM that are > needed for this and the contention that both of those will cause. Note that we already have the visibility map, and the accesses needed to update it are already there. Granted, we'll have to change the logic slightly to make it crash safe, but I don't expect that to add any meaningful overhead - the changes are going to be where the bits are set, ie. vacuum, not when the bits are cleared. Granted, we might also want to set the bits more aggressively once they're used by index-only-scans. But done correctly, just taking advantage of the VM that's already there shouldn't add overhead to other operations. I agree that we need to do tests to demonstrate that there's a gain from the patch, once we have a patch to test. 
I would be very surprised if there isn't, but that just means the testing is going to be easy. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2011/5/11, Bruce Momjian <bruce@momjian.us>: > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. Actually, that would be "one 8kB block covers 512MB of heap": 1 block of visibility map (8kB) = 64k visibility bits = covers 64k blocks = covers 512MB of heap. The cost of keeping the visibility map in cache is therefore totally negligible; only the cost of WAL logging changes to it is of interest. > This is important because we rely on this compactness in hope that > the WAL logging of this information will not be burdensome. The size of one entry in the map (1 bit) is not closely related to the WAL overhead required per change of such a bit (i.e., the log record for a 1-bit change will certainly be far more than 1 bit). Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad?
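As a sanity check of the arithmetic above (a sketch assuming the standard 8kB PostgreSQL block size and one visibility bit per heap page, as described in the message):

```python
# Sanity check of the visibility map coverage arithmetic.
# Assumes the standard 8kB block size and one VM bit per heap page.
BLOCK_SIZE = 8 * 1024  # bytes per block

bits_per_vm_block = BLOCK_SIZE * 8                 # one bit covers one heap page
heap_bytes_covered = bits_per_vm_block * BLOCK_SIZE

print(bits_per_vm_block)                  # 65536, i.e. 64k heap pages
print(heap_bytes_covered // (1024 ** 2))  # 512, i.e. 512MB of heap per 8kB VM block
```

So the map is 64k times smaller than the heap it describes, which is why keeping it cached is cheap; the per-bit WAL logging cost is a separate question, as the message notes.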
2011/5/11 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > On 10.05.2011 20:15, Simon Riggs wrote: >> >> On Tue, May 10, 2011 at 5:17 PM, Kevin Grittner >> <Kevin.Grittner@wicourts.gov> wrote: >>> >>> Simon Riggs<simon@2ndQuadrant.com> wrote: >>> >>>> This topic has been discussed many times, yet I have never seen an >>>> assessment that explains WHY we would want to do index-only scans. >>> >>> In databases with this feature, it's not too unusual for a query >>> which uses just an index to run one or more orders of magnitude >>> faster than a query which has to randomly access the heap for each >>> index entry. That seems like enough evidence of its possible value >>> in PostgreSQL to proceed to the point where benchmarks become >>> possible. I'm assuming that, like all other features added as >>> performance optimizations, it won't be committed until there are >>> benchmarks showing the net benefit. >>> >>> As a thought experiment, picture the relative costs of scanning a >>> portion of an index in index sequence, and being done, versus >>> scanning a portion of an index in index sequence and jumping to a >>> random heap access for each index entry as you go. >> >> I can picture that. Regrettably, I can also picture the accesses to >> the visibility map, the maintenance operations on the VM that are >> needed for this and the contention that both of those will cause. > > Note that we already have the visibility map, and the accesses needed to > update it are already there. Granted, we'll have to change the logic > slightly to make it crash safe, but I don't expect that to add any > meaningful overhead - the changes are going to be where the bits are set, > ie. vacuum, not when the bits are cleared. Granted, we might also want to > set the bits more aggressively once they're used by index-only-scans. But > done correctly, just taking advantage of the VM that's already there > shouldn't add overhead to other operations. 
We won't be able to do a pure index-only scan; we can do an index scan with some probability of not accessing the heap, perhaps (hopefully) avoiding it completely in some cases. If the visibility map is enough to remove the need to access the heap (performance-wise and safely), then, for the cost part: currently, cost_index charges a random_page_cost for each heap page accessed. I believe we should be able to change that to subtract the estimated number of heap pages we don't need to access (which could be 100% of them, or 0.1%). And, as Simon suggested, there may also be a path to improving the bitmap heap scan: bitmap heap scan already has some workarounds to make sure an index scan looks cheaper in some cases; just keep those and apply the same logic as for cost_index. This keeps the same rule PostgreSQL has: let the planner decide the best solution instead of allowing special index declarations (that hasn't been proposed yet, I think, but, well, just in case it pops into someone's mind). > > I agree that we need to do tests to demonstrate that there's a gain from the > patch, once we have a patch to test. I would be very surprised if there > isn't, but that just means the testing is going to be easy. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
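The cost adjustment sketched in the message might look roughly like this. This is an illustration, not the actual cost_index code; the function shape, the independence of the all-visible fraction, and the default random_page_cost value are assumptions made for the sketch:

```python
def heap_fetch_cost(heap_pages, all_visible_frac, random_page_cost=4.0):
    """Illustrative only: charge random_page_cost per heap page, as the
    current model does, but discount the pages whose visibility map bit
    would let an index-only scan skip the heap fetch entirely."""
    pages_still_visited = heap_pages * (1.0 - all_visible_frac)
    return pages_still_visited * random_page_cost

# Mostly-static table: much of the heap component disappears.
print(heap_fetch_cost(1000, 0.75))  # 1000.0
# Heavily-updated table (no bits set): same estimate as today.
print(heap_fetch_cost(1000, 0.0))   # 4000.0
```

The appeal of this shape is that it degrades gracefully: with no all-visible pages the estimate matches the current one, so the planner only prefers the index path when the map actually promises savings.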
2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> The typical speed up for non-covered indexes will come when we >> access a very large table (not in cache) via an index scan that is >> smaller than a bitmapindex scan. Will we be able to gauge >> selectivities sufficiently accurately to be able to pinpoint that >> during optimization? How will we know that the table is not in >> cache? Or is this an optimisation in the executor for a bitmapheap >> scan? > > I would continue to object to using current cache contents for plan > choice because of plan instability and the fact that an odd initial > cache load could skew plans in a bad direction indefinitely. I do > agree (and have already posted) that I think the hardest part of > this might be developing a good cost model. I doubt that's an > insoluble problem, especially since it is something we can refine > over time as we gain experience with the edge cases. You will have the same possible instability in planning with the index(-only?) scan, because we may need to access the heap anyway and that need is based on estimation; or am I missing something? I understood the idea was just to bypass the heap access *if* we can for *this* heap page. In reality, I am not really scared by plan instability due to a possible PG/OS cache estimation; the percentages remain stable in my observations, though I don't know yet how it will go for the visibility map. And we already have plan instability currently, which is *good*: at some point a seq scan is better than a bitmap heap scan, because the relation size changes and because ANALYZE re-estimates the distribution of the data. I would be very happy to issue ANALYZE CACHE, as I have to ANALYZE temp tables for large queries, if it allowed the planner to provide me the best plan in some scenarios... but this is another topic, sorry for the digression. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Tue, May 10, 2011 at 9:34 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> So, what do we need in order to find our way to index-only scans? >> >> 1. The visibility map needs to be crash-safe. The basic idea of >> index-only scans is that, instead of checking the heap to find out >> whether each tuple is visible, we first check the visibility map. If >> the visibility map bit is set, then we know all tuples on the page are >> visible to all transactions, and therefore the tuple of interest is >> visible to our transaction. Assuming that a significant number of >> visibility map bits are set, this should enable us to avoid a fair >> amount of I/O, especially on large tables, because the visibility map >> is roughly 8000 times smaller than the heap, and therefore far more >> practical to keep in cache. However, before we can rely on the > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > heap pages. This is important because we rely on this compactness in > hope that the WAL logging of this information will not be burdensome. I accuse you of bad math. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 10:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That will be true only if you intentionally ignore the points Greg > raised. If the table isn't entirely ALL_VISIBLE, then the choice of > index will determine the ordering of the actual table probes that occur. > There could be more or fewer page reads, in a more or less optimal > order, depending on the index used. However, note that this wasn't one of the cases I said I was going to try to optimize in the first go-around anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 11, 2011 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Completely agree, but why are you saying that to me? > > When Tom asks me why I suggest something, nobody tells him "its a free > software project etc...". > > What is the difference here? We're now 40 emails in this thread, and there seems to be far more heat than light here. Here's an attempt at a summary: - Simon wants proof that the performance benefit of this feature is worth any downsides it may have, which is standard procedure, and isn't certain the feature will have a significant performance benefit. - Numerous other people think Simon's doubts about the feature are poorly justified (and some of them also think he's being a pain in the neck). - Various peripherally related topics, such as optimizing count(*), which is not part of the vision for the first go-round that I sketched out in my OP, and plan stability, which is another issue entirely, have been discussed. - Meanwhile, only one person has done any review of the actual code that's been written, which is posted on the crash-safe visibility map thread, which may be why multiple people seem confused about what it does. - And no one has done any benchmarking of that code. I think it would be really helpful if some more people would review the crash-safe visibility map patch, and if at least one person could benchmark it. It would be useful to know (a) whether that noticeably slows down the system during inserts, updates, and deletes, especially at very high concurrency; and (b) how much of an impact the additional WAL-logging has on VACUUM. On the other hand, arguing about whether index-only scans are going to result in a significant performance benefit is not useful. I am going to be both surprised and disappointed if they don't, but there's only one way to find out, and a theoretical argument isn't it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Nicolas Barbier wrote: > 2011/5/11, Bruce Momjian <bruce@momjian.us>: > > > FYI, because the visibility map is only one _bit_ per page, it is 8000 * > > 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of > > heap pages. > > Actually, that would be "one 8kB block covers 512MB of heap": 1 block > of visibility map (8kB) = 64k visibility bits = covers 64k blocks = > covers 512MB of heap. The cost of keeping the visibility map in cache > is therefore totally negligible, only the cost of WAL logging changes > to it is of interest. Ah, yes, thanks, even better. > > This is important because we rely on this compactness in hope that > > the WAL logging of this information will not be burdensome. > > The size of on entry in the map (1 bit) is not very related to the WAL > overhead required per change of such a bit (i.e., the log record for a > 1 bit change will certainly be way more than 1 bit). True. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Simon Riggs wrote: > On Wed, May 11, 2011 at 1:47 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Greg Stark wrote: > >> On a separate note though, Simon, I don't know what you mean by "we > >> normally start with a problem". It's an free software project and > >> people are free to work on whatever interests them whether that's > >> because it solves a problem they have, helps a client who's paying > >> them, or just because it's of academic interest to them. We don't > >> always take their patches if they aren't of general interest but > >> people propose all kinds of crazy experimental ideas all the time. > > > > I am confused by Simon's questions too. > > > > Simon seems to regularly argue for adding features late in the > > development cycle and backpatch things no one else thinks should be > > backpatched, but he wants more research that index-only scans are going > > to improve things before it is implemented? The first is aggressive > > development, the second is very conservative development --- they don't > > match, so I now wonder what the motivation is since it isn't consistent. > > Not really sure why reasonable technical skepticism should become > personal commentary. > > You don't question Tom's motives if he is skeptical of an idea of > mine. Why would you question my motivation? What is *your* motive for > acting like that? Tom is consistent in his level of aggressive/conservative development suggestions. What I am seeing are many cases where you are consistently pushing for something even though you get almost-overwhelming rejection, and you keep going. And if it was consistent in one direction, I could understand because maybe you feel we are too conservative, but if it isn't consistent, I have no idea how to learn or adjust to your approach. We clearly have some people on one side of the conservative/aggressive spectrum, and some on the other side.
Now, I am willing to admit I might be totally wrong, but it has risen to a level that I felt I should say something in case it is helpful. > I'm not driven by one setting of "conservatism", but I am interested > in adding fully usable features that bring credit to the project. If I > see a feature that can have minor things added to it to improve them, > then I raise that during beta. If I see things being worked out that > sounds dubious, I mention that in early development. Yes, that seems fine to me, as stated. > I don't think this work will materially improve the speed of count(*) > in majority of cases. This introduces extra overhead into the code I think this is the only hope we have of improving count(*) in an active MVCC system. It might not work, but it has been our only hope for improving count(*) for a while. > path and that can be a net loss. The only time it will help is when > you have a large table that is not cached and also not recently > updated. Is count(*) run very often against such tables? Do we really > care enough to optimise that use case with lots of special purpose > code? The very fact that Kevin and yourself bring up different reasons > for why we need this feature makes me nervous. Yes, no question. For count(*), you don't care about the indexed values, only the count, while for Kevin's case you are reading values from the index. I assume (or hope) that one or both will be a win for this feature. > The analysis has not been done yet, and all I have done is request that. I think we are going to have to write the code and see the performance hit and where it is a win. Ideally we could figure this out beforehand, but I don't think that is possible in this case. If you look at the research in reducing the load of updating the hint bits, again, it is so complex that only working code and testing will show whether improvement is possible there.
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Cédric Villemain wrote: > 2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: > > Simon Riggs <simon@2ndQuadrant.com> wrote: > >> The typical speed up for non-covered indexes will come when we > >> access a very large table (not in cache) via an index scan that is > >> smaller than a bitmapindex scan. Will we be able to gauge > >> selectivities sufficiently accurately to be able to pinpoint that > >> during optimization? How will we know that the table is not in > >> cache? Or is this an optimisation in the executor for a bitmapheap > >> scan? > > > > I would continue to object to using current cache contents for plan > > choice because of plan instability and the fact that an odd initial > > cache load could skew plans in a bad direction indefinitely. I do > > agree (and have already posted) that I think the hardest part of > > this might be developing a good cost model. I doubt that's an > > insoluble problem, especially since it is something we can refine > > over time as we gain experience with the edge cases. > > you will have the same possible instability in planning with the > index(-only?) scan because we may need to access heap anyway and this > needs is based on estimation, or I miss something ? I understood the > idea was just to bypass the heap access *if* we can for *this* > heap-page. > > In reality, I am not really scared by plan instability because of a > possible PG/OS cache estimation. The percentages remain stable in my > observations ... I don't know yet how it will go for vis map. > > And, we already have plan instability currently, which is *good* : at > some point a seq scan is better than an bitmap heap scan. Because the > relation size change and because ANALYZE re-estimate the distribution > of the data. I will be very happy to issue ANALYZE CACHE as I have to > ANALYZE temp table for large query if it allows the planner to provide > me the best plan in some scenario...but this is another topic, sorry > for the digression..
Good point --- we would be making plan decisions based on the visibility map coverage. The big question is whether visibility map changes are more dynamic than the values we already plan against, like rows in the table, table size, and value distributions. I don't know the answer. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> wrote: >> The very fact that Kevin and yourself bring up different reasons >> for why we need this feature makes me nervous. > > Yes, no question. For count(*), you don't care about the indexed > values, only the count, while for Kevin's case you are reading > values from the index. [sigh] I'm reluctant to draw out this digression further, but there is a possibly-useful point to be made here: these are not two different things. A covering index can be considered whenever the set of columns referenced in the query is contained inside the set of columns in the index. The fact that the set of columns needed by count(*) is the empty set merely means that it is covered by any index, since the empty set is contained in every set. Now, this special case may make for an easy initial target in implementation, or allow early benchmarking. If so, all the better to go there first. I'm not sure why anyone would stop there, though; if it pays off for that simple case it is likely to pay off for the more general case, too. -Kevin
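Kevin's containment rule can be written down directly in terms of column sets (a sketch; the column names are invented for illustration):

```python
def index_covers(query_columns, index_columns):
    """An index is a candidate for an index-only scan whenever the set of
    columns the query references is contained in the set of indexed columns."""
    return set(query_columns) <= set(index_columns)

# An index on (a, b) covers a query referencing only a and b:
print(index_covers({"a", "b"}, {"a", "b"}))  # True
# count(*) references no columns; the empty set is contained in every
# set, so count(*) is "covered" by any index at all:
print(index_covers(set(), {"x"}))            # True
# But a query that also references c is not covered by an index on (a, b):
print(index_covers({"a", "c"}, {"a", "b"}))  # False
```

This makes Kevin's point concrete: count(*) is not a separate feature, just the degenerate case of the same containment test.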
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 10.05.2011 20:15, Simon Riggs wrote: >> I can picture that. Regrettably, I can also picture the accesses to >> the visibility map, the maintenance operations on the VM that are >> needed for this and the contention that both of those will cause. > I agree that we need to do tests to demonstrate that there's a gain from > the patch, once we have a patch to test. I would be very surprised if > there isn't, but that just means the testing is going to be easy. I think Simon's point is that showing a gain on specific test cases isn't a sufficient argument. What we need to know about this sort of change is what is the distributed overhead that is going to be paid by *everybody*, whether their queries benefit from the optimization or not. And what fraction of real-world queries really do benefit, and to what extent. Isolated test cases (undoubtedly chosen to show off the optimization) are not adequate to form a picture of the overall cost and benefit. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > On 10.05.2011 20:15, Simon Riggs wrote: > >> I can picture that. Regrettably, I can also picture the accesses to > >> the visibility map, the maintenance operations on the VM that are > >> needed for this and the contention that both of those will cause. > > > I agree that we need to do tests to demonstrate that there's a gain from > > the patch, once we have a patch to test. I would be very surprised if > > there isn't, but that just means the testing is going to be easy. > > I think Simon's point is that showing a gain on specific test cases > isn't a sufficient argument. What we need to know about this sort of > change is what is the distributed overhead that is going to be paid by > *everybody*, whether their queries benefit from the optimization or not. > And what fraction of real-world queries really do benefit, and to what > extent. Isolated test cases (undoubtedly chosen to show off the > optimization) are not adequate to form a picture of the overall cost and > benefit. Yes, I assume we are going to need the same kind of tests we did for other invasive patches like serializable isolation level and hot standby. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think Simon's point is that showing a gain on specific test > cases isn't a sufficient argument. Ah, if that's what he's been trying to get at, I'm curious who disagrees with that. I wouldn't have thought anyone on this list would. > What we need to know about this sort of change is what is the > distributed overhead that is going to be paid by *everybody*, > whether their queries benefit from the optimization or not. Certainly we need to test whether Heikki is right in the previously non-quoted part of his post on this thread: >> Note that we already have the visibility map, and the accesses >> needed to update it are already there. Granted, we'll have to >> change the logic slightly to make it crash safe, but I don't >> expect that to add any meaningful overhead - the changes are >> going to be where the bits are set, ie. vacuum, not when the bits >> are cleared. Granted, we might also want to set the bits more >> aggressively once they're used by index-only-scans. But done >> correctly, just taking advantage of the VM that's already there >> shouldn't add overhead to other operations. > Isolated test cases (undoubtedly chosen to show off the > optimization) are not adequate to form a picture of the overall > cost and benefit. Well, first, that hardly seems fair. I have many times seen people make an effort to synthesize *worst* case benchmarks. Certainly any regular on this list would know it is pointless to show only a best case benchmark. Second, we really need to make development of a performance testing farm a priority at PGCon next week. The need for it just keeps coming up over and over. Third, Dan Ports has been working a great deal with DBT-2 running PostgreSQL for the SSI patch, both as a stress tool to flush out bugs and to get benchmarks numbers conforming to the published requirements of that benchmark. 
I know from off-list emails that it took a fair amount of work to get it running correctly with PostgreSQL in his environment. We should probably try to draw on that experience. (Of course that shouldn't be the *only* test in a performance testing farm, but it's a good one to include.) -Kevin
2011/5/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, May 11, 2011 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Completely agree, but why are you saying that to me? >> >> When Tom asks me why I suggest something, nobody tells him "its a free >> software project etc...". >> >> What is the difference here? > > We're now 40 emails in this thread, and there seems to be far more > heat than light here. Here's an attempt at a summary: > > - Simon wants proof that the performance benefit of this feature is > worth any downsides it may have, which is standard procedure, and > isn't certain the feature will have a significant performance benefit. > - Numerous other people think Simon's doubts about the feature are > poorly justified (and some of them also think he's being a pain in the > neck). > - Various peripherally related topics, such as optimizing count(*), > which is not part of the vision for the first go-round that I sketched > out in my OP, and plan stability, which is another issue entirely, > have been discussed. > - Meanwhile, only one person has done any review of the actual code > that's been written, which is posted on the crash-safe visibility map > thread, which may be why multiple people seem confused about what it > does. > - And no one has done any benchmarking of that code. Will you be able to do some? Or can you propose a simple process for benchmarking the patch efficiently? If reviewers agree it is safe and benchmarks show that no basic performance issues are raised, then let's see whether the next items have blockers or can be done. > > I think it would be really helpful if some more people would review > the crash-safe visibility map patch, and if at least one person could > benchmark it. It would be useful to know (a) whether that noticeably > slows down the system during inserts, updates, and deletes, especially > at very high concurrency; and (b) how much of an impact the additional > WAL-logging has on VACUUM.
> On the other hand, arguing about whether index-only scans are going to result in a significant performance benefit is not useful. I am going to be both surprised and disappointed if they don't, but there's only one way to find out, and a theoretical argument isn't it. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Fri, May 13, 2011 at 6:34 PM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > Will you be able to do some ? or can you propose a simple process to > do efficient benchmark of the patch ? I will probably do some benchmarking at some point, unless someone else goes nuts and makes it moot before I get to that point. I think the main thing is to just apply the patch and beat up the server, and see if it's any slower than it was before. > If reviewers agree it is safe and benchmarks make evidences that no > basic performance issue are raised, then let's see if next items have > blockers or can be done. Sounds right to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/14 Robert Haas <robertmhaas@gmail.com>: > On Fri, May 13, 2011 at 6:34 PM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >> If reviewers agree it is safe and benchmarks make evidences that no >> basic performance issue are raised, then let's see if next items have >> blockers or can be done. > > Sounds right to me. and recent stuff on VM will allow us to move forward it seems ! -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
2011/5/11 Bruce Momjian <bruce@momjian.us>: > Cédric Villemain wrote: >> 2011/5/10 Kevin Grittner <Kevin.Grittner@wicourts.gov>: >> > Simon Riggs <simon@2ndQuadrant.com> wrote: >> >> The typical speed up for non-covered indexes will come when we >> >> access a very large table (not in cache) via an index scan that is >> >> smaller than a bitmapindex scan. Will we be able to gauge >> >> selectivities sufficiently accurately to be able to pinpoint that >> >> during optimization? How will we know that the table is not in >> >> cache? Or is this an optimisation in the executor for a bitmapheap >> >> scan? >> > >> > I would continue to object to using current cache contents for plan >> > choice because of plan instability and the fact that an odd initial >> > cache load could skew plans in a bad direction indefinitely. I do >> > agree (and have already posted) that I think the hardest part of >> > this might be developing a good cost model. I doubt that's an >> > insoluble problem, especially since it is something we can refine >> > over time as we gain experience with the edge cases. >> >> you will have the same possible instability in planning with the >> index(-only?) scan because we may need to access heap anyway and this >> needs is based on estimation, or I miss something ? I understood the >> idea was just to bypass the heap access *if* we can for *this* >> heap-page. >> >> In reality, I am not really scared by plan instability because of a >> possible PG/OS cache estimation. The percentages remain stable in my >> observations ... I don't know yet how it will go for vis map. >> >> And, we already have plan instability currently, which is *good* : at >> some point a seq scan is better than an bitmap heap scan. Because the >> relation size change and because ANALYZE re-estimate the distribution >> of the data.
>> I will be very happy to issue ANALYZE CACHE as I have to >> ANALYZE temp table for large query if it allows the planner to provide >> me the best plan in some scenario...but this is another topic, sorry >> for the digression.. > > Good point --- we would be making plan decisions based on the visibility > map coverage. The big question is whether visibility map changes are > more dynamic than the values we already plan against, like rows in the > table, table size, and value distributions. I don't know the answer. > Robert, I thought of a covered index as just a usage of the visibility map: don't fetch the heap block if it isn't needed. This looks easier to do and better in the long term (because a read-only DB may quickly turn into a no-heap-access DB, for example); thus it is not a real covered index. Did you want to implement real covered indexes, and do you have ideas on how to do that? Or just an optimization of the current planner/executor on index usage? ___ I don't know the VM internals: * do we have a counter of ALL_VISIBLE flags set on a relation? (this should be very good for the planner) * do we need a pg_class.rel_vmvisible?! (I have my hands up, don't shoot pleeaase) * is it OK to parse the VM for planning (I believe it is not)? Ideas? -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Sun, Jun 19, 2011 at 10:44 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: > and recent stuff on VM will allow us to move forward it seems ! Yep, looks promising. The heap_hot_search_buffer refactoring patch is related as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 19, 2011 at 11:12 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote: >> Good point --- we would be making plan decisions based on the visibility >> map coverage. The big question is whether visibility map changes are >> more dynamic than the values we already plan against, like rows in the >> table, table size, and value distributions. I don't know the answer. > > Robert, I though of Covered-Index as just a usage of the vis map: > don't take the heap block if not needed. This look easier to do and > better in the long term (because read-only DB may quickly turn into a > no-heap access DB for example). Thus this is not real covered-index. > Did you want to implement real covered-index and did you have ideas on > how to do that ? Or just an optimization of the current > planner/executor on index usage ? If by a "real" covered index you mean one that includes visibility info in the index - I have no plans to work on anything like that. If we were to do that, the index would become much larger and less efficient, whereas the approach of just optimizing the way our existing indexes are used doesn't have that disadvantage. It also sounds like a lot of work. Now, if someone else wants to demonstrate that it has advantages that are worth the costs and go do it, more power to said person, but I'm unexcited about it. > I don't know VM internals: > > * do we have a counter of ALL_VISIBLE flag set on a relation ? (this > should be very good for planner) > * do we need a pg_class.rel_vmvisible ?! (I have hands up, don't > shoot pleeaase) Evidently I'm developing a more frightening reputation than I would hope. :-( Anyway, yes, I do believe we need a table-level statistic for the percentage of the visibility map bits that are believed to be set. Having said that I think we need it, let me also say that I'm a bit skeptical about how well it will work. There are two problems: 1. Consider a query like "SELECT a, b FROM foo WHERE a = 1". 
To accurately estimate the cost of executing this query via an index-only scan (on an index over foo (a, b)), we need to know (i) the percentage of rows in the table for which a = 1 and (ii) the percentage *of those rows* which are on an all-visible page. We can assume that if 80% of the rows in the table are on all-visible pages, then 80% of the rows returned by this query will be on all-visible pages also, but that might be wildly wrong. This is similar to the problem of costing "SELECT * FROM foo WHERE a = 1 AND b = 1" - we know the fraction of rows where a = 1 and the fraction where b = 1, but there's no certainty that multiplying those values will produce an accurate estimate for the conjunction of those conditions. The problem here is not as bad as the general multi-column statistics problem because a mistake will only bollix the cost, not the row count estimate, but it's still not very nice. 2. Since VACUUM and ANALYZE often run together, we will be estimating the percentage of rows on all-visible pages just at the time when that percentage is highest. This is not exactly wonderful, either... I have a fair amount of hope that even with these problems we can come up with some adjustment to the planner that is better than just ignoring the problem, but I am not sure how difficult it will be. > * is it ok to parse VM for planning (I believe it is not) ? It doesn't seem like a good idea to me, but I just work here. I'm not sure what that would buy us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
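The independence assumption Robert describes reduces to a simple multiplication. A minimal sketch (illustrative Python, not planner code; the function name and the numbers are invented):

```python
# Illustrative sketch of the independence assumption described above
# (invented function, not PostgreSQL planner code): estimate the fraction
# of rows an index-only scan can serve without a heap fetch by multiplying
# the predicate selectivity by the table-wide all-visible fraction.

def estimate_index_only_fraction(selectivity, all_visible_frac):
    # Assumes matching rows are spread uniformly across the table's pages.
    return selectivity * all_visible_frac

# 5% of rows have a = 1, and 80% of rows sit on all-visible pages, so we
# plan as if 80% of the matching rows can skip the heap fetch. If a = 1
# instead marks recently inserted rows, the true figure could be near 0%.
print(estimate_index_only_fraction(0.05, 0.80))  # ~0.04 of all rows
```

The same multiplication is exactly what goes "wildly wrong" when the predicate correlates with row age, as in the a = 1 example above.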
2011/6/19 Robert Haas <robertmhaas@gmail.com>: > On Sun, Jun 19, 2011 at 11:12 AM, Cédric Villemain > <cedric.villemain.debian@gmail.com> wrote: >>> Good point --- we would be making plan decisions based on the visibility >>> map coverage. The big question is whether visibility map changes are >>> more dynamic than the values we already plan against, like rows in the >>> table, table size, and value distributions. I don't know the answer. >> >> Robert, I though of Covered-Index as just a usage of the vis map: >> don't take the heap block if not needed. This look easier to do and >> better in the long term (because read-only DB may quickly turn into a >> no-heap access DB for example). Thus this is not real covered-index. >> Did you want to implement real covered-index and did you have ideas on >> how to do that ? Or just an optimization of the current >> planner/executor on index usage ? > > If by a "real" covered index you mean one that includes visibility > info in the index - I have no plans to work on anything like that. If > we were to do that, the index would become much larger and less > efficient, whereas the approach of just optimizing the way our > existing indexes are used doesn't have that disadvantage. It also > sounds like a lot of work. Now, if someone else wants to demonstrate > that it has advantages that are worth the costs and go do it, more > power to said person, but I'm unexcited about it. Yes, I was thinking of that, and I agree with you. > >> I don't know VM internals: >> >> * do we have a counter of ALL_VISIBLE flag set on a relation ? (this >> should be very good for planner) >> * do we need a pg_class.rel_vmvisible ?! (I have hands up, don't >> shoot pleeaase) > > Evidently I'm developing a more frightening reputation than I would hope. :-( Nah, I was joking :) don't worry! Probably because I have already proposed one new GUC and at least one new column for pg_class recently
(and we're not used to changing things that frequently). > > Anyway, yes, I do believe we need a table-level statistic for the > percentage of the visibility map bits that are believed to be set. > Having said that I think we need it, let me also say that I'm a bit > skeptical about how well it will work. There are two problems: > > 1. Consider a query like "SELECT a, b FROM foo WHERE a = 1". To > accurately estimate the cost of executing this query via an index-only > scan (on an index over foo (a, b)), we need to know (i) the percentage > of rows in the table for which a = 1 and (ii) the percentage *of those > rows* which are on an all-visible page. We can assume that if 80% of > the rows in the table are on all-visible pages, then 80% of the rows > returned by this query will be on all-visible pages also, but that > might be wildly wrong. This is similar to the problem of costing > "SELECT * FROM foo WHERE a = 1 AND b = 1" - we know the fraction of > rows where a = 1 and the fraction where b = 1, but there's no > certainty that multiplying those values will produce an accurate > estimate for the conjunction of those conditions. The problem here is > not as bad as the general multi-column statistics problem because a > mistake will only bollix the cost, not the row count estimate, but > it's still not very nice. > > 2. Since VACUUM and ANALYZE often run together, we will be estimating > the percentage of rows on all-visible pages just at the time when that > percentage is highest. This is not exactly wonderful, either... > > I have a fair amount of hope that even with these problems we can come > up with some adjustment to the planner that is better than just > ignoring the problem, but I am not sure how difficult it will be. > >> * is it ok to parse VM for planning (I believe it is not) ? > > It doesn't seem like a good idea to me, but I just work here. I'm not > sure what that would buy us.
All true, and I won't be unhappy to have the feature as a bonus, not expected by the planner (for the cost part) but handled by the executor. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Jun19, 2011, at 20:40 , Robert Haas wrote: > 2. Since VACUUM and ANALYZE often run together, we will be estimating > the percentage of rows on all-visible pages just at the time when that > percentage is highest. This is not exactly wonderful, either... Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than VACUUM by default? best regards, Florian Pflug
On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun19, 2011, at 20:40 , Robert Haas wrote: >> 2. Since VACUUM and ANALYZE often run together, we will be estimating >> the percentage of rows on all-visible pages just at the time when that >> percentage is highest. This is not exactly wonderful, either... > > Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than > VACUUM by default? The autoanalyze threshold is, by default, 10%; and the autovacuum threshold, 20%. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
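For reference, those percentages come from the autovacuum trigger formulas. A sketch of the arithmetic with the default GUC values (`autovacuum_vacuum_scale_factor` = 0.2, `autovacuum_analyze_scale_factor` = 0.1, both base thresholds 50); this is illustrative Python, not the actual autovacuum code:

```python
# Sketch of the autovacuum trigger formulas behind the 10%/20% figures
# above, using the default GUC values (autovacuum_*_threshold = 50,
# vacuum scale factor 0.2, analyze scale factor 0.1). Illustrative
# Python, not the actual autovacuum code.

VACUUM_THRESHOLD, VACUUM_SCALE = 50, 0.2
ANALYZE_THRESHOLD, ANALYZE_SCALE = 50, 0.1

def needs_autovacuum(dead_tuples, reltuples):
    return dead_tuples > VACUUM_THRESHOLD + VACUUM_SCALE * reltuples

def needs_autoanalyze(changed_tuples, reltuples):
    return changed_tuples > ANALYZE_THRESHOLD + ANALYZE_SCALE * reltuples

# On a 1M-row table, 150,000 changed tuples trigger ANALYZE but not
# VACUUM, so ANALYZE-only runs do happen between vacuums.
print(needs_autoanalyze(150_000, 1_000_000))  # True
print(needs_autovacuum(150_000, 1_000_000))   # False
```

So roughly every other ANALYZE runs without an accompanying VACUUM, which is what makes Florian's suggestion workable.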
On Jun19, 2011, at 23:16 , Robert Haas wrote: > On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: >> On Jun19, 2011, at 20:40 , Robert Haas wrote: >>> 2. Since VACUUM and ANALYZE often run together, we will be estimating >>> the percentage of rows on all-visible pages just at the time when that >>> percentage is highest. This is not exactly wonderful, either... >> >> Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than >> VACUUM by default? > > The autoanalyze threshold is, by default, 10%; and the autovacuum > threshold, 20%. Hm, so you could ignore (or rather dampen) the results of VACUUM+ANALYZE and rely on the ANALYZE-only runs to keep the estimate correct. Still doesn't sound that bad... best regards, Florian Pflug
On Sun, Jun 19, 2011 at 7:59 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun19, 2011, at 23:16 , Robert Haas wrote: >> On Sun, Jun 19, 2011 at 5:10 PM, Florian Pflug <fgp@phlo.org> wrote: >>> On Jun19, 2011, at 20:40 , Robert Haas wrote: >>>> 2. Since VACUUM and ANALYZE often run together, we will be estimating >>>> the percentage of rows on all-visible pages just at the time when that >>>> percentage is highest. This is not exactly wonderful, either... >>> >>> Hm, doesn't autovacuum run ANALYZE quite a bit more frequently than >>> VACUUM by default? >> >> The autoanalyze threshold is, by default, 10%; and the autovacuum >> threshold, 20%. > > Hm, so you could ignore (or rather dampen) the results of > VACUUM+ANALYZE and rely on the ANALYZE-only runs to keep > the estimate correct. Still doesn't sound that bad... Yeah, there are lots of possible approaches. You could try to keep a count of how many visibility map bits had been cleared since the last run... and either adjust the estimate directly or use it to trigger an ANALYZE (or some limited ANALYZE that only looks at visibility map bits). You could gather statistics on how often the queries that are actually running are finding the relevant visibility map bits set, and use that to plan future queries. You could do what you're suggesting... and there are probably other options as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
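The first of those approaches - decaying the ANALYZE-time estimate by the number of bits cleared since - might look something like this. Everything here (names, formula) is hypothetical; nothing like it exists in PostgreSQL:

```python
# Hypothetical sketch of the first approach above: remember the
# all-visible fraction measured at the last ANALYZE, and decay it by the
# number of visibility map bits cleared since then. All names and the
# formula are invented for illustration; no such code exists in PostgreSQL.

def current_all_visible_estimate(frac_at_analyze, relpages, bits_cleared_since):
    # Each cleared bit turns one more page non-all-visible.
    if relpages == 0:
        return 0.0
    return max(0.0, frac_at_analyze - bits_cleared_since / relpages)

# ANALYZE saw 90% of 10,000 pages all-visible; 3,000 bits were cleared
# since, so plan as if only ~60% of pages are still all-visible.
print(current_all_visible_estimate(0.9, 10_000, 3_000))  # ~0.6
```

A decayed estimate like this also addresses problem 2 above: the stored fraction is no longer frozen at its post-VACUUM peak.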
On Tue, May 10, 2011 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> >> Any thoughts welcome. Incidentally, if anyone else feels like working >> >> on this, feel free to let me know and I'm happy to step away, from all >> >> of it or from whatever part someone else wants to tackle. I'm mostly >> >> working on this because it's something that I think we really need to >> >> get done, more than having a burning desire to be the one who does it. >> > >> > Indexonly scans are welcome! >> > I believe I can help on 3 and 4, but (really) not sure for 1 and 2. >> >> Well, I have code for #1, and just need reviews, and #2 shouldn't be >> that hard, and with luck I'll twist Bruce's arm into doing it (*waves >> to Bruce*). So #3 and #4 are the next thing to tackle. Any thoughts >> on what/how you'd like to contribute there? > > I am happy to have pg_upgrade skip upgrading visibility map files --- it > already has code to conditionally process them because they only exist > in >= 8.4: > > /* fsm/vm files added in PG 8.4 */ > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) > { > /* > * Copy/link any fsm and vm files, if they exist > */ > > Just give the word and it will be done. I hereby give the word. :-) Specifically, we need to skip copying vm files (only) if coming from a version prior to this commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Note that we already have the visibility map, and the accesses needed to update it are already there. Granted, we'll have to change the logic slightly to make it crash safe, but I don't expect that to add any meaningful overhead - the changes are going to be where the bits are set, i.e. vacuum, not where the bits are cleared. Granted, we might also want to set the bits more aggressively once they're used by index-only scans. But done correctly, just taking advantage of the VM that's already there shouldn't add overhead to other operations.
I agree that we need to do tests to demonstrate that there's a gain from the patch, once we have a patch to test. I would be very surprised if there isn't, but that just means the testing is going to be easy.
--
Heikki Linnakangas
I could see some arguments supporting this feature, citing covering indexes as an example. But I just want to highlight that they are not the same. The visibility map approach is not related to the covering-index approach at all, except for the shared intention of avoiding the heap fetch. Because of the map's small size, we will be having more contention (people who have worked with Oracle can take the example of a bitmap index on an OLTP database). I previously made the suggestion to make these crash-safe visibility maps optional per table, so that the overhead which comes with them can be avoided for those tables whose queries don't benefit from index-only scans. The fact that the proposal is for the crash-safe visibility map to become a default part of any PostgreSQL table will definitely have wide-ranging implications on OLTP performance.
Gokul.
On Fri, Aug 19, 2011 at 9:19 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > The fact that the > proposal is for crash safe visibility map, to become a default package of > any Postgresql table will definitely have wide ranging implications on OLTP > performance. Well, that would certainly be alarming if true, but I don't think it is. As far as I can see, the overhead of making the visibility map crash-safe is just (1) a very small percentage increase in the work being done by VACUUM and (2) a slight possibility of extra work done by a foreground process if the visibility map bit changes at almost exactly the same time the process was about to insert, update, or delete a tuple. If someone comes up with a test where this overhead is enough to measure, then we might need to rethink our whole approach. Maybe we would make it an optional feature, or maybe we would just rip it out and start over with some sort of redesign, or maybe we would look for other optimizations to counterbalance the additional overhead. I don't know. But as far as I can see you're hypothesizing without evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > > I am happy to have pg_upgrade skip upgrading visibility map files --- it > > already has code to conditionally process them because they only exist > > in >= 8.4: > > > > /* fsm/vm files added in PG 8.4 */ > > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) > > { > > /* > > * Copy/link any fsm and vm files, if they exist > > */ > > > > Just give the word and it will be done. > > I hereby give the word. :-) > > Specifically, we need to skip copying vm files (only) if coming from a > version prior to this commit: > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c Done with the attached, applied patch. There was no cat-version bump from that commit (because the format didn't change, just the crash-safeness) so I picked the first cat-version change after this commit. This is only a pg_upgrade 9.2+ issue. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index 6def748..a19b3df
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 64,69 ****
--- 64,75 ----
  #define TABLE_SPACE_SUBDIRS_CAT_VER 201001111

  /* postmaster/postgres -b (binary_upgrade) flag added during PG 9.1 development */
  #define BINARY_UPGRADE_SERVER_FLAG_CAT_VER 201104251

+ /*
+  * Visibility map changed with this 9.2 commit,
+  * 8f9fe6edce358f7904e0db119416b4d1080a83aa; pick later catalog version.
+  */
+ #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
+
  /*
   * Each relation is represented by a relinfo structure.

diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index d4a420f..df752c5
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 120,128 ****
      int         numFiles = 0;
      int         mapnum;
      int         fileno;
!     old_dir[0] = '\0';

      for (mapnum = 0; mapnum < size; mapnum++)
      {
          char        old_file[MAXPGPATH];
--- 120,134 ----
      int         numFiles = 0;
      int         mapnum;
      int         fileno;
!     bool        vm_crashsafe_change = false;
!     old_dir[0] = '\0';

+     /* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
+     if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
+         new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
+         vm_crashsafe_change = true;
+
      for (mapnum = 0; mapnum < size; mapnum++)
      {
          char        old_file[MAXPGPATH];
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 168,175 ****
      for (fileno = 0; fileno < numFiles; fileno++)
      {
          if (strncmp(namelist[fileno]->d_name, scandir_file_pattern,
!                     strlen(scandir_file_pattern)) == 0)
          {
              snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
                       namelist[fileno]->d_name);
--- 174,189 ----
      for (fileno = 0; fileno < numFiles; fileno++)
      {
+         char       *vm_offset = strstr(namelist[fileno]->d_name, "_vm");
+         bool        is_vm_file = false;
+
+         /* Is a visibility map file? (name ends with _vm) */
+         if (vm_offset && strlen(vm_offset) == strlen("_vm"))
+             is_vm_file = true;
+
          if (strncmp(namelist[fileno]->d_name, scandir_file_pattern,
!                     strlen(scandir_file_pattern)) == 0 &&
!             (!is_vm_file || !vm_crashsafe_change))
          {
              snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
                       namelist[fileno]->d_name);
On Fri, Aug 19, 2011 at 11:22 AM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> > I am happy to have pg_upgrade skip upgrading visibility map files --- it >> > already has code to conditionally process them because they only exist >> > in >= 8.4: >> > >> > /* fsm/vm files added in PG 8.4 */ >> > if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804) >> > { >> > /* >> > * Copy/link any fsm and vm files, if they exist >> > */ >> > >> > Just give the word and it will be done. >> >> I hereby give the word. :-) >> >> Specifically, we need to skip copying vm files (only) if coming from a >> version prior to this commit: >> >> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e16954f3d27fa8e16c379ff6623ae18d6250a39c > > Done with the attached, applied patch. There was no cat-version bump > from that commit (because the format didn't change, just the > crash-safeness) so I picked the first cat-version change after this > commit. This is only a pg_upgrade 9.2+ issue. Thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Well, that would certainly be alarming if true, but I don't think it
> is. As far as I can see, the overhead of making the visibility map
> crash-safe is just (1) a very small percentage increase in the work
> being done by VACUUM and (2) a slight possibility of extra work done
> by a foreground process if the visibility map bit changes at almost
> exactly the same time the process was about to insert, update, or
> delete a tuple.
Let's forget the overhead posed by vacuum. Can you please point me to the design that describes the second overhead in detail?
Thanks.
If you are following the same design that Heikki put forward, then there is a problem with it in keeping the bits in the page and the bits in the visibility map in sync, which we have already discussed.
On 19.08.2011 21:06, Gokulakannan Somasundaram wrote: > If you are following the same design that Heikki put forward, then there is > a problem with it in maintaining the bits in page and the bits in visibility > map in sync, which we have already discussed. Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 19.08.2011 21:06, Gokulakannan Somasundaram wrote:
>> If you are following the same design that Heikki put forward, then there is
>> a problem with it in maintaining the bits in page and the bits in visibility
>> map in sync, which we have already discussed.
>
> Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record.
In that case, say 100 sessions are trying to update records which fall under the 8000*4 heap pages (I assume 2 bits per heap page in the visibility map, so 8 * 1024 * 4 exactly) covered by one page of the visibility map; won't it make the 99 sessions wait for that visibility map page while holding the exclusive locks on the 99 heap pages?
Are we going to say that these kinds of situations occur very rarely? Or that the loss of scalability in these situations is worth the performance during read-heavy workloads?
In any case, making a database go through all this extra overhead when it doesn't even use any index-only scans definitely should be given some thought.
Gokul.
On Sat, Aug 20, 2011 at 2:51 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> Are you referring to this: http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I believe Robert's changes to make the visibility map crash-safe cover that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record.
>
> In that case, say 100 sessions are trying to update records which fall under the 8000*4 heap pages (I assume 2 bits per heap page in the visibility map, so 8 * 1024 * 4 exactly) covered by one page of the visibility map; won't it make the 99 sessions wait for that visibility map page while holding the exclusive locks on the 99 heap pages?
>
> Are we going to say that these kinds of situations occur very rarely? Or that the loss of scalability in these situations is worth the performance during read-heavy workloads?
>
> In any case, making a database go through all this extra overhead when it doesn't even use any index-only scans definitely should be given some thought.
>
> Gokul.
Please consider that the update I mentioned above occurs and changes the visibility bit of the page.
On Fri, Aug 19, 2011 at 2:51 PM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > On Sat, Aug 20, 2011 at 2:25 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> >> On 19.08.2011 21:06, Gokulakannan Somasundaram wrote: >>> >>> If you are following the same design that Heikki put forward, then there >>> is >>> a problem with it in maintaining the bits in page and the bits in >>> visibility >>> map in sync, which we have already discussed. >> >> Are you referring to this: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php ? I >> believe Robert's changes to make the visibility map crash-safe covers that. >> Clearing the bit in the visibility map now happens within the same critical >> section as clearing the flag on the heap page and writing th WAL record. >> > In that case, say a 100 sessions are trying to update records which fall > under the 8000*4 heap pages( i assume 2 bits per visibility map - 8 * 1024 * > 4 exact) covered by one page of visibility map, There are about 8000 visibility map bytes per page, so about 64000 bits, each covering one page. So a visibility map page covers about 512MB of heap. > won't it make the 99 > sessions wait for that visibility map while holding the exclusive lock on > the 99 heap pages? Hmm, you have a point. If 100 backends simultaneously write to 100 different pages, and all of those pages are all-visible, then it's possible that they could end up fighting over the buffer content lock on the visibility map page. But why would you expect that to matter? In a heavily updated table, the proportion of visibility map bits that are set figures to be quite low, since they're only set during VACUUM. To have 100 backends simultaneously pick different pages to write, each of which is all-visible, seems really unlucky. Even if it does happen from time to time, I suspect the effects would be largely masked by WALInsertLock contention.
The visibility map content lock is only taken very briefly, whereas the operations protected by WALInsertLock are much more complex. This does, however, remind me of two other points: 1. Heikki's idea of trying to set visibility map bits more aggressively is probably a good one, but it would be possible to overdo it, because setting visibility map bits is not free. It has an immediate cost - in that we have to write xlog - and a deferred cost - in that it will impose overhead when those pages are re-dirtied. At the moment, I think we're probably too far in the opposite direction - i.e. we leave the visibility map bits unset for too long, leading to a massive amount of deferred work that gets done all at once when VACUUM finally runs. But we shouldn't overcorrect. 2. While we're tinkering with the visibility map, we should think about whether it makes sense to carve out some more bits for such purposes as we may in the future require. Even if we allowed each heap page a byte in the visibility map instead of a single bit, the visibility map would still be roughly 1000 times smaller than the heap; and if there are any situations where the page-level locks become choke points, this would mitigate that effect. There might also be some advantage in that bytes can be atomically set, while bits can't, although I can't immediately think how we'd leverage that. Alternatively, we could widen the field to something less than a full byte, like 2 or 4 bits, if that seems better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
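Robert's coverage figures are easy to check. A back-of-the-envelope sketch assuming the standard 8192-byte block size and a 24-byte page header (and ignoring any VM-specific reserved space):

```python
# Back-of-the-envelope check of the coverage figures above (a sketch;
# assumes the standard 8192-byte block size and a 24-byte page header,
# and ignores any VM-specific reserved space).

BLCKSZ = 8192
PAGE_HEADER = 24

usable_bytes = BLCKSZ - PAGE_HEADER          # VM payload bytes per VM page
bits_per_vm_page = usable_bytes * 8          # one bit per heap page
heap_bytes_per_vm_page = bits_per_vm_page * BLCKSZ

print(bits_per_vm_page)                         # 65344 ("about 64000")
print(heap_bytes_per_vm_page // (1024 * 1024))  # 510 MB ("about 512MB")

# Widening to one byte per heap page shrinks one VM page's coverage 8x,
# but the map stays roughly three orders of magnitude smaller than the heap:
print(usable_bytes * BLCKSZ // (1024 * 1024))   # 63 MB of heap per VM page
```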
On Fri, Aug 19, 2011 at 4:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, you have a point. If 100 backends simultaneously write to 100 > different pages, and all of those pages are all-visible, then it's > possible that they could end up fighting over the buffer content lock > on the visibility map page. But why would you expect that to matter? > In a heavily updated table, the proportion of visibility map bits that > are set figures to be quite low, since they're only set during VACUUM. > To have 100 backends simultaneously pick different pages to write > each of which is all-visible seems really unlucky. Even if it does > happen from time to time, I suspect the effects would be largely > masked by WALInsertLock contention. The visibility map content lock > is only taken very briefly, whereas the operations protected by > WALInsertLock are much more complex. Oh, snap. I see another possible problem here. At the time visibilitymap_clear() is called, we're already (and necessarily) holding a content lock on the buffer. And then we go get a content lock on the visibility map page, whose buffer number might be higher or lower than that of the heap page, possibly leading us to violate the rule that buffer content locks must be taken in increasing buffer number order. Maybe that's OK, because I can't see that we'd ever acquire any other buffer content lock while already holding a lock on the visibility map buffer. But given this logic, if we did do such a thing, it could result in an undetected deadlock. Hmm.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19.08.2011 23:02, Robert Haas wrote: > On Fri, Aug 19, 2011 at 2:51 PM, Gokulakannan Somasundaram > <gokul007@gmail.com> wrote: >> won't it make the 99 >> sessions wait for that visibility map while holding the exclusive lock on >> the 99 heap pages? > > Hmm, you have a point. If 100 backends simultaneously write to 100 > different pages, and all of those pages are all-visible, then it's > possible that they could end up fighting over the buffer content lock > on the visibility map page. But why would you expect that to matter? > In a heavily updated table, the proportion of visibility map bits that > are set figures to be quite low, since they're only set during VACUUM. > To have 100 backends simultaneously pick different pages to write > each of which is all-visible seems really unlucky. Even if it does > happen from time to time, I suspect the effects would be largely > masked by WALInsertLock contention. The visibility map content lock > is only taken very briefly, whereas the operations protected by > WALInsertLock are much more complex. The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 19.08.2011 23:17, Robert Haas wrote: > On Fri, Aug 19, 2011 at 4:02 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> Hmm, you have a point. If 100 backends simultaneously write to 100 >> different pages, and all of those pages are all-visible, then it's >> possible that they could end up fighting over the buffer content lock >> on the visibility map page. But why would you expect that to matter? >> In a heavily updated table, the proportion of visibility map bits that >> are set figures to be quite low, since they're only set during VACUUM. >> To have 100 backends simultaneously pick different pages to write >> each of which is all-visible seems really unlucky. Even if it does >> happen from time to time, I suspect the effects would be largely >> masked by WALInsertLock contention. The visibility map content lock >> is only taken very briefly, whereas the operations protected by >> WALInsertLock are much more complex. > > Oh, snap. I see another possible problem here. > > At the time visibilitymap_clear() is called, we're already (and > necessarily) holding a content lock on the buffer. And then we go get > a content lock on the visibility map page, whose buffer number might > be higher or lower than that of the heap page, possibly leading us to > violate the rule the buffer content locks must be taken increasing > buffer number order. Huh? The rule is that you have to acquire locks on heap pages in increasing page number order. That doesn't apply to the order between the heap and the visibility map. The rule we've established for that is that you have to acquire the lock on the heap page first, before locking the corresponding vm page. It would be good to add a comment about that to the header comment of RelationGetBufferForTuple(), there doesn't seem to be anything about the visibility map buffer arguments there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
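The rule Heikki describes - always lock the heap page before the corresponding visibility map page - is the standard fixed-order locking discipline for deadlock avoidance. A minimal generic illustration (plain Python threads, not PostgreSQL buffer manager code):

```python
import threading

# Minimal generic illustration of the fixed-order locking rule described
# above: every backend locks the heap page first and the visibility map
# page second, so two backends touching the same pair of pages can never
# deadlock. This is textbook lock ordering, not PostgreSQL code.

heap_page_lock = threading.Lock()
vm_page_lock = threading.Lock()
results = []

def clear_vm_bit(backend_id):
    with heap_page_lock:       # rule: heap page first...
        with vm_page_lock:     # ...then the visibility map page
            results.append(backend_id)

threads = [threading.Thread(target=clear_vm_bit, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3] -- every backend finished; no deadlock
```

Robert's concern above is precisely the case where some code path would take the locks in the opposite order; as long as no such path exists, the discipline holds.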
> The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers.
>
> --
> Heikki Linnakangas
a) First of all, it (the visibility map) should definitely have affected the scalability of Postgres in scenarios wherein updates occur during a batch time window. Maybe the increase in the speed of vacuums negates that effect.
b) Second, currently the index scans don't touch the visibility map, and in the future they are going to acquire a share lock on it. This should increase the contention.
c) Your statement: "The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not."
d) In addition, currently there is no WAL Logging, while the bit is cleared, which would not be the case in future and hence the exclusive lock held on the visibility map is going to be held for a longer time.
You might well see some performance improvement if you are testing this on anything with fewer than 4 cores. Bump up the core count and processor count and check whether this affects the load-throughput curve.
Thanks,
Gokul.
Hmm, you have a point. If 100 backends simultaneously write to 100
different pages, and all of those pages are all-visible, then it's
possible that they could end up fighting over the buffer content lock
on the visibility map page. But why would you expect that to matter?
In a heavily updated table, the proportion of visibility map bits that
are set figures to be quite low, since they're only set during VACUUM.
To have 100 backends simultaneously pick different pages to write
each of which is all-visible seems really unlucky. Even if it does
happen from time to time, I suspect the effects would be largely
masked by WALInsertLock contention. The visibility map content lock
is only taken very briefly, whereas the operations protected by
WALInsertLock are much more complex.
By your argument, if WALInsertLock is held for 't' seconds, you would have to be holding the visibility map lock for more than that time frame 't'. So the index scans have to wait to acquire the lock on the visibility map to check visibility. What we are trading off here is synchronization vs. I/O. Should we lose scalability for performance?
Gokul.
On Sat, Aug 20, 2011 at 4:48 PM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote:
> The above could already happen in 8.4, where the visibility map was introduced. The contention on the VM buffer would be just as bad whether you hold the heap page lock at the same time or not. I have not heard any complaints of contention on VM buffers.

I am talking about the contention time frame of the heap page. It will be increased. Consider my statement in conjunction with scenario 2.
By your argument, we can say that no one will create an index with a function like random(), time(), date(), broken operators, etc. It's hard to imagine an index into which a user will only want to insert and never select from. Even C++ provides the std::map infrastructure for objects, where the user owns the responsibility of writing operator< correctly. Other databases do the same. Going one step further, we already go back to such indexes if there are unique constraints or referential integrity constraints. But with all such caveats, it was decided we should not allow covering indexes, as they require going back to the index for updates/deletes.
On Sat, Aug 20, 2011 at 4:48 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > a) First of all, it(Visibility Map) should have definitely affected the > scalability of postgres in scenarios where in updates occur during a time > batch window. May be the increase in speed of vacuums negate that effect. I think that you have switched gears here and are in this paragraph talking about the 8.4-era visibility map changes rather than my recent work. There is zero evidence that those changes caused any performance problem. I've spent a large chunk of the last four months looking for scalability problems and I haven't come across any evidence that this is an issue. If you think it is, let's have a test case. > b) Second, currently the index scans don't touch the visibility map and in > future they are going to acquire share lock on that. This should increase > the contention. Maybe. Of course, we're only talking about cases where the index-only scan optimization is in use (which is not all of them). And even then, if bringing in the heap pages would have caused buffer evictions and the index-only scan avoids that, then you're trading contention for exclusive locks on the BufFreelistLock and buf mapping locks for shared-lock contention on the visibility map page. Furthermore, the latter will (except in very large relations) only need to be shared-locked and pinned once per scan, whereas you might easily have a case where each index probe forces replacement of a heap page. What I am slightly worried about is that our machinery for taking and releasing buffer pins is going to become a scalability bottleneck at some point, and certainly very hot pages like root index pages and visibility map pages could become hot-spots for such contention. But the experiments I've done so far suggest that there are other more serious bottlenecks that have to be addressed first. 
If it does rise to the list, we'll find a way to fix it, but I think skipping the index-only scan optimization is going to be a cure worse than the disease. > c) Your statement : "The contention on the VM buffer would be just as bad > whether you hold the heap page lock at the same time or not." > I am talking about the contention time frame of the heap page. It will be > increased Consider my statement in conjunction with the scenario 2. Sure, but here again, what is your evidence that this actually matters? It's not increasing by very much. > d) In addition, currently there is no WAL Logging, while the bit is cleared, > which would not be the case in future and hence the exclusive lock held on > the visibility map is going to be held for a longer time. This is false and has been false since the visibility map was first implemented. > You might definitely see some performance improvement, if you are testing > this in anything less than 4 cores. Bump up the core count and processor > count and check whether this affects the load-throughput curve. I'm fairly certain I already did that experiment, on a 32-core machine, but since the patch you're worried about went in two months ago, I'm a little fuzzy on the details. Maybe you should test it out yourself.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > by your argument, if WALInserLock is held for 't' seconds, you should > definitely be holding visibility map lock for more than time frame 't'. Nope, that's not how it works. Perhaps you should read the code. See, e.g., heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 20, 2011 at 5:06 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > By your argument, we can say that no-one will create a index with a function > like random(), time(), date(), broken operators etc. Its hard to imagine a > index in which a a user will only want to insert and never select on it. The whole point of this optimization is to make index access cheaper, not more expensive. You seem convinced it's done the opposite, but you haven't offered much evidence (or any test results) to support that proposition. > Even C++ provides std::map infrastructure for objects, where the user owns > the responsibility of writing the operator< correctly. Other databases do > the same. Even going one step ahead, we are already going back to such > indexes, if there is unique constraint/ referential integrity constraints. > But with all such caveats, it was decided we should not allow covering > indexes, as they require going back to the index for updates/deletes. What we decided NOT to do is put xmin/xmax/cid into the index tuple, for precisely the reasons you mention. That would be catastrophic both for write operations to the table, and for the size of the index. This approach is appealing precisely because a single visibility map page can cover a very large chunk of the heap. Is there a potential problem buried somewhere in there? Maybe. But if there is, I'm pretty sure it's something far less obvious than what you seem to be worried about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I think that you have switched gears here and are in this paragraph
talking about the 8.4-era visibility map changes rather than my recent
work. There is zero evidence that those changes caused any
performance problem. I've spent a large chunk of the last four months
looking for scalability problems and I haven't come across any
evidence that this is an issue. If you think it is, let's have a test
case.
Consider the TPC-C benchmark. Currently, on a four-core box, Oracle does 290,000 transactions per minute. Say we are doing around half of that, so a page should get filled quickly. If a vacuum runs while the tuples are not being touched by the MakePayment transaction, the page will be marked as all-visible. When the MakePayment transaction then updates, it has to clear that bit. If we test it out with too few warehouses, we may not see the effect. Of course, the PGPROC infrastructure for generating transaction ids might be the bigger culprit, but if you ignore that, this should be visible.
Maybe. Of course, we're only talking about cases where the index-only scan optimization is in use (which is not all of them).
But are you saying that the optimization of looking at the visibility map will happen only for index-only scans, and that normal index scans will not use the visibility map infrastructure? That's definitely a good idea and an improvement.
>> d) In addition, currently there is no WAL Logging, while the bit is cleared,
>> which would not be the case in future and hence the exclusive lock held on
>> the visibility map is going to be held for a longer time.
> This is false and has been false since the visibility map was first implemented.
I can't understand this. If you are not doing this, then it would cause consistency issues. Are you saying we have a crash-safe visibility map, but you don't follow the "log the change before changing the contents" WAL principle? If so, please explain in detail. If you are doing it the normal way, then you should be logging the changes before making the changes to the buffer, and during that time frame you should be holding the lock on the buffer. Heikki specifically pointed out that you have brought the WAL logging of the visibility map within the critical section.
Heikki's comments, I am quoting:
"I believe Robert's changes to make the visibility map crash-safe covers that. Clearing the bit in the visibility map now happens within the same critical section as clearing the flag on the heap page and writing the WAL record."
I would be able to respond to your other statements, once we are clear here.
Thanks,
Gokul.
OK. I took a look at the patch you have supplied in http://postgresql.1045698.n5.nabble.com/crash-safe-visibility-map-take-five-td4377235.html
There is a code like this.
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ &vmbuffer);
}
Here, you are not making an entry into the WAL; there is an assumption that the two bits will stay in sync without any WAL entry. There is a chance that the visibility map might be affected by partial page writes, where the clearing of a particular bit might not have made it to disk. And I am thinking of a lot of such issues. Can you explain the background logic behind departing from the principle of WAL logging? What are the implemented principles that protect the visibility map pages?
Thanks,
Gokul.
> By your argument, we can say that no-one will create a index with a function
> like random(), time(), date(), broken operators etc. Its hard to imagine a
> index in which a a user will only want to insert and never select on it.

The whole point of this optimization is to make index access cheaper, not more expensive. You seem convinced it's done the opposite, but you haven't offered much evidence (or any test results) to support that proposition.
I hope you are referring to thick indexes / covering indexes / indexes with snapshot. Why do you think it's done the opposite? In fact, the other databases like Oracle, SQL Server, and Sybase all implement indexes with snapshot (that's the only kind they support). It makes the index tuple larger by 8 bytes, but avoids the heap fetch. I think I ran a couple of benchmarks when I submitted the patch and showed the improvement. The trade-off in that case was simple: size of the index vs. avoiding a disk I/O. The user still had the choice of creating indexes without snapshot (it was provided as an optional kind of index).
What we decided NOT to do is put xmin/xmax/cid into the index tuple,
for precisely the reasons you mention. That would be catastrophic
both for write operations to the table, and for the size of the index.
Why would it be catastrophic for write operations on the table? Please explain.
The trade-off in that case was simple. Size of the index Vs avoiding a disk I/O. There was no catastrophic damage on the size of the index, as far as i can see.
I made this point because Heikki pointed out that since no one is complaining about some performance problem, we can assume it doesn't exist. But the thick index proposal was shot down on the grounds that someone might create an index on a function like random(), time(), date(), or with broken operators, effectively meaning that you can insert into the index but cannot select back from it. We are already doing unique checks and referential integrity checks on that kind of index (which would all be wrong), but it was still decided we should not work in that area, to help the user not make the mistake of creating such indexes. So we should apply the same principle for decision-making here also.
Gokul.
On 21.08.2011 07:10, Gokulakannan Somasundaram wrote: >>> d) In addition, currently there is no WAL Logging, while the bit is > cleared, >>> which would not be the case in future and hence the exclusive lock held > on >>> the visibility map is going to be held for a longer time. > >> This is false and has been false since the visibility map was first > implemented. > > I can't understand this. If you are not doing this, then it would cause > consistency issues. Are you saying, we have a crash safe visibility map, but > you don't follow "log the change before changing the contents"/ WAL > principle. If so, please explain in detail. If you are doing it in the > normal way, then you should be logging the changes before making the changes > to the buffer and during that timeframe, you should be holding the lock on > the buffer. Heikki specifically pointed out, that you have brought in the > WAL Logging of visibility map, within the critical section. I think you two are talking slightly past each other. There is no extra WAL record written when a bit is cleared in the visibility map, there is just a flag in the WAL record of the heap insert/update/delete. That is what Robert was trying to explain, that part hasn't changed since 8.4. What *did* change, however, in master, when the visibility map was made crash-safe, is the duration the lock on the visibility map page is held. Before that, the visibility map page was locked only briefly *after* the changes to the heap page were already applied and WAL record written. Now, the VM page lock is acquired and released at the same time as the lock on the heap page. It's held while the heap page changes are made and WAL record is written. I believe that's what Gokulakannan was trying to point out, and is worried that you might get contention on the VM page lock now because it's held for a much longer duration. 
Gokulakannan, if you could come up with a test case that demonstrates that contention (or the lack thereof), that would be good. Otherwise we're just speculating. If it's an issue, perhaps we could release the VM page lock early. We're not updating the LSN on it, so we don't need to wait for the WAL record to be written, I think. It's a bit out of the ordinary, though, so I wouldn't like to do that without an actual test case that shows it's an issue. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 21.08.2011 07:41, Gokulakannan Somasundaram wrote: > On Sat, Aug 20, 2011 at 4:57 AM, Gokulakannan Somasundaram > >> <gokul007@gmail.com> wrote: >>> by your argument, if WALInserLock is held for 't' seconds, you should >>> definitely be holding visibility map lock for more than time frame 't'. >> >> Nope, that's not how it works. Perhaps you should read the code. >> See, e.g., heap_update(). >> >> -- >> > OK. I took a look at the patch you have supplied in > http://postgresql.1045698.n5.nabble.com/crash-safe-visibility-map-take-five-td4377235.html Here's the patch as it was committed: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=503c7305a1e379f95649eef1a694d0c1dbdc674a > There is a code like this. > > { > all_visible_cleared = true; > PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, > + ItemPointerGetBlockNumber(&(heaptup->t_self)), > +&vmbuffer); > } > > Here, you are not making an entry into the WAL. then there is an assumption > that the two bits will be in sync without any WAL entry. There is a chance > that the visibility map might be affected by partial page writes, where > clearing of a particular bit might not have been changed. And i am thinking > a lot of such issues. Can you just explain the background logic behind > ignoring the principle of WAL logging? What are the implemented principles, > that protect the Visibility map pages?? The all_visible_cleared flag is included in the WAL record of the insert (or update or delete). Partial page writes are not a problem, because we always fetch the VM page and clear the bit, regardless of the LSN on the VM page. PS. Robert, the LOCKING section in the header comment of visibilitymap.c is out-of-date: it claims that the VM bit is cleared after releasing the lock on the heap page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
>> The all_visible_cleared flag is included in the WAL record of the insert (or update or delete). Partial page writes are not a problem, because we
>> always fetch the VM page and clear the bit, regardless of the LSN on the VM page.

Two things:
a) First, my understanding of a checkpoint is that it flushes all the pages that got changed below a particular LSN. If we don't have an LSN in the visibility map, how will we be sure whether a visibility map page is getting flushed or not? With the following approach, I can see only one issue: if the heap page gets written and checkpointed but the visibility map doesn't get synced during the checkpoint, then there is a problem. Can you please explain how we get that assurance?
b) Even if we have a contention issue, the visibility map is a solution for a considerable number of database scenarios. But it should not become part of the default package. A table with no chance of index-only scans should not incur the extra overhead of crash-safe visibility maps. Those tables should be spared this extra overhead, as they don't have index-only scans.

Gokul.
a) First, my understanding of checkpoint is that it flushes all the pages, that got changed below a particular LSN. If we are not having a LSN in the visibility map, how we will be sure, that a visibility map page is getting flushed/not?
Please ignore this statement. I confused the checkpoints implemented in Postgres with those of some other database.
Gokul.
On Sun, Aug 21, 2011 at 12:10 AM, Gokulakannan Somasundaram <gokul007@gmail.com> wrote: > Consider the TPC-C benchmark. Currently on a four core box, Oracle does > 290000 transactions per minute. Say we are doing something around half of > this. So a page should get quickly filled. If a vacuum runs and the > transactions are not touched by the MakePayment transaction, then it will be > marked as AllVisible. When the MakePayment transaction updates, it should > clear that bit. If we test it out, with too little warehouses, we may not > see the effect. Of Course the PGPROC infrastructure for generating > transaction ids might be the biggest culprit, but if you ignore that this > should be visible. Actual testing does not appear to show a bottleneck in either of these areas. >> Maybe. Of course, we're only talking about cases where the index-only >> scan optimization is in use (which is not all of them). > > But are you saying that, the optimization of looking at visibility map will > happen only for Index-only scans and will not use the visibility map > infrastructure for the normal index scans? That's definitely a good idea and > improvement. Well, yes. And not only that, but the index-only scans patch has been posted on this very mailing list, along with detailed submission notes. So you could go read, or try it, before jumping to the conclusion that it doesn't work. >>> d) In addition, currently there is no WAL Logging, while the bit is >>> cleared, >>> which would not be the case in future and hence the exclusive lock held >>> on >>> the visibility map is going to be held for a longer time. > >> This is false and has been false since the visibility map was first >> implemented. > > I can't understand this. If you are not doing this, then it would cause > consistency issues. Are you saying, we have a crash safe visibility map, but > you don't follow "log the change before changing the contents"/ WAL > principle. If so, please explain in detail. 
If you are doing it in the > normal way, then you should be logging the changes before making the changes > to the buffer and during that timeframe, you should be holding the lock on > the buffer. Heikki specifically pointed out, that you have brought in the > WAL Logging of visibility map, within the critical section. > > Heikki's comments, i am quoting: > "I believe Robert's changes to make the visibility map crash-safe covers > that. Clearing the bit in the visibility map now happens within the same > critical section as clearing the flag on the heap page and writing the WAL > record." > > I would be able to respond to your other statements, once we are clear here. There are extensive comments on this in visibilitymap.c and, in heapam.c, heap_xlog_visible(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
There are extensive comments on this in visibilitymap.c and, in
heapam.c, heap_xlog_visible().
I went through the design again and again. I am convinced that this should not have any functional bugs and should not cause many performance issues.
Nice thinking on bypassing the WAL logging.
Gokul.
On Sun, Aug 21, 2011 at 3:13 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > PS. Robert, the LOCKING section in the header comment of visibilitymap.c is > out-of-date: it claims that the VM bit is cleared after releasing the lock > on the heap page. Fixed, along with your other observation a couple of emails upthread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company