Thread: Do we need so many hint bits?
Related to discussion here: http://archives.postgresql.org/message-id/CAHyXU0zn5emePLedoZUGrAQiF92F-YjvFr-P5vUh6n0WpKZ6PQ@mail.gmail.com It occurred to me recently that many of the hint bits aren't terribly important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly has a purpose, and we'd expect it to be used many times following the initial CLOG lookup. But the other tuple hint bits seem to be there just for symmetry, because they shouldn't last long. If HEAP_XMIN_INVALID or HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it should just be changed to InvalidTransactionId. Removing those 3 hints would give us 3 more flag bits (eventually, after we are sure they aren't just leftover), and it would also reduce the chance that a page is dirtied for no other reason than to set them. It might even take a few cycles out of the tqual.c routines, or at least reduce the code size. Not a huge win, but I don't see much downside either. Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced in the visibility map patch, apparently as a way to know when to clear the VM bit when doing an update. It was then also used for scans, which showed a significant speedup. But I wonder: why not just use the visibilitymap directly from those places? It can be used for the scan because it is crash safe now (not possible before). And since it's only one lookup per scanned page, then I don't think it would be a measurable performance loss there. Inserts/updates/deletes also do a significant amount of work, so again, I doubt it's a big drop in performance there -- maybe under a lot of concurrency or something. The benefit of removing PD_ALL_VISIBLE would be significantly higher. It's quite common to load a lot of data, and then do some reads for a while (setting hint bits and flushing them to disk), and then do a VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the pages again. Also, if I remember correctly, Robert went to significant effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM bits consistent. Maybe this was all discussed before? All of these hint bits will have a bit more of a performance impact after checksums are introduced (for those that use them in conjunction with large data loads), so I'm looking for some simple ways to mitigate those effects. What kind of worst-case tests could I construct to see if there are worrying performance effects to removing these hint bits? Regards,Jeff Davis
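For reference, the four tuple hint bits under discussion are t_infomask flags; as defined in the htup.h of this era they look like this:

#define HEAP_XMIN_COMMITTED   0x0100    /* t_xmin committed */
#define HEAP_XMIN_INVALID     0x0200    /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED   0x0400    /* t_xmax committed */
#define HEAP_XMAX_INVALID     0x0800    /* t_xmax invalid/aborted */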
On 15 November 2012 19:42, Jeff Davis <pgsql@j-davis.com> wrote: > many of the hint bits aren't terribly important The truth is that nobody knows because there is no way of knowing. A good way forwards would be to collect detailed stats by table. We can then run various loads and see if any of this is worthwhile. I'm sure we'd learn enough to understand how to optimise things. track_hints? track_buffers? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 15 November 2012 19:42, Jeff Davis <pgsql@j-davis.com> wrote: >> many of the hint bits aren't terribly important > The truth is that nobody knows because there is no way of knowing. We had a discussion awhile back in which the idea of *no* hint bits was advocated, and someone (I think Robert) did some preliminary tests that pretty much shot it down. However, I don't recall anyone suggesting before that the four existing bits might not all be equally worthwhile. It's worth looking into. The hard part is probably agreeing on the test case or cases to measure behavior for. regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > It occurred to me recently that many of the hint bits aren't terribly > important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly > has a purpose, and we'd expect it to be used many times following the > initial CLOG lookup. Right. > But the other tuple hint bits seem to be there just for symmetry, > because they shouldn't last long. If HEAP_XMIN_INVALID or > HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed > soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it > should just be changed to InvalidTransactionId. Hm. It is not cheaper to change xmax to 0 than it is to set the hint bit --- you still need a write, and there are also added locking and atomicity worries --- so I'm not convinced by your argument there. But you might be right that the expected number of wins from the other two bits is a lot less. > Removing those 3 hints would give us 3 more flag bits (eventually, after > we are sure they aren't just leftover), and it would also reduce the > chance that a page is dirtied for no other reason than to set them. We aren't pressed for flag bits particularly. I think the main attraction of this idea is precisely to reduce unnecessary page dirties, and so that leads me to suggest a variant: keep the four bits defined as now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the page is already dirty. This would make it a straight-up trade of more clog consultation for fewer page dirties. > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > in the visibility map patch, apparently as a way to know when to clear > the VM bit when doing an update. It was then also used for scans, which > showed a significant speedup. But I wonder: why not just use the > visibilitymap directly from those places? Added contention for access to the visibility map pages would be a problem. regards, tom lane
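A minimal sketch of what that variant could look like (illustrative only; BufferIsDirtyAlready() is an assumed helper that does not exist today, and the real hint-setting logic lives in the tqual.c visibility routines):

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"

static void
hint_set_if_cheap(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
{
    /* the "low value" hints: only record them when the page already needs a write */
    bool        low_value = (infomask & (HEAP_XMIN_INVALID | HEAP_XMAX_COMMITTED)) != 0;

    if (low_value && !BufferIsDirtyAlready(buffer))   /* assumed helper */
        return;                 /* skip; the next visitor consults clog again */

    tuple->t_infomask |= infomask;
    SetBufferCommitInfoNeedsSave(buffer);
}

The trade is exactly as described: XMIN_COMMITTED still gets recorded eagerly, while the other hints never dirty an otherwise-clean page.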
On 15 November 2012 22:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On 15 November 2012 19:42, Jeff Davis <pgsql@j-davis.com> wrote: >>> many of the hint bits aren't terribly important > >> The truth is that nobody knows because there is no way of knowing. > > We had a discussion awhile back in which the idea of *no* hint bits > was advocated, and someone (I think Robert) did some preliminary > tests that pretty much shot it down. However, I don't recall > anyone suggesting before that the four existing bits might not all > be equally worthwhile. It's worth looking into. Isn't that what I said? In case of doubt, yes, it's a great idea and worth looking into. The question is *how* we look into it. > The hard part is > probably agreeing on the test case or cases to measure behavior for. I think that's impossible. There are just too many possible cases. Measuring top-level performance without measuring low-level stats just means we can't tell the difference between a test that didn't exercise the code and a test where there was no difference. We need detailed stats that allow many people to make their own tests and to report on what they find. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 16, 2012 at 8:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> Removing those 3 hints would give us 3 more flag bits (eventually, after
>> we are sure they aren't just leftover), and it would also reduce the
>> chance that a page is dirtied for no other reason than to set them.
>
> We aren't pressed for flag bits particularly. I think the main
> attraction of this idea is precisely to reduce unnecessary page dirties,
> and so that leads me to suggest a variant: keep the four bits defined as
> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
> page is already dirty.
Another approach could be to set those additional bits, but don't dirty the page. So if the page is already dirty or gets dirty later on before its eviction, those hint bits will get recorded. We can also try some other variants like: set the bits and dirty the page if the XID is beyond the horizon that CLOG buffers can track. A very old XID will most likely cause a CLOG page read later on.
I forgot if we have a way to differentiate between critical and non-critical dirtiness of a page. If we don't, that might be another worthwhile thing to try and test. Setting hint bits is a non-critical operation. So such a dirty page may be discarded if the system is under pressure.
Well, all of the above may have been discussed and rejected in the past. What we need is a thorough benchmarking to know what is better in which situation.
Thanks,
Pavan
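A rough sketch of the second variant described above (set the hint, but only force the page dirty when the xid is old enough that a later clog lookup would likely miss the clog buffers); ClogBufferHorizonXid() is a made-up helper used purely for illustration:

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/bufmgr.h"

static void
hint_set_lazily(HeapTupleHeader tuple, Buffer buffer, uint16 infomask,
                TransactionId xid)
{
    tuple->t_infomask |= infomask;      /* always remember the hint in memory */

    /* ClogBufferHorizonXid(): hypothetical oldest xid whose clog page is
     * still likely to be cached */
    if (TransactionIdPrecedes(xid, ClogBufferHorizonXid()))
        SetBufferCommitInfoNeedsSave(buffer);   /* old xid: worth forcing a write */

    /* otherwise leave the page clean; the hint survives only if the page is
     * dirtied for some other reason before eviction */
}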
On 15 November 2012 22:21, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Removing those 3 hints would give us 3 more flag bits (eventually, after >> we are sure they aren't just leftover), and it would also reduce the >> chance that a page is dirtied for no other reason than to set them. > > We aren't pressed for flag bits particularly. I think the main > attraction of this idea is precisely to reduce unnecessary page dirties, > and so that leads me to suggest a variant: keep the four bits defined as > now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the > page is already dirty. This would make it a straight-up trade of more > clog consultation for fewer page dirties. Hmm, I thought Alvaro wanted an extra flag bit for foreign key locks but couldn't find it. Come to think of it, why do we have XMIN_INVALID and XMAX_INVALID? We never need both at the same time, so we can just use one... XMIN_INVALID -> XACT_INVALID XMAX_INVALID == XMIN_COMMITTED | XACT_INVALID -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2012-11-15 at 22:21 -0500, Tom Lane wrote: > We aren't pressed for flag bits particularly. I think the main > attraction of this idea is precisely to reduce unnecessary page dirties, > and so that leads me to suggest a variant: keep the four bits defined as > now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the > page is already dirty. This would make it a straight-up trade of more > clog consultation for fewer page dirties. Sounds reasonable. > > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > > in the visibility map patch, apparently as a way to know when to clear > > the VM bit when doing an update. It was then also used for scans, which > > showed a significant speedup. But I wonder: why not just use the > > visibilitymap directly from those places? > > Added contention for access to the visibility map pages would be a > problem. Can you think of a reasonable (or worst-case) test scenario that would illustrate this problem? Would I need a many-core box to actually test this? Also, what locks do you expect to be contended? The buffer header lock while pinning the VM page? Regards,Jeff Davis
On 15 November 2012 23:12, Simon Riggs <simon@2ndquadrant.com> wrote: > We need detailed stats that allow many people to make their own tests > and to report on what they find. To work out how to proceed we need * wait event times on clog accesses at table-level * numbers of blocks dirtied by hint bit settings at table-level That way we can see which tables need to have hint bits set and which don't. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> It occurred to me recently that many of the hint bits aren't terribly >> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly >> has a purpose, and we'd expect it to be used many times following the >> initial CLOG lookup. > > Right. > >> But the other tuple hint bits seem to be there just for symmetry, >> because they shouldn't last long. If HEAP_XMIN_INVALID or >> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed >> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it >> should just be changed to InvalidTransactionId. > > Hm. It is not cheaper to change xmax to 0 than it is to set the hint > bit --- you still need a write, and there are also added locking and > atomicity worries --- so I'm not convinced by your argument there. > But you might be right that the expected number of wins from the other > two bits is a lot less. Is that true in a post checksum world though? Given that we are logging changes can we relax atomicity expectations? IIRC xmin/xmax are aligned, how come you can't just set InvalidTransactionId for INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do this now? merlin
On 2012-11-16 08:43:12 -0600, Merlin Moncure wrote: > On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Jeff Davis <pgsql@j-davis.com> writes: > >> It occurred to me recently that many of the hint bits aren't terribly > >> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly > >> has a purpose, and we'd expect it to be used many times following the > >> initial CLOG lookup. > > > > Right. > > > >> But the other tuple hint bits seem to be there just for symmetry, > >> because they shouldn't last long. If HEAP_XMIN_INVALID or > >> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed > >> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it > >> should just be changed to InvalidTransactionId. > > > > Hm. It is not cheaper to change xmax to 0 than it is to set the hint > > bit --- you still need a write, and there are also added locking and > > atomicity worries --- so I'm not convinced by your argument there. > > But you might be right that the expected number of wins from the other > > two bits is a lot less. > > Is that true in a post checksum world though? Given that we are > logging changes can we relax atomicity expectations? IIRC xmin/xmax > are aligned, how come you can't just set InvalidTransactionId for > INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do > this now? Uhm. The latter doesn't really work if you have any transactions that might not see that row or am I missing something? Greetings, Andres Freund
On Fri, Nov 16, 2012 at 8:58 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2012-11-16 08:43:12 -0600, Merlin Moncure wrote: >> On Thu, Nov 15, 2012 at 9:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Jeff Davis <pgsql@j-davis.com> writes: >> >> It occurred to me recently that many of the hint bits aren't terribly >> >> important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly >> >> has a purpose, and we'd expect it to be used many times following the >> >> initial CLOG lookup. >> > >> > Right. >> > >> >> But the other tuple hint bits seem to be there just for symmetry, >> >> because they shouldn't last long. If HEAP_XMIN_INVALID or >> >> HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed >> >> soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it >> >> should just be changed to InvalidTransactionId. >> > >> > Hm. It is not cheaper to change xmax to 0 than it is to set the hint >> > bit --- you still need a write, and there are also added locking and >> > atomicity worries --- so I'm not convinced by your argument there. >> > But you might be right that the expected number of wins from the other >> > two bits is a lot less. >> >> Is that true in a post checksum world though? Given that we are >> logging changes can we relax atomicity expectations? IIRC xmin/xmax >> are aligned, how come you can't just set InvalidTransactionId for >> INVALID and 'FrozenTransactionId' for COMMITTED? Why can't you do >> this now? > > Uhm. The latter doesn't really work if you have any transactions that > might not see that row or am I missing something? nope. you're right. merlin
On Thu, Nov 15, 2012 at 10:14 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > Another approach could be to set those additional bits, but don't dirty the > page. So if the page is already dirty or gets dirty later on before its > eviction, those hint bits will get recorded. We can also try some other > variants like: set the bits and dirty the page if the XID is beyond the > horizon that CLOG buffers can track. A very old XID will most likely cause a > CLOG page read later on. > > I forgot if we have a way to differentiate between critical and non-critical > dirtiness of a page. If we don't, that might be another worthwhile thing to > try and test. Setting hint bits is a non-critical operation. So such a dirty > page may be discarded if the system is under pressure. > > Well, all of the above may have been discussed and rejected in the past. > What we need is a thorough benchmarking to know what is better in which > situation. Yeah: all previous attempts to manage hint bits through trade-offs have failed because of general nervousness about who loses. The unlucky guy who has high transaction rates and suddenly gets a big increase in clog activity is in for a world of hurt. Also, AIUI 'set but don't dirty' strategies are less effective if/when checksums are implemented as currently proposed. Previously hint bit mitigation was all about saving some i/o. Now, everything is getting rammed through the WAL lock. merlin
On 2012-11-15 16:42:57 -0800, Jeff Davis wrote: > Related to discussion here: > http://archives.postgresql.org/message-id/CAHyXU0zn5emePLedoZUGrAQiF92F-YjvFr-P5vUh6n0WpKZ6PQ@mail.gmail.com > > It occurred to me recently that many of the hint bits aren't terribly > important (at least it's not obvious to me). HEAP_XMIN_COMMITTED clearly > has a purpose, and we'd expect it to be used many times following the > initial CLOG lookup. > But the other tuple hint bits seem to be there just for symmetry, > because they shouldn't last long. If HEAP_XMIN_INVALID or > HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed > soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it > should just be changed to InvalidTransactionId. Wrt HEAP_XMAX_COMMITTED: It can take an *awfully* long time till autovacuum crosses the thresholds the next time for a big table. I also think we cannot dismiss the case of longrunning transactions because vacuum won't be able to cleanup those rows in that case. Wrt HEAP_(XMIN|XMAX)_INVALID: yes, if we are in need of new flag bits those sound like a good target to me. > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > in the visibility map patch, apparently as a way to know when to clear > the VM bit when doing an update. It was then also used for scans, which > showed a significant speedup. But I wonder: why not just use the > visibilitymap directly from those places? It can be used for the scan > because it is crash safe now (not possible before). And since it's only > one lookup per scanned page, then I don't think it would be a measurable > performance loss there. Inserts/updates/deletes also do a significant > amount of work, so again, I doubt it's a big drop in performance there > -- maybe under a lot of concurrency or something. > > The benefit of removing PD_ALL_VISIBLE would be significantly higher. > It's quite common to load a lot of data, and then do some reads for a > while (setting hint bits and flushing them to disk), and then do a > VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the > pages again. Also, if I remember correctly, Robert went to significant > effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM > bits consistent. Maybe this was all discussed before? As far as I understand the code the crash-safety aspects of the visibilitymap currently rely on on having the knowledge that ALL_VISIBLE has been cleared during a heap_(insert|update|delete). That allows management of the visibilitymap without it being xlogged itself which seems pretty important to me. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 15, 2012 at 7:42 PM, Jeff Davis <pgsql@j-davis.com> wrote: > But the other tuple hint bits seem to be there just for symmetry, > because they shouldn't last long. If HEAP_XMIN_INVALID or > HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed > soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it > should just be changed to InvalidTransactionId. "Soon" is a relative term. I doubt that we can really rely on vacuum to be timely enough to avoid pain here - you can easily have tens of thousands of hits on the same tuple before vacuum gets around to dealing with it. Now, we might be able to rejigger things to avoid that. For example, maybe it'd be possible to arrange things so that when we see an invalid xmin, we set the flag that triggers a HOT prune instead of setting the hint bit. That would probably be good enough to dispense with the hint bit, and maybe altogether better than the current system, because now the next time someone (including us) locks the buffer we'll nuke the entire tuple, which would not only make it cheaper to scan but also free up space in the buffer sooner. However, that solution only works for invalid-xmin. For committed-xmax, there could actually be quite a long time before the page can be pruned, because there can be some other backend holding an old snapshot open. A one-minute reporting query in another database, which is hardly an unreasonable scenario, could result in many, many additional CLOG lookups, which are already a major contention point at high concurrencies. I think that bit is probably pretty important, and I don't see a viable way to get rid of it, though maybe someone can think of one. For invalid-xmax, I agree that we could probably just change xmax to InvalidTransactionId, if we need to save bit-space. In the past Tom and I think also Alvaro have been skeptical about anything that would overwrite xmin/xmax values too quickly for forensic reasons, but maybe it's worth considering. > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > in the visibility map patch, apparently as a way to know when to clear > the VM bit when doing an update. It was then also used for scans, which > showed a significant speedup. But I wonder: why not just use the > visibilitymap directly from those places? Well, you'd have to look up, lock and pin the page to do that. I suspect that overhead is pretty significant. The benefit of noticing that the flag is set is that you need not call HeapTupleSatisfiesMVCC for each tuple on the page: checking one bit in the page header is a lot cheaper than calling that function for every tuple. However, if you had to lock and pin a second page in order to check whether the page is all-visible, I suspect it wouldn't be a win; you'd probably be better off just doing the HeapTupleSatisfiesMVCC checks for each tuple. One of the main advantages of PD_ALL_VISIBLE is that if you do an insert, update, or delete on a page where that bit isn't set, you need not lock and pin the visibility map page, because you already know that the bit will be clear in the visibility map. If the data is being rapidly modified, you'll get the benefit of this optimization most of the time, only losing it when vacuum has visited recently. I hope that's not premature optimization because I sure sweat a lot of blood last release cycle to keep it working like that.
I had a few doubts at the time about how much we were winning there, but I don't actually have any hard data either way, so I would be reluctant to assume it doesn't matter. Even if it doesn't, the sequential-scan optimization definitely matters a LOT, as you can easily verify. One approach that I've been hoping to pursue is to find a way to make CLOG lookups cheaper and more concurrent. I started to work on some concurrent hash table code, which you can find here: http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash The concurrency properties of this code are vastly better than what we have now, but there are cases where it loses vs. dynahash when there's no concurrency. That might be fixable or just not a big deal, though. A bigger problem is that I got sucked off into other things before I was able to get as far with it as I wanted to; in particular, I have only unit test results for it, and haven't tried to integrate it into the SLRU code yet. But I'm not sure any of this is going to fundamentally chip away at the need for hint bits all that much. Making CLOG lookups cheaper or less frequent is all to the good, but the prognosis for improving things enough that we can dump some or all of the hint bits completely seems uncertain at best. Even if we COULD dump everything but heap-xmin-committed, how much would that really help with the disk-write problem? I bet heap-xmin-committed gets set far more often than the other three put together. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
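To make the seqscan fast path concrete, here is a simplified sketch modeled loosely on heapgetpage() (not the verbatim source): a single PD_ALL_VISIBLE test on the page header replaces one tqual.c call per tuple.

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufpage.h"
#include "utils/tqual.h"

static int
count_visible_tuples(Page page, Buffer buffer, Snapshot snapshot)
{
    bool         all_visible = PageIsAllVisible(page) &&
                               !snapshot->takenDuringRecovery;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber lineoff;
    int          nvisible = 0;

    for (lineoff = FirstOffsetNumber; lineoff <= maxoff; lineoff++)
    {
        ItemId        lpp = PageGetItemId(page, lineoff);
        HeapTupleData loctup;

        if (!ItemIdIsNormal(lpp))
            continue;

        loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
        loctup.t_len = ItemIdGetLength(lpp);

        /* one header-bit test skips the per-tuple visibility call */
        if (all_visible ||
            HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer))
            nvisible++;
    }

    return nvisible;
}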
On Thu, 2012-11-15 at 22:21 -0500, Tom Lane wrote: > We aren't pressed for flag bits particularly. I think the main > attraction of this idea is precisely to reduce unnecessary page dirties, > and so that leads me to suggest a variant: keep the four bits defined as > now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the > page is already dirty. This would make it a straight-up trade of more > clog consultation for fewer page dirties. When I tried a simple test on my laptop, where I removed the XMIN_INVALID and XMAX_COMMITTED bits on a simple SELECT test (no aborts, no deletes, fits in shared_buffers) I saw a pretty consistent 2% speedup. Not earth-shattering, but it would be nice if we could figure out a way to do that. Not sure where that speedup came from, to be honest, it doesn't look like much difference in the work being done in the visibility check. Regards,Jeff Davis
On Fri, 2012-11-16 at 11:58 -0500, Robert Haas wrote: > > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > > in the visibility map patch, apparently as a way to know when to clear > > the VM bit when doing an update. It was then also used for scans, which > > showed a significant speedup. But I wonder: why not just use the > > visibilitymap directly from those places? > > Well, you'd have to look up, lock and pin the page to do that. I > suspect that overhead is pretty significant. The benefit of noticing > that the flag is set is that you need not call HeapTupleSatisfiesMVCC > for each tuple on the page: checking one bit in the page header is a > lot cheaper than calling that function for every tuple. However, if > you had to lock and pin a second page in order to check whether the > page is all-visible, I suspect it wouldn't be a win; you'd probably be > better off just doing the HeapTupleSatisfiesMVCC checks for each > tuple. That's pretty easy to test. Here's what I got on a 10M record table (Some runs got some strangely high numbers around 1700ms, which I assume is because it's difficult to keep the data in shared buffers, so I took the lower numbers.): PD_ALL_VISIBLE: 661ms VM Lookup: 667ms Neither: 740ms Even if pinning the vm buffer were slow, we could keep the pin longer during a scan (it seems like the VM API is designed for that kind of a use case), so I don't think scans are a problem at all, even if there is a lot of concurrency. > One of the main advantages of PD_ALL_VISIBLE is that if you do an > insert, update, or delete on a page where that bit isn't set, you need > not lock and pin the visibility map page, because you already know > that the bit will be clear in the visibility map. If the data is > being rapidly modified, you'll get the benefit of this optimization > most of the time, only losing it when vacuum has visited recently. I > hope that's not premature optimization because I sure sweat a lot of > blood last release cycle to keep it working like that. I had a few > doubts at the time about how much we were winning there, but I don't > actually have any hard data either way, so I would be reluctant to > assume it doesn't matter. This is a more plausible concern, because it's per-tuple rather than per-page. However, I think we might be able to save the current VM buffer across multiple calls of heap_insert|update|delete in nodeModifyTable. At first glance, it doesn't look very difficult. It's unfortunate that I didn't think much about this while you were working on it before; but difficult-to-write code is also difficult to maintain, so simplifying it would be nice. > Even if it doesn't, the sequential-scan optimization definitely > matters a LOT, as you can easily verify. I am in no way suggesting that we remove that optimization; I am suggesting that we can get the exact same benefit by looking at the VM directly. > One approach that I've been hoping to pursue is to find a way to make > CLOG lookups cheaper and more concurrent. I started to work on some > concurrent hash table code, which you can find here: > > http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash > > The concurrency properties of this code are vastly better than what we > have now, but there are cases where it loses vs. dynahash when there's > no concurrency. That might be fixable or just not a big deal, though. 
> A bigger problem is that I got sucked off into other things before I > was able to get as far with it as I wanted to; in particular, I have > only unit test results for it, and haven't tried to integrate it into > the SLRU code yet. Great work! > But I'm not sure any of this is going to fundamentally chip away at > the need for hint bits all that much. Making CLOG lookups cheaper or > less frequent is all to the good, but the prognosis for improving > things enough that we can dump some or all of the hint bits completely > seems uncertain at best. Even if we COULD dump everything but > heap-xmin-committed, how much would that really help with the > disk-write problem? I bet heap-xmin-committed gets set far more > often than the other three put together. Right now I'm setting my sights on PD_ALL_VISIBLE. I know that causes a lot of extra page dirties, and it seems like it doesn't serve all that much purpose after the VM became crash-safe. And it would simplify the code. If we can't get rid of the other hint bits, then so be it. > > But the other tuple hint bits seem to be there just for symmetry, > > because they shouldn't last long. If HEAP_XMIN_INVALID or > > HEAP_XMAX_COMMITTED is set, then it's (hopefully) going to be vacuumed > > soon, and gone completely. And if HEAP_XMAX_INVALID is set, then it > > should just be changed to InvalidTransactionId. > > However, that solution only works for invalid-xmin. For > committed-xmax, there could actually be quite a long time before the > page can be pruned, because there can be some other backend holding an > old snapshot open. I think I'll set aside this part of the idea for now, because it doesn't seem like a huge win after I thought about it, anyway. If HEAP_XMIN_INVALID or HEAP_XMAX_COMMITTED are set, it's likely to be dirty anyway, unless it's a bulk delete or something. Regards,Jeff Davis
On Thu, Nov 15, 2012 at 4:42 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > in the visibility map patch, apparently as a way to know when to clear > the VM bit when doing an update. It was then also used for scans, which > showed a significant speedup. But I wonder: why not just use the > visibilitymap directly from those places? It can be used for the scan > because it is crash safe now (not possible before). And since it's only > one lookup per scanned page, Wouldn't you need to lock two pages simultaneously, for each table page, in order to ensure that there are no races? For a full table scan, that might not be so good. > then I don't think it would be a measurable > performance loss there. Inserts/updates/deletes also do a significant > amount of work, so again, I doubt it's a big drop in performance there > -- maybe under a lot of concurrency or something. Your question prompts me to post something I had been wondering. Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence? Should the vm bit get cleared by a HOT update? Currently the vm is used only for non-freezing vacuuming and index-only scans, as far as I know. For the IOS, it doesn't care which tuple in the chain is visible, as long as exactly one of them is. So everything it cares about is still visible. And anyone can vacuum a block that has only had HOT updates, you don't need to be a dedicated vacuum worker to do that. I guess one limitation of this is that if the block was populated by large tuples which then systematically got made smaller by HOT updates, the free space map would never get updated until the next vacuum freeze kicked in. And obviously this would be incompatible with removing the PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to short-cut hint bit checks. Cheers, Jeff
On Fri, 2012-11-16 at 16:09 +0100, Andres Freund wrote: > As far as I understand the code the crash-safety aspects of the > visibilitymap currently rely on on having the knowledge that ALL_VISIBLE > has been cleared during a heap_(insert|update|delete). That allows > management of the visibilitymap without it being xlogged itself which > seems pretty important to me. It looks like the xlrec does contain a "cleared all visible" flag in it, and it uses that to clear the VM as well as PD_ALL_VISIBLE. So, it still seems achievable, and maybe not even a large patch, to do away with PD_ALL_VISIBLE without a lot of performance risk. Regards,Jeff Davis
On Fri, 2012-11-16 at 17:04 -0800, Jeff Janes wrote: > On Thu, Nov 15, 2012 at 4:42 PM, Jeff Davis <pgsql@j-davis.com> wrote: > > > > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced > > in the visibility map patch, apparently as a way to know when to clear > > the VM bit when doing an update. It was then also used for scans, which > > showed a significant speedup. But I wonder: why not just use the > > visibilitymap directly from those places? It can be used for the scan > > because it is crash safe now (not possible before). And since it's only > > one lookup per scanned page, > > Wouldn't you need to lock two pages simultaneously, for each table > page, in order to ensure that there are no races? > > For a full table scan, that might not be so good. Can you explain? The table scan share locks the heap page during the visibility check (which should suffice as a memory barrier) and then could test the VM bit with only a pin on the VM page. > Your question prompts me to post something I had been wondering. > Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence? > Should the vm bit get cleared by a HOT update? To clarify: are you saying that a hot update should clear the PD_ALL_VISIBLE bit, but not the VM bit? > And anyone can vacuum a block that has only had HOT updates, you don't > need to be dedicated vacuum worker to do that. > And obviously this would be incompatible with removing the > PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to > short-cut hint bit checks. I'm still a little unclear on what the benefit is. It sounds like a slightly different kind of hint, so maybe we should just treat it as a completely different thing after removing PD_ALL_VISIBLE. If it's related to HOT updates, then the page will probably be dirty anyway, so that removes my primary complaint about PD_ALL_VISIBLE. Regards,Jeff Davis
On 2012-11-16 17:19:23 -0800, Jeff Davis wrote: > On Fri, 2012-11-16 at 16:09 +0100, Andres Freund wrote: > > As far as I understand the code the crash-safety aspects of the > > visibilitymap currently rely on on having the knowledge that ALL_VISIBLE > > has been cleared during a heap_(insert|update|delete). That allows > > management of the visibilitymap without it being xlogged itself which > > seems pretty important to me. > > It looks like the xlrec does contain a "cleared all visible" flag in it, > and it uses that to clear the VM as well as PD_ALL_VISIBLE. I think the point is that to check whether the visibilitymap bit needs to be unset - and thus locked exclusively - no locks have to be acquired beyond those heap_* already holds. Given that in the large majority of cases ALL_VISIBLE does *not* need to be reset I think that this is a very important property for concurrency purposes. If you consider the large amount of pages that are covered by a single visibilitymap page we don't want more locking in that path... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 2012-11-17 at 14:24 +0100, Andres Freund wrote: > I think the point is that to check whether the visibilitymap bit needs > to be unset What's the problem with that? If you already have the VM buffer pinned (which should be possible if we keep the VM buffer in a longer-lived structure), then doing the test is almost as cheap as checking PD_ALL_VISIBLE, because you don't need any locks. So, the proposal is: 1. Keep the VM buffer around in a longer-lived structure for scans and nodeModifyTable. 2. Replace all tests of PD_ALL_VISIBLE with tests directly against the VM, hopefully using a buffer that we already have a pin on. I haven't really dug into this yet, but I don't see any obvious problem. Regards,Jeff Davis
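A sketch of what that proposal might look like (assumed structure, not existing code); visibilitymap_test() already takes a Buffer pointer it can reuse across calls, which is what would make a long-lived pin cheap:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

typedef struct VMCache
{
    Buffer      vmbuffer;       /* currently pinned VM page, or InvalidBuffer */
} VMCache;

static bool
page_all_visible(Relation rel, BlockNumber heapBlk, VMCache *cache)
{
    /* visibilitymap_test() re-pins internally only when heapBlk maps to a
     * different VM page than the one already held in cache->vmbuffer */
    return visibilitymap_test(rel, heapBlk, &cache->vmbuffer);
}

The scan or ModifyTable state would initialize vmbuffer to InvalidBuffer and ReleaseBuffer() it once at the end, so the common case is a bare bit test with no new pin or lock.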
Jeff Davis <pgsql@j-davis.com> writes: > What's the problem with that? If you already have the VM buffer pinned > (which should be possible if we keep the VM buffer in a longer-lived > structure), then doing the test is almost as cheap as checking > PD_ALL_VISIBLE, because you don't need any locks. Really? What about race conditions? Specifically, I think what you suggest is likely to be unreliable on machines with weak memory ordering. Consider possibility that someone else just changed the VM bit. Getting a lock ensures synchronization. (Yeah, it's possible that we could use some primitive cheaper than a lock ... but it's not going to be free.) regards, tom lane
On 16 November 2012 19:58, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2012-11-16 at 11:58 -0500, Robert Haas wrote: >> > Also, I am wondering about PD_ALL_VISIBLE. It was originally introduced >> > in the visibility map patch, apparently as a way to know when to clear >> > the VM bit when doing an update. It was then also used for scans, which >> > showed a significant speedup. But I wonder: why not just use the >> > visibilitymap directly from those places? >> >> Well, you'd have to look up, lock and pin the page to do that. I >> suspect that overhead is pretty significant. The benefit of noticing >> that the flag is set is that you need not call HeapTupleSatisfiesMVCC >> for each tuple on the page: checking one bit in the page header is a >> lot cheaper than calling that function for every tuple. However, if >> you had to lock and pin a second page in order to check whether the >> page is all-visible, I suspect it wouldn't be a win; you'd probably be >> better off just doing the HeapTupleSatisfiesMVCC checks for each >> tuple. > > That's pretty easy to test. Here's what I got on a 10M record table > (Some runs got some strangely high numbers around 1700ms, which I assume > is because it's difficult to keep the data in shared buffers, so I took > the lower numbers.): > > PD_ALL_VISIBLE: 661ms > VM Lookup: 667ms > Neither: 740ms > > Even if pinning the vm buffer were slow, we could keep the pin longer > during a scan (it seems like the VM API is designed for that kind of a > use case), so I don't think scans are a problem at all, even if there is > a lot of concurrency. The biggest problem with hint bits is SeqScans on a table that ends up dirtying many pages. Repeated checks against clog and hint bit setting are massive overheads for the user that hits that, plus it generates an unexpected surge of database writes. Even without checksums that is annoying. ISTM that we should tune that specifically by performing a VM lookup for the next 32 pages (or more), so we reduce the lookups well below 1 per page. That way the overhead of using the VM will be similar to using the PD_ALL_VISIBLE. Also, if we pass through a flag to HeapTupleSatisfies indicating we are not interested in setting hints on a SeqScan then we can skip individual tuple hints also. If the whole page becomes visible then we can set the VM. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
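For illustration, the batching could be as simple as remembering the VM's answers for the next run of blocks before scanning them, so the scan itself never goes back to the map; this is a hypothetical helper (the real patch would presumably fold it into the scan state), and the retained vmbuffer pin is reused for the whole batch:

#include "postgres.h"
#include "access/visibilitymap.h"

#define VM_BATCH 32

static void
remember_vm_bits(Relation rel, BlockNumber start, BlockNumber nblocks,
                 Buffer *vmbuffer, bool vm_all_visible[VM_BATCH])
{
    int     i;

    for (i = 0; i < VM_BATCH; i++)
    {
        BlockNumber blkno = start + i;

        /* bare bit test against the already-pinned *vmbuffer in the
         * common case; no per-heap-page pin or lock */
        vm_all_visible[i] = (blkno < nblocks) &&
            visibilitymap_test(rel, blkno, vmbuffer);
    }
}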
On Sat, 2012-11-17 at 16:53 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > What's the problem with that? If you already have the VM buffer pinned > > (which should be possible if we keep the VM buffer in a longer-lived > > structure), then doing the test is almost as cheap as checking > > PD_ALL_VISIBLE, because you don't need any locks. > > Really? What about race conditions? Specifically, I think what you > suggest is likely to be unreliable on machines with weak memory > ordering. Consider possibility that someone else just changed the VM > bit. Getting a lock ensures synchronization. (Yeah, it's possible that > we could use some primitive cheaper than a lock ... but it's not going > to be free.) There's already a similar precedent in IndexOnlyNext, which calls visibilitymap_test with no lock. I am not authoritative on these kinds of lockless accesses, but it looks like we can satisfy those memory barrier requirements in the places we need to. Here is my analysis: Process A (process that clears a VM bit for a data page): 1. Acquires exclusive lock on data buffer 2. Acquires exclusive lock on VM buffer 3. clears VM bit 4. Releases VM buffer lock 5. Releases data buffer lock Process B (process that tests the VM bit for the same data page): 1. Acquires shared lock (if it's a scan doing a visibility test) or an exclusive lock (if it's an I/U/D that wants to know whether to clear the bit or not) on the data buffer. 2. Tests bit using an already-pinned VM buffer. 3. Releases data buffer lock. Process A and B must be serialized, because A takes an exclusive lock on the data buffer and B takes at least a shared lock on the data buffer. The only dangerous case is when A happens right before B. So, the question is: are there enough memory barriers between A-3 and B-2? And I think the answer is yes. A-4 should act as a write barrier after clearing the bit, and B-1 should act as a read barrier before reading the bit. Let me know if there is a flaw with this analysis. If not, then I still agree with you that it's not as cheap as testing PD_ALL_VISIBLE, but I am skeptical that memory-ordering constraints we're imposing on the CPU are expensive enough to matter in these cases. If you have a test case in mind that might exercise this, then I will try to run it (although my workstation is only 4 cores, and the most I can probably get access to is 16 cores). Regards, Jeff Davis
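In code form, the reader side (process B) would look roughly like this; the heap buffer content lock is what supplies the read barrier before the unlocked VM bit test, and vmbuffer is assumed to be pinned already by the caller (in real use the caller would act on the result before releasing the lock):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

static bool
vm_bit_under_heap_lock(Relation rel, Buffer heapbuf, BlockNumber blkno,
                       Buffer *vmbuffer)
{
    bool    all_visible;

    LockBuffer(heapbuf, BUFFER_LOCK_SHARE);     /* B-1: serializes against the clearer */
    all_visible = visibilitymap_test(rel, blkno, vmbuffer);    /* B-2: no VM page lock */
    LockBuffer(heapbuf, BUFFER_LOCK_UNLOCK);    /* B-3 */

    return all_visible;
}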
On Sat, 2012-11-17 at 19:35 -0500, Simon Riggs wrote: > The biggest problem with hint bits is SeqScans on a table that ends up > dirtying many pages. Repeated checks against clog and hint bit setting > are massive overheads for the user that hits that, plus it generates > an unexpected surge of database writes. Even without checksums that is > annoying. Yeah. I am nowhere close to a general solution for that, but I am targeting the PD_ALL_VISIBLE hint for removal (which is one part of the problem), and I think I am close to an approach with no measurable downside. > ISTM that we should tune that specifically by performing a VM lookup > for next 32 pages (or more), so we reduce the lookups well below 1 per > page. That way the overhead of using the VM will be similar to using > the PD_ALL_VISIBLE. That's another potential way to mitigate the effects during a scan, but it does add a little complexity. Right now, it share locks a buffer, and uses an array with one element for each tuple in the page. If PD_ALL_VISIBLE is set, then it marks all of the tuples *currently present* on the page as visible in the array, and then releases the share lock. Then, when reading the page, if another tuple is added (because we released the share lock and only have a pin), it doesn't matter because it's already invisible according to the array. With this approach, we'd need to keep a larger array to represent many pages. And it sounds like we'd need to share lock the pages ahead, and find out which items are currently present, in order to properly fill in the array. Not quite sure what to do there, but would require some more thought. I'm inclined to avoid going down this path unless there is some performance reason to do so. We can keep a VM buffer pinned and do some lockless testing (similar to that in IndexOnlyNext; see my response to Tom), which will hopefully be fast enough that we don't need anything else. > Also, if we pass through a flag to > HeapTupleSateisfies indicating we are not interested in setting hints > on a SeqScan then we can skip individual tuple hints also. If the > whole page becomes visible then we can set the VM. Hmm, that's an idea. Maybe we shouldn't bother setting the hints if it's already all-visible in the VM? Something else to think about. Regards,Jeff Davis
On Fri, Nov 16, 2012 at 5:35 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2012-11-16 at 17:04 -0800, Jeff Janes wrote: > >> Your question prompts me to post something I had been wondering. >> Might it be worthwhile to break the PD_ALL_VISIBLE / vm equivalence? >> Should the vm bit get cleared by a HOT update? > > To clarify: are you saying that a hot update should clear the > PD_ALL_VISIBLE bit, but not the VM bit? Yes. > >> And anyone can vacuum a block that has only had HOT updates, you don't >> need to be dedicated vacuum worker to do that. > >> And obviously this would be incompatible with removing the >> PD_ALL_VISIBLE, unless we also wanted to eliminate the ability to >> short-cut hint bit checks. > > I'm still a little unclear on what the benefit is. The benefit would be that index only scans would be more likely to not need to visit the heap page, if it only had HOT updates done to it since the last time it was all-visible. Also some reduced vacuuming, but I don't know if that benefit would be beneficial. > It sounds like a slightly different kind of hint, so maybe we should > just treat it as a completely different thing after removing > PD_ALL_VISIBLE. If it's related to HOT updates, then the page will > probably be dirty anyway, so that removes my primary complaint about > PD_ALL_VISIBLE. Right, but if the HOT-only bit was on the page itself, it would no longer help out index-only-scans. Cheers, Jeff
On 17 November 2012 21:20, Jeff Davis <pgsql@j-davis.com> wrote: >> ISTM that we should tune that specifically by performing a VM lookup >> for the next 32 pages (or more), so we reduce the lookups well below 1 per >> page. That way the overhead of using the VM will be similar to using >> the PD_ALL_VISIBLE. > > That's another potential way to mitigate the effects during a scan, but > it does add a little complexity. Right now, it share locks a buffer, and > uses an array with one element for each tuple in the page. If > PD_ALL_VISIBLE is set, then it marks all of the tuples *currently > present* on the page as visible in the array, and then releases the > share lock. Then, when reading the page, if another tuple is added > (because we released the share lock and only have a pin), it doesn't > matter because it's already invisible according to the array. > > With this approach, we'd need to keep a larger array to represent many > pages. And it sounds like we'd need to share lock the pages ahead, and > find out which items are currently present, in order to properly fill in > the array. Not quite sure what to do there, but would require some more > thought. Hmm, that's too much and not really what I was thinking, but I concede that was a little vague. No need for bigger arrays etc. If we check the VM for the next N blocks, then we know that all completed transactions are committed. Yes, the VM can change, but that is not a problem. What I mean is that we keep an array of boolean[N] that simply tracks what the VM said last time we checked it. If that is true for a block then we do special processing, similar to the current all-visible path and yet different, described below. What we want is to do a HeapTupleVisibility check that does not rely on tuple hints AND yet avoids all clog access. So when we scan a buffer in page mode and we know the VM said it was all visible we still check each tuple's visibility. If xid is below snapshot xmin then the xid is known committed and the tuple is visible to this scan (not necessarily all scans). We know this because the VM said this page was all-visible AFTER our snapshot was taken. If tuple xid is within snapshot or greater than snapshot xmax then the tuple is invisible to our snapshot and we don't need to check clog. So once we know the VM said the page was all visible we do not need to check clog to establish visibility, we only need to check the tuple xmin against our snapshot xmin. So the VM can change under us and it doesn't matter. We don't need a pin or lock on the VM, we just read it and let go. No race conditions, no fuss. The difference here is that we still need to check visibility of each tuple, but that can be a very cheap check and never involves clog, nor does it dirty the page. Tuple access is reasonably expensive in comparison with a clog-less check on tuple xmin against snapshot xmin, so the extra work is negligible. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
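A compact sketch of the check being described (a hypothetical helper, not the attached patch): once the VM has reported the page all-visible after our snapshot was taken, comparing the tuple's xmin against the snapshot's xmin is the whole test -- no clog access and no hint bits.

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "utils/snapshot.h"

static bool
visible_on_vm_reported_page(HeapTupleHeader tup, Snapshot snapshot)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(tup);

    if (TransactionIdPrecedes(xmin, snapshot->xmin))
        return true;    /* the VM bit, read after our snapshot, vouches it committed */

    /* xmin is within or after our snapshot: invisible to this scan,
     * again without touching clog or hint bits */
    return false;
}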
On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote: > Process A (process that clears a VM bit for a data page): > 1. Acquires exclusive lock on data buffer > 2. Acquires exclusive lock on VM buffer > 3. clears VM bit > 4. Releases VM buffer lock > 5. Releases data buffer lock Well, but right this is a rather big difference. If vm pages get unconditionally locked all the time we will have a huge source of new contention as they are shared between so many heap pages. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 18 November 2012 08:52, Simon Riggs <simon@2ndquadrant.com> wrote: > The difference here is that we still need to check visibility of each > tuple, but that can be a very cheap check and never involves clog, nor > does it dirty the page. Tuple access is reasonably expensive in > comparison with a clog-less check on tuple xmin against snapshot xmin, > so the extra work is negligible. The attached *partial* patch shows how the tuple checks would work. This should fit in neatly with the vismap skip code you've got already. Looks to me possible to skip the all-vis hint completely, as you suggest, plus avoid repeated checks of VM or clog. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, 2012-11-18 at 15:19 +0100, Andres Freund wrote: > On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote: > > Process A (process that clears a VM bit for a data page): > > 1. Acquires exclusive lock on data buffer > > 2. Acquires exclusive lock on VM buffer > > 3. clears VM bit > > 4. Releases VM buffer lock > > 5. Releases data buffer lock > > Well, but right this is a rather big difference. If vm pages get > unconditionally locked all the time we will have a huge source of new > contention as they are shared between so many heap pages. No, that is only for the process *clearing* the bit, and this already happens. I am not planning on introducing any new locks, aside from the buffer header lock when acquiring a pin. And I plan to keep those pins for long enough that those don't matter, either. Regards,Jeff Davis
On Sun, Nov 18, 2012 at 10:43 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sun, 2012-11-18 at 15:19 +0100, Andres Freund wrote:
>> On Sunday, November 18, 2012 03:07:01 AM Jeff Davis wrote:
>>> Process A (process that clears a VM bit for a data page):
>>> 1. Acquires exclusive lock on data buffer
>>> 2. Acquires exclusive lock on VM buffer
>>> 3. clears VM bit
>>> 4. Releases VM buffer lock
>>> 5. Releases data buffer lock
>>
>> Well, but right this is a rather big difference. If vm pages get
>> unconditionally locked all the time we will have a huge source of new
>> contention as they are shared between so many heap pages.
>
> No, that is only for the process *clearing* the bit, and this already
> happens. I am not planning on introducing any new locks, aside from the
> buffer header lock when acquiring a pin. And I plan to keep those pins
> for long enough that those don't matter, either.
>
> Regards,
> Jeff Davis
Sorry if I am being a bit naive, but shouldn't a simple mutex work in the case when a process wants to change the VM bit in cache?
Mutex would be cheaper than locks.
Atri
--
Regards,
Atri
l'apprenant
> It's quite common to load a lot of data, and then do some reads for a
> while (setting hint bits and flushing them to disk), and then do a
> VACUUM a while later, setting PD_ALL_VISIBLE and writing all of the
> pages again. Also, if I remember correctly, Robert went to significant
> effort when making the VM crash-safe to keep the PD_ALL_VISIBLE and VM
> bits consistent. Maybe this was all discussed before?
>
> All of these hint bits will have a bit more of a performance impact
> after checksums are introduced (for those that use them in conjunction
> with large data loads), so I'm looking for some simple ways to mitigate
> those effects. What kind of worst-case tests could I construct to see if
> there are worrying performance effects to removing these hint bits?
>
> Regards,
> Jeff Davis
I completely agree. In fact, that is the problem that we are trying to solve in our patch (https://commitfest.postgresql.org/action/patch_view?id=991). Essentially, we are trying to mitigate the expense of maintaining hint bits in the cases when the user loads a lot of data, does some operations such as SELECT, and deletes them all. We maintain a cache that can be used to fetch the commit status of XMAX or XMIN instead of hint bits. As the cache is a single frame, it has no issues with the replacement algorithm. Cache lookup is pretty cheap.
I agree with the removal of PD_ALL_VISIBLE. AFAIK (pardon me if I am wrong, I have been trying to research while following this thread), PD_ALL_VISIBLE was really useful when VM bits were not really safe, and crashes could lead to redo setting the bit on the heap pages.
Atri
--
Regards,
Atri
l'apprenant
On Mon, 2012-11-19 at 23:50 +0530, Atri Sharma wrote: > Sorry If I am being a bit naive, but shouldnt a simple mutex work in > the case when a process wants to change the VM bit in cache? > > Mutex would be cheaper than locks. > I thought mutexes are locks? The point is to avoid taking new locks (or mutexes) during a read of the VM bit, because there is concern that it could cause contention. If we lock the entire VM page, that represents many, many data pages, so it's worrisome. Regards,Jeff Davis
On Tue, Nov 20, 2012 at 12:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2012-11-19 at 23:50 +0530, Atri Sharma wrote:
>> Sorry if I am being a bit naive, but shouldn't a simple mutex work in
>> the case when a process wants to change the VM bit in cache?
>>
>> Mutex would be cheaper than locks.
>
> I thought mutexes are locks?
>
> The point is to avoid taking new locks (or mutexes) during a read of the
> VM bit, because there is concern that it could cause contention. If we
> lock the entire VM page, that represents many, many data pages, so it's
> worrisome.
>
> Regards,
> Jeff Davis
My mistake... I thought we were more concerned about the cost of locking.
I agree, locking many data pages simultaneously could be hazardous.
This requires more thought. Maybe removing PD_ALL_VISIBLE isn't such a great idea after all...
Atri
--
Regards,
Atri
l'apprenant
On Tue, 2012-11-20 at 00:13 +0530, Atri Sharma wrote: > My mistake...I thought we were more concerned about the cost of > locking. > > I agree, locking many data pages simultaneously could be hazardous. > > This requires more thought.Maybe removing PD_ALL_VISIBLE isnt such a > great idea after all... As I said elsewhere in the thread, I'm not planning to introduce any additional locking. There is already precedent in IndexOnlyNext. Regards,Jeff Davis
On 2012-11-19 10:46:37 -0800, Jeff Davis wrote: > On Tue, 2012-11-20 at 00:13 +0530, Atri Sharma wrote: > > > My mistake...I thought we were more concerned about the cost of > > locking. > > > > I agree, locking many data pages simultaneously could be hazardous. > > > > This requires more thought.Maybe removing PD_ALL_VISIBLE isnt such a > > great idea after all... > > As I said elsewhere in the thread, I'm not planning to introduce any > additional locking. There is already precedent in IndexOnlyNext. Yea, but that precedent is only valid because IOS does check for an MVCC snapshot. Also it doesn't set anything. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Jeff Davis <pgsql@j-davis.com> writes:
> As I said elsewhere in the thread, I'm not planning to introduce any
> additional locking. There is already precedent in IndexOnlyNext.

Of course, that just begs the question of whether the code in
IndexOnlyNext is correct. And after looking at it, it seems pretty
broken, or at least the argument for its correctness is broken. That
comment seems to consider only the case where the target tuple has just
been inserted --- but what of the case where the target tuple is in
process of being deleted? The deleter will not make a new index entry
for that. Of course, there's a pretty long code path from marking a
tuple deleted to committing the deletion, and it seems safe to assume
that the deleter will hit a write barrier in that path. But I don't
believe the argument here that the reader must hit a read barrier in
between.

regards, tom lane
On Thu, Nov 15, 2012 at 10:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 November 2012 22:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>>> Removing those 3 hints would give us 3 more flag bits (eventually, after
>>> we are sure they aren't just leftover), and it would also reduce the
>>> chance that a page is dirtied for no other reason than to set them.
>>
>> We aren't pressed for flag bits particularly. I think the main
>> attraction of this idea is precisely to reduce unnecessary page dirties,
>> and so that leads me to suggest a variant: keep the four bits defined as
>> now, but do not attempt to set XMIN_INVALID or XMAX_COMMITTED unless the
>> page is already dirty. This would make it a straight-up trade of more
>> clog consultation for fewer page dirties.
>
> Hmm, I thought Alvaro wanted an extra flag bit for foreign key locks
> but couldn't find it.
>
> Come to think of it, why do we have XMIN_INVALID and XMAX_INVALID? We
> never need both at the same time, so we can just use one...
> XMIN_INVALID -> XACT_INVALID
> XMAX_INVALID == XMIN_COMMITTED | XACT_INVALID

Hm, I wonder if you could squeeze two bits out. ISTM here are the
interesting cases enumerated:

0: xmin unknown
1: xmin invalid
2: xmin valid, xmax unknown
3: xmin valid, xmax invalid
4: xmin valid, xmax valid

Did I miss any? If not, I think case #3 could be covered by utilizing
xmax == InvalidTransactionId or simply ignored. That makes the check a
little dirtier than a bit test though, but you could be sneaky and map
both xmin=valid cases to a bit.

merlin
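Read literally, the proposed encoding might look something like this (purely hypothetical names
and values, folding case #3 into the "xmin valid" state via xmax == InvalidTransactionId as
suggested; Tom points out problems with the whole idea in the next message):

#include <stdint.h>
#include <stdio.h>

#define InvalidTransactionId 0u

enum hint_state
{
    HINT_XMIN_UNKNOWN = 0,   /* case 0: must consult CLOG for xmin */
    HINT_XMIN_INVALID = 1,   /* case 1: inserter aborted */
    HINT_XMIN_VALID   = 2,   /* cases 2 and 3: xmax still unknown, or no xmax */
    HINT_BOTH_VALID   = 3    /* case 4: xmin committed, xmax committed */
};

/* Map the two-bit state (plus xmax) back onto the case numbers above. */
static int
enumerated_case(enum hint_state s, uint32_t xmax)
{
    switch (s)
    {
        case HINT_XMIN_UNKNOWN: return 0;
        case HINT_XMIN_INVALID: return 1;
        case HINT_XMIN_VALID:   return (xmax == InvalidTransactionId) ? 3 : 2;
        case HINT_BOTH_VALID:   return 4;
    }
    return -1;
}

int
main(void)
{
    printf("%d %d %d\n",
           enumerated_case(HINT_XMIN_VALID, InvalidTransactionId),  /* 3: xmax invalid */
           enumerated_case(HINT_XMIN_VALID, 12345),                 /* 2: xmax unknown */
           enumerated_case(HINT_BOTH_VALID, 12345));                /* 4 */
    return 0;
}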
Merlin Moncure <mmoncure@gmail.com> writes:
> Hm, I wonder if you could squeeze two bits out. ISTM here are the
> interesting cases enumerated:
> 0: xmin unknown
> 1: xmin invalid
> 2: xmin valid, xmax unknown
> 3: xmin valid, xmax invalid
> 4: xmin valid, xmax valid
> Did I miss any?

Yes. xmin unknown, xmax unknown is possible and different from all the
above, ie a tuple can be deleted by the creating transaction. (But it
could still be visible to some of that transaction's snapshots, so you
can't equate this state to "xmin invalid".)

There's a fairly big problem with any of these ideas, and it's not even
on-disk compatibility. It is that we assume that hint bits can be set
without exclusive lock on the buffer. If any of the transitions
xmin unknown -> xmin committed, xmin unknown -> xmin aborted,
xmax unknown -> xmax committed, xmax unknown -> xmax aborted
aren't expressed by setting a bit that wasn't set before, we probably
lose that property, and thereby a whole lot of concurrency.

regards, tom lane
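To spell out why the set-only property carries the weight here, a toy illustration of the
lost-update case (simplified bit names and values, not the real infomask layout):

#include <stdint.h>
#include <stdio.h>

#define XMIN_COMMITTED 0x01
#define XMAX_COMMITTED 0x04

int
main(void)
{
    uint8_t infomask = 0;

    /*
     * Simulate the classic lost update: two backends read the same old
     * value, each ORs in its own hint, and one write-back overwrites the
     * other.  No lock is held, so this interleaving is possible.
     */
    uint8_t backend_a = infomask;     /* both read 0 */
    uint8_t backend_b = infomask;

    backend_a |= XMIN_COMMITTED;
    backend_b |= XMAX_COMMITTED;

    infomask = backend_a;             /* A writes back */
    infomask = backend_b;             /* B overwrites A: A's hint is lost */

    /*
     * The result (only XMAX_COMMITTED set) is still a valid state: a
     * missing hint just means the next reader consults CLOG again.  If a
     * transition ever had to *clear* a bit that was already set, a lost
     * update like this could instead leave behind a combination that was
     * never true, so the update could no longer be done without an
     * exclusive lock on the buffer.
     */
    printf("final infomask: 0x%02x\n", infomask);
    return 0;
}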
On Mon, 2012-11-19 at 14:46 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > As I said elsewhere in the thread, I'm not planning to introduce any
> > additional locking. There is already precedent in IndexOnlyNext.
>
> Of course, that just begs the question of whether the code in
> IndexOnlyNext is correct. And after looking at it, it seems pretty
> broken, or at least the argument for its correctness is broken.
> That comment seems to consider only the case where the target tuple has
> just been inserted --- but what of the case where the target tuple is in
> process of being deleted? The deleter will not make a new index entry
> for that. Of course, there's a pretty long code path from marking a
> tuple deleted to committing the deletion, and it seems safe to assume
> that the deleter will hit a write barrier in that path. But I don't
> believe the argument here that the reader must hit a read barrier in
> between.

After digging in a little, this is bothering me now, too. I think it's
correct, and I agree with your analysis, but it seems a little complex to
explain why it works.

It also seems like a fortunate coincidence that we can detect clearing
the bit due to an insert, which we need to know about immediately; but
not detect a delete, which we don't care about until it commits.

I will try to update the comment along with my submission.

Regards,
Jeff Davis
On Mon, Nov 19, 2012 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> As I said elsewhere in the thread, I'm not planning to introduce any
>> additional locking. There is already precedent in IndexOnlyNext.
>
> Of course, that just begs the question of whether the code in
> IndexOnlyNext is correct. And after looking at it, it seems pretty
> broken, or at least the argument for its correctness is broken.
> That comment seems to consider only the case where the target tuple has
> just been inserted --- but what of the case where the target tuple is in
> process of being deleted? The deleter will not make a new index entry
> for that. Of course, there's a pretty long code path from marking a
> tuple deleted to committing the deletion, and it seems safe to assume
> that the deleter will hit a write barrier in that path. But I don't
> believe the argument here that the reader must hit a read barrier in
> between.

Ouch. I don't know how I missed that case when writing that comment, but
I agree it should have been discussed there.

The timing requirements for updates and deletes seem to be generally less
strict, because the update or delete won't generally be visible to our
snapshot anyway, so it doesn't really matter if we think the page is
all-visible for slightly longer than it really is. However, it would be a
problem if we thought the page was still all-visible when in fact it
contained a tuple that was not visible to our snapshot. For that to
happen, the updating or deleting transaction would have to commit before
we took our snapshot, yet the visibility map update would have to still
be visible after we took our snapshot. That clearly can't happen, since
both committing and taking a snapshot require LWLock acquire/release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company