Thread: heap_hot_search_buffer refactoring
The attached patch refactors heap_hot_search_buffer() so that index_getnext() can use it, and modifies index_getnext() to do so. The idea is based on one of Heikki's index-only scan patches, from 2009: http://archives.postgresql.org/pgsql-hackers/2009-07/msg00676.php I believe, however, that this portion of that patch stands alone, completely independently of index-only scans. At present, we have two copies of the logic to traverse HOT chains floating around, which means they can get out of sync, possibly resulting in bugs. This has already happened once: http://archives.postgresql.org/pgsql-hackers/2011-05/msg00733.php As a nice bonus, the new logic is both simpler and, I believe, more correct than the current logic. In IndexScanDescData, xs_hot_dead, xs_next_hot, and xs_prev_xmax get replaced by a single boolean xs_continue_hot. There is a small behavioral difference: in the current code, when use a non-MVCC snapshot with index_getnext() and walk a HOT chain, each call remembers the *next* TID that should be examined. With this change, we instead remember the TID that we most recently returned, and compute the next TID when index_getnext() is called again. It seems to me that this is really what we should have been doing all along. Imagine, for example, that we have a HOT chain A -> B, where B has not yet committed. We return A and remember that we next intend to look at B. Before index_getnext() is called, B aborts and a new HOT update produces a HOT chain A -> C. The next call to index_getnext() will nonetheless look at B and conclude that it's reached the end of the HOT chain. This doesn't actually matter a bit for current uses of index_getnext(), because - at least according to Heikki's old notes - the only place we use a non-MVCC snapshot with this function is during CLUSTER. And at that point, we have the whole table locked down, so nothing is happening concurrently. I'm not sure there's any possibility of us ever using this function in a way that would break under the current implementation, but what I'm proposing here does seem more robust to me. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote: > The attached patch refactors heap_hot_search_buffer() so that > index_getnext() can use it, and modifies index_getnext() to do so. Attached is a version of the patch that applies cleanly to master. Regards, Jeff Davis
Attachment
On Sun, 2011-06-19 at 10:50 -0700, Jeff Davis wrote: > On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote: > > The attached patch refactors heap_hot_search_buffer() so that > > index_getnext() can use it, and modifies index_getnext() to do so. > > Attached is a version of the patch that applies cleanly to master. In heap_hot_search_buffer: + /* If this is not the first call, previous call returned a (live!) tuple */ if (all_dead) - *all_dead = true; + *all_dead = !first_call; I think that's a typo: it should be: + *all_dead = first_call; Right? Regards,Jeff Davis
On Sun, Jun 19, 2011 at 2:01 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Sun, 2011-06-19 at 10:50 -0700, Jeff Davis wrote: >> On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote: >> > The attached patch refactors heap_hot_search_buffer() so that >> > index_getnext() can use it, and modifies index_getnext() to do so. >> >> Attached is a version of the patch that applies cleanly to master. > > In heap_hot_search_buffer: > > + /* If this is not the first call, previous call returned > a (live!) tuple */ > if (all_dead) > - *all_dead = true; > + *all_dead = !first_call; > > I think that's a typo: it should be: > > + *all_dead = first_call; Yikes. I think you are right. It's kind of scary that the regression tests passed with that mistake. New patch attached, with that one-line change. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > Yikes. I think you are right. It's kind of scary that the regression > tests passed with that mistake. Can we add a test that exposes that mistake? regards, tom lane
On Sun, Jun 19, 2011 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yikes. I think you are right. It's kind of scary that the regression >> tests passed with that mistake. > > Can we add a test that exposes that mistake? Not sure. We'd have to figure out how to reliably tickle it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 19, 2011 at 2:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Jun 19, 2011 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Yikes. I think you are right. It's kind of scary that the regression >>> tests passed with that mistake. >> >> Can we add a test that exposes that mistake? > > Not sure. We'd have to figure out how to reliably tickle it. *thinks a bit* When using an MVCC snapshot, we always have first_call = true, so the effect of this mistake was just to disable the opportunistic killing of dead tuples, which doesn't affect correctness. When using a non-MVCC snapshot, we call heap_hot_search_buffer() repeatedly until it returns false. For so long as it returns true, it does not matter how all_dead is set, because index_getnext() will return the tuple without examining all_dead. So only the final call matters. If the final call happens to also be the first call, then all_dead might end up being false when it really ought to be true, but that will once again just miss killing a dead tuple. If the final call isn't the first call, then we've got a problem, because now all_dead will be true when it really ought to be false, and we'll nuke an index tuple that we shouldn't nuke. But if this is happening in the context of CLUSTER, then there might still be no user-visible failure, because we're going to rebuild the indexes anyway. There could be a problem if CLUSTER aborts part-way though. A system catalog might get scanned with SnapshotNow, but to exercise the bug you'd need to HOT update a system catalog and then have the updating transaction commit between the time it sees the first row and the time it sees the second one. So I don't quite see how to construct a test case, ATM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > New patch attached, with that one-line change. Jeff, are you planning to review this further? Do you think it's OK to commit? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2011-06-24 at 15:32 -0400, Robert Haas wrote: > On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > New patch attached, with that one-line change. > > Jeff, are you planning to review this further? Do you think it's OK to commit? 1. Patch does not apply to master cleanly, and it's in unified format (so I can't compare it against the old patch very easily). This review is for the first patch, disregarding the "skip = !first_call" issue that you already fixed. If you had other changes in the latest version, please repost the patch. 2. Comment above heap_hot_search_buffer should be updated to document that heapTuple is an out-parameter and document the behavior of first_call 3. The logic around "skip" is slightly confusing to me. Here's my description: if it's not an MVCC snapshot and it's not the first call, then you don't actually want to fetch the tuple with the given tid or a later one in the chain -- you want to fetch the _next_ tuple in the chain or a later one in the chain. Some wording of that description in a comment (either in the function's comment or near the use of "skip") would help a lot. Also, if skip is true, then the tid _must_ be visible according to the (non-MVCC) snapshot, correct? It might help if that was apparent from the code/comments. Other than that, it looks good. Regards,Jeff Davis
On Sat, Jun 25, 2011 at 6:24 AM, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2011-06-24 at 15:32 -0400, Robert Haas wrote: >> On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> > New patch attached, with that one-line change. >> >> Jeff, are you planning to review this further? Do you think it's OK to commit? > > 1. Patch does not apply to master cleanly, and it's in unified format > (so I can't compare it against the old patch very easily). This review > is for the first patch, disregarding the "skip = !first_call" issue that > you already fixed. If you had other changes in the latest version, > please repost the patch. That is strange, because it applies for me. But I had no other changes. > 2. Comment above heap_hot_search_buffer should be updated to document > that heapTuple is an out-parameter and document the behavior of > first_call > > 3. The logic around "skip" is slightly confusing to me. Here's my > description: if it's not an MVCC snapshot and it's not the first call, > then you don't actually want to fetch the tuple with the given tid or a > later one in the chain -- you want to fetch the _next_ tuple in the > chain or a later one in the chain. Some wording of that description in a > comment (either in the function's comment or near the use of "skip") > would help a lot. Also, if skip is true, then the tid _must_ be visible > according to the (non-MVCC) snapshot, correct? It might help if that was > apparent from the code/comments. > > Other than that, it looks good. OK, I've applied this with some additional comment changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company