Thread: Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
From: Peter Geoghegan
Date:
On Wed, Mar 9, 2022 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-03 19:31:32 -0800, Peter Geoghegan wrote:
> > Attached is a new revision of my fix. This is more or less a
> > combination of my v4 fix from November 12 [1] and Andres'
> > already-committed fix (commit 18b87b20), rebased on top of HEAD. This
> > patch was originally a bugfix, but now it's technically just
> > refactoring/hardening of the logic in pruneheap.c. It hasn't changed
> > all that much, though.
>
> Perhaps worth discussing outside of -bugs?

Okay. Replying on a new -hackers thread dedicated to the hardening patch.

Attached is v6, which goes a bit further than v5 in using local state that we build up-front to describe the state of the page being pruned (rather than rereading the page itself). I'll address your v5 review comments now.

> > We now do "3 passes" over the page. The first is the aforementioned
> > "precalculate HTSV" pass, the second is for determining the extent of
> > HOT chains, and the third is for any remaining disconnected/orphaned
> > HOT chains. I suppose that that's okay, since the amount of work has
> > hardly increased in proportion to this "extra pass over the page". Not
> > 100% sure about everything myself right now, though. I guess that "3
> > passes" might be considered excessive.
>
> We should be able to optimize away the third pass in the majority of cases, by
> keeping track of the number of tuples visited or such. That seems like it
> might be worth doing?

It independently occurred to me to go further with using the state from the first pass to save work in the second and third passes. We still have what looks like a third pass over the page in v6, but it doesn't really work that way. That is, we only need to rely on local state that describes the page, which is conveniently available already. We don't need to look at the physical page in the so-called third pass.

> > +   /*
> > +    * Start from the root item. Mark it as valid up front, since root items
> > +    * are always processed here (not as disconnected tuples in third pass
> > +    * over page).
> > +    */
> > +   prstate->visited[rootoffnum] = true;
> >     offnum = rootoffnum;
> > +   nchain = 0;
>
> I wonder if it'd be clearer if we dealt with redirects outside of the
> loop.

I found it more natural to deal with everything outside of the entire function (the heap_prune_chain function, which I've renamed to heap_prune_from_root in v6). This is kind of what you suggested anyway, since the function itself is stripped down to just the loop in v6.

> Would make it easier to assert that the target of a redirect may not be
> unused / !heap-only?

With the approach in v6 we always eliminate LP_DEAD and LP_UNUSED items up-front (by considering them "visited" in the first pass over the page). We're also able to eliminate not-heap-only tuples quickly, since whether or not each item is a heap-only tuple is also recorded in the first pass (the same pass that gets the HTSV status of each item).

It seemed natural to give more responsibility to the caller, heap_page_prune, which is what this really amounts to. This continues the trend you started in bugfix commit 18b87b20, which added the initial loop for the HTSV calls. In v6 it becomes heap_page_prune's responsibility to only call heap_prune_from_root with an item that is either an LP_REDIRECT item or a plain heap tuple (not a heap-only tuple). The new bookkeeping state gathered in our first pass over the page makes this easy.

I'm not entirely sure that it makes sense to go this far.
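To make the division of labor concrete, the caller ends up shaped roughly like this. This is only a heavily simplified sketch of the structure, not the exact v6 code (it leaves out the prunable-XID bookkeeping and actually executing the prune actions, and the details of the real patch differ); "visited" and "heaponly" are the new per-item arrays in PruneState:

    /* First pass: build local state describing every item on the page */
    for (offnum = maxoff; offnum >= FirstOffsetNumber;
         offnum = OffsetNumberPrev(offnum))
    {
        ItemId          itemid = PageGetItemId(page, offnum);
        HeapTupleHeader htup;

        if (!ItemIdIsNormal(itemid))
        {
            /*
             * LP_DEAD/LP_UNUSED items never need to be processed again;
             * LP_REDIRECT items need no precalculated state of their own.
             */
            if (!ItemIdIsRedirected(itemid))
                prstate.visited[offnum] = true;
            continue;
        }

        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        prstate.heaponly[offnum] = HeapTupleHeaderIsHeapOnly(htup);

        tup.t_data = htup;
        tup.t_len = ItemIdGetLength(itemid);
        ItemPointerSet(&(tup.t_self), blockno, offnum);
        prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup, buffer);
    }

    /* Second pass: process each valid HOT chain, starting from its root item */
    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        /* Roots are LP_REDIRECT items and plain (not heap-only) tuples */
        if (prstate.visited[offnum] || prstate.heaponly[offnum])
            continue;

        ndeleted += heap_prune_from_root(&prstate, offnum);
    }

    /*
     * "Third pass": anything still not visited must be a disconnected
     * heap-only tuple -- decided from prstate alone, without reading the
     * physical page again.
     */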
We do need an extra array of booleans to make this work (the new "heaponly[]" array in PruneState), which will need to be stored on the stack. What do you think of that aspect?

> > +           /*
> > +            * Remember the offnum of the last DEAD tuple in this HOT
> > +            * chain. To keep things simple, don't treat heap-only tuples
> > +            * from a HOT chain as DEAD unless they're only preceded by
> > +            * other DEAD tuples (in addition to actually being DEAD).
>
> s/actually/themselves/?

Fixed.

> > +            * Remaining tuples that appear DEAD (but don't get treated as
> > +            * such by us) are from concurrently aborting updaters.
>
> I don't understand this bit. A DEAD tuple following a RECENTLY_DEAD one won't
> be removed now, and doesn't need to involve a concurrent abort? Are you
> thinking of "remaining" as the tuples not referenced in the previous sentences?

This was just brainfade on my part. I think that I messed it up during rebasing.

> > +            * VACUUM will ask us to prune the heap page again when it
> > +            * sees that there is a DEAD tuple left behind, but that would
> > +            * be necessary regardless of our approach here.
> > +            */
>
> Only as long as we do another set of HTSV calls...

Removed.

> >             case HEAPTUPLE_LIVE:
> >             case HEAPTUPLE_INSERT_IN_PROGRESS:
> > +                   pastlatestdead = true;  /* no further DEAD tuples in CHAIN */
>
> If we don't do anything to the following tuples, why don't we just abort here?
> I assume it is because we'd then treat them as disconnected? That should
> probably be called out...

Good point. Fixed.

> > -   }
> > -   else if (nchain < 2 && ItemIdIsRedirected(rootlp))
> > -   {
> > -           /*
> > -            * We found a redirect item that doesn't point to a valid follow-on
> > -            * item. This can happen if the loop in heap_page_prune caused us to
> > -            * visit the dead successor of a redirect item before visiting the
> > -            * redirect item. We can clean up by setting the redirect item to
> > -            * DEAD state.
> > -            */
> > -           heap_prune_record_dead(prstate, rootoffnum);
> > +
> > +   return ndeleted;
> >     }
>
> Could there be such tuples from before a pg_upgrade? Do we need to deal with
> them somehow?

BTW, this code (which I now propose to more or less remove) predates commit 0a469c87, which removed old-style VACUUM FULL back in 2010. Prior to that commit there was an extra block that followed this one -- "else if (redirect_move && ItemIdIsRedirected(rootlp)) { ... }". Once you consider that old-style VACUUM FULL once had to rewrite LP_REDIRECT items, this code structure makes a bit more sense.

Another possibly-relevant piece of historic context about this hunk of code: it was originally part of a critical section -- the restructuring that added heap_page_prune_execute() came later, in commit 6f10eb2111.

Anyway, to answer your question: I don't believe that there will be any such pre-pg_upgrade tuples. But let's assume that I'm wrong about that; what can be done about it now? We're talking about a root LP_REDIRECT item that points to some item that just doesn't appear sane for it to point to (e.g. maybe it points past the end of the line pointer array, or to a not-heap-only tuple). If we are ever able to even *detect* such corruption using tools like amcheck, then that's just because we got lucky with the details. The item is irredeemably corrupt, no matter what we do.

To be clear, I'm not saying that you're wrong -- maybe I do need more handling for this case. In any case I haven't quite figured out where to draw the line with visibly corrupt HOT chains, even in v6.
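If we did decide that the case needs explicit handling, I imagine it would look more like a defensive check than like the old "mark the redirect dead" behavior. Something along these lines, say -- though to be clear this is only a rough illustration of the idea, not code from the v6 patch (page, maxoff and rootlp are the usual local variables in the pruning code):

    if (ItemIdIsRedirected(rootlp))
    {
        OffsetNumber tgtoffnum = ItemIdGetRedirect(rootlp);

        /*
         * A redirect had better point to a normal, heap-only tuple somewhere
         * within the line pointer array.  Anything else means the page is
         * corrupt.
         */
        if (tgtoffnum < FirstOffsetNumber || tgtoffnum > maxoff ||
            !ItemIdIsNormal(PageGetItemId(page, tgtoffnum)) ||
            !prstate->heaponly[tgtoffnum])
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("invalid redirect line pointer at offset %u",
                            (unsigned) rootoffnum)));
    }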
All I'm saying is that the old code doesn't seem to tell us anything about what I ought to replace it with now. It tells us nothing about what should be possible with LP_REDIRECT items in general, with pg_upgrade. The structure doesn't even suggest that somebody once believed that it was acceptable for an existing LP_REDIRECT item to redirect to something other than a heap-only tuple. And even if it did suggest that, it would still be a wildly unreasonable thing for anybody to have ever believed IMV. As I said, the item has to be considered corrupt, no matter what might have been possible in the past.

--
Peter Geoghegan
Re: Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)
From: Matthias van de Meent
Date:
On Sun, 20 Mar 2022 at 04:48, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v6, which goes a bit further than v5 in using local state
> that we build up-front to describe the state of the page being pruned
> (rather than rereading the page itself).

I didn't test the code, so these comments reflect my general feel for this patch, based on visual analysis only.

> > > We now do "3 passes" over the page. The first is the aforementioned
> > > "precalculate HTSV" pass, the second is for determining the extent of
> > > HOT chains, and the third is for any remaining disconnected/orphaned
> > > HOT chains. I suppose that that's okay, since the amount of work has
> > > hardly increased in proportion to this "extra pass over the page". Not
> > > 100% sure about everything myself right now, though. I guess that "3
> > > passes" might be considered excessive.

The passes don't all have a very clear explanation of what they do, or of what they leave for the next pass to process. That can be distilled from the code comments at the respective phases and in the various functions, but only after careful review of all of those comments; there doesn't seem to be one clear overview.

> @@ -295,30 +305,25 @@ heap_page_prune(Relation relation, Buffer buffer,
> [...]
> +  * prstate for later passes. Scan the page backwards (in reverse item
> +  * offset number order).
> - [...]
> +  * This approach is good for performance. Most commonly tuples within a
> +  * page are stored at decreasing offsets (while the items are stored at
> +  * increasing offsets).

A reference to why this is commonly the case (i.e. PageRepairFragmentation, compactify_tuples, natural insertion order) would probably help here, since it is not obvious at first sight that this ordering is the common one.

> @@ -350,28 +370,41 @@ heap_page_prune(Relation relation, Buffer buffer,
> +     /* Now scan the page a second time to process each root item */

This second pass also processes the HOT chain of each root item, but that is not clear from the comment on the loop. I'd suggest a comment more along these lines:

    /*
     * Now scan the page a second time to process each valid HOT chain;
     * i.e. each non-HOT tuple or redirect line pointer and the HOT tuples
     * in the trailing chain, if any.
     *
     * heap_prune_from_root marks all items in the HOT chain as visited,
     * so that phase 3 knows to skip those items.
     */

Apart from my requests for extra clarity in the comments, I think this materially improves pruning, so thanks for working on this.

-Matthias
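For reference, the reason that ascending line pointer offsets usually go together with descending tuple data offsets is the way bufpage.c places tuple data: each newly added tuple goes just below the previously stored one, at the top of the remaining free space, and page compaction (PageRepairFragmentation/compactify_tuples) keeps tuples in the same relative physical order. Roughly, in the spirit of PageAddItemExtended() rather than a verbatim extract:

    /* Each new item's data is placed below all previously stored tuple data */
    upper = (int) phdr->pd_upper - (int) alignedSize;

    itemId = PageGetItemId(page, offsetNumber);
    ItemIdSetNormal(itemId, upper, size);
    memcpy((char *) page + upper, item, size);

    phdr->pd_upper = (LocationIndex) upper;

So scanning the line pointer array backwards, as the patch's first pass does, tends to access the tuples themselves in ascending memory order.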