Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: 64-bit XIDs in deleted nbtree pages
Msg-id: CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
In response to: Re: 64-bit XIDs in deleted nbtree pages (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: 64-bit XIDs in deleted nbtree pages
           Re: 64-bit XIDs in deleted nbtree pages
List: pgsql-hackers
On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I agree that there already are huge problems in that case. But I think
> > we need to consider an append-only case as well; after bulk deletion
> > on an append-only table, vacuum deletes heap tuples and index tuples,
> > marking some index pages as dead and setting an XID into btpo.xact.
> > Since we trigger autovacuums even by insertions, based on
> > autovacuum_vacuum_insert_scale_factor/threshold, autovacuum will run on
> > the table again. But if there is a long-running query, a "wasted"
> > cleanup scan could happen many times, depending on the values of
> > autovacuum_vacuum_insert_scale_factor/threshold and
> > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > code. I agree this is a DBA problem, but it also means this could bring
> > another new problem in the long-running query case.
>
> I see your point.

My guess is that this concern of yours is somehow related to how we do deletion and recycling *in general*. Currently (and even in v3 of the patch), we assume that recycling the pages that a VACUUM operation deletes will happen "eventually". This kind of makes sense when you have "typical vacuuming" -- deletes/updates, no big bursts, only rare bulk deletes, etc. But when you have a mixture of different triggering conditions, which is quite possible, it is difficult to understand what "eventually" actually means...

> BTW, I am thinking about making recycling take place for pages that
> were deleted during the same VACUUM. We can just use a
> work_mem-limited array to remember a list of blocks that are deleted
> but not yet recyclable (plus the XID found in the block).

...which brings me back to this idea. I've prototyped this. It works really well.

In most cases the prototype makes VACUUM operations with nbtree index page deletions also recycle the pages that were deleted, at the end of the btvacuumscan(). Little or no recycling work is deferred indefinitely. This has obvious advantages, of course, but it also has a non-obvious advantage: the awkward question of what "eventually" actually means with mixed triggering conditions over time mostly goes away. So perhaps this actually addresses your concern, Masahiko.

I've been testing this with BenchmarkSQL [1], which has several indexes that regularly need page deletions. There is also a realistic "life cycle" to the data in these indexes. I added custom instrumentation to display information about what's going on with page deletion when the benchmark is run. I wrote a quick-and-dirty patch that makes log_autovacuum show the same information about index page deletion that you see when VACUUM VERBOSE is run (including the new pages_newly_deleted field from my patch). With this particular TPC-C/BenchmarkSQL workload, VACUUM consistently manages to place every page that it deletes in the FSM, without leaving anything to the next VACUUM. There are a very small number of exceptions, where we "only" manage to recycle maybe 95% of the pages that were deleted.

The race condition that nbtree avoids by deferring recycling was always a narrow one, outside of the extremes -- the way we defer has always been overkill. It's almost always unnecessary to delay placing deleted pages in the FSM until the *next* VACUUM. We only have to delay it until the end of the *same* VACUUM -- why wait until the next VACUUM if we don't have to?
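To make the bookkeeping concrete, here is a minimal sketch of the end-of-btvacuumscan() pass. The struct and function names (BTPendingRecycle, _bt_recycle_pending) are invented for illustration -- this is not the prototype's actual code -- while RecordFreeIndexPage(), IndexFreeSpaceMapVacuum(), and GlobalVisCheckRemovableFullXid() are existing backend facilities that such a pass could use:

#include "postgres.h"
#include "access/transam.h"     /* FullTransactionId */
#include "storage/indexfsm.h"   /* RecordFreeIndexPage(), IndexFreeSpaceMapVacuum() */
#include "utils/rel.h"          /* Relation */
#include "utils/snapmgr.h"      /* GlobalVisCheckRemovableFullXid() */

/* One entry per page deleted by this VACUUM (sketch; names are made up) */
typedef struct BTPendingRecycle
{
    BlockNumber       blkno;    /* deleted page */
    FullTransactionId safexid;  /* 64-bit XID stamped on the page at deletion */
} BTPendingRecycle;

/*
 * Run at the end of btvacuumscan(): place each page that this same VACUUM
 * deleted in the FSM, provided no index scan can still hold a stale link
 * to it.  Pages that aren't yet safe are simply left for a future VACUUM,
 * exactly as before.
 */
static void
_bt_recycle_pending(Relation rel, BTPendingRecycle *pending, int npending)
{
    bool        recycled = false;

    for (int i = 0; i < npending; i++)
    {
        /* Safe once safexid precedes every snapshot that could visit the page */
        if (GlobalVisCheckRemovableFullXid(NULL, pending[i].safexid))
        {
            RecordFreeIndexPage(rel, pending[i].blkno);
            recycled = true;
        }
    }

    /* Make the newly freed pages visible to future _bt_getbuf() calls */
    if (recycled)
        IndexFreeSpaceMapVacuum(rel);
}

Because the array is work_mem-limited, filling it up just means that some deleted pages fall back to the old "wait until the next VACUUM" path; correctness never depends on the array being large enough.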
In general, this deferred recycling business has nothing to do with MVCC/GC/whatever, and yet the code seems to suggest that it does. While it is convenient to use an XID for page deletion and recycling as a way of implementing what Lanin & Shasha call "the drain technique" [2], all we have to do is prevent certain race conditions. This is all about the index itself, the data structure, and how it is maintained -- nothing more. It almost seems obvious to me.

It's still possible to imagine extremes -- extremes that even the "try to recycle pages we ourselves deleted when we reach the end of btvacuumscan()" version of my patch cannot deal with. Maybe it really is true that it's inherently impossible to recycle a deleted page even at the end of a VACUUM: maybe a long-running transaction (that could in principle have a stale link to our deleted page) starts before we VACUUM, and lasts after VACUUM finishes. So it's just not safe. When that happens, we're back to the original problem: we're relying on some *future* VACUUM operation to recycle the page for us at some indefinite point in the future.

It's fair to wonder: what are the implications of that? Are we not back to square one? Don't we have the same "what does 'eventually' really mean" problem once again? I think that that's okay, because this remaining case is a *truly* extreme case (especially with a large index, where index vacuuming will naturally take a long time). It will be rare. More importantly, the fact that the scenario is now an extreme case justifies treating it as an extreme case. We can teach _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too.

In particular, I think that it will now be okay to increase the threshold applied when considering deleted pages inside _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the patch. But in v4, which has the new recycling enhancement, I think that it would be sensible to make it 5%, or maybe even 10%. This naturally makes Masahiko's problem scenario unlikely to actually result in a truly wasted call to btvacuumscan(). The number of pages that the metapage indicates are "deleted but not yet placed in the FSM" will be close to the theoretical minimum, because we're no longer naively throwing away information about which specific pages will be recyclable soon (which is what the current approach does, really).
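For reference, the adjusted heuristic could look something like the following sketch. The signature and bookkeeping are simplified for illustration (the real function would read the deleted-pages count from the index metapage); the divisor of 20 encodes the 5% threshold suggested above:

#include "postgres.h"
#include "access/genam.h"       /* IndexVacuumInfo */
#include "storage/bufmgr.h"     /* RelationGetNumberOfBlocks() */

/*
 * Sketch only: decide whether a cleanup-only btvacuumscan() is needed.
 * prev_num_delpages stands in for the metapage's count of pages that are
 * deleted but not yet placed in the FSM.
 */
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info, BlockNumber prev_num_delpages)
{
    if (prev_num_delpages > 0 &&
        prev_num_delpages > RelationGetNumberOfBlocks(info->index) / 20)
        return true;            /* > 5% of the index is deleted-but-unrecycled */

    return false;
}

With same-VACUUM recycling in place, crossing that threshold should genuinely indicate the extreme case (such as the long-running transaction described above), rather than ordinary deferred work.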
[1] https://github.com/wieck/benchmarksql
[2] https://archive.org/stream/symmetricconcurr00lani#page/8/mode/2up -- see "2.5 Freeing Empty Nodes"

--
Peter Geoghegan