Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From | Masahiko Sawada
---|---
Subject | Re: 64-bit XIDs in deleted nbtree pages
Date |
Msg-id | CAD21AoAaHg86bGm=k8cBtK9HeO46QGRMX4pxNt5gt_11ispFGA@mail.gmail.com
In response to | Re: 64-bit XIDs in deleted nbtree pages (Peter Geoghegan <pg@bowt.ie>)
Responses | Re: 64-bit XIDs in deleted nbtree pages
List | pgsql-hackers
On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I agree that there already are huge problems in that case. But I
> > > think we need to consider an append-only case as well; after bulk
> > > deletion on an append-only table, vacuum deletes heap tuples and
> > > index tuples, marking some index pages as dead and setting an XID
> > > into btpo.xact. Since we trigger autovacuums even by insertions
> > > based on autovacuum_vacuum_insert_scale_factor/threshold, autovacuum
> > > will run on the table again. But if there is a long-running query a
> > > "wasted" cleanup scan could happen many times depending on the
> > > values of autovacuum_vacuum_insert_scale_factor/threshold and
> > > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > > code. I agree this is a DBA problem but it also means this could
> > > bring another new problem in a long-running query case.
> >
> > I see your point.
>
> My guess is that this concern of yours is somehow related to how we do
> deletion and recycling *in general*. Currently (and even in v3 of the
> patch), we assume that recycling the pages that a VACUUM operation
> deletes will happen "eventually". This kind of makes sense when you
> have "typical vacuuming" -- deletes/updates, and no big bursts, rare
> bulk deletes, etc.
>
> But when you do have a mixture of different triggering conditions,
> which is quite possible, it is difficult to understand what
> "eventually" actually means...
>
> > BTW, I am thinking about making recycling take place for pages that
> > were deleted during the same VACUUM. We can just use a
> > work_mem-limited array to remember a list of blocks that are deleted
> > but not yet recyclable (plus the XID found in the block).
>
> ...which brings me back to this idea.
>
> I've prototyped this. It works really well. In most cases the
> prototype makes VACUUM operations with nbtree index page deletions
> also recycle the pages that were deleted, at the end of the
> btvacuumscan(). We do very little or no "indefinite deferring" work
> here. This has obvious advantages, of course, but it also has a
> non-obvious advantage: the awkward question concerning "what
> eventually actually means" with mixed triggering conditions over time
> mostly goes away. So perhaps this actually addresses your concern,
> Masahiko.

Yes. I think this would simplify the problem by resolving almost all of
the problems related to indefinitely deferring page recycling. We will
be able to recycle almost all just-deleted pages in practice,
especially when btvacuumscan() takes a long time. And there would not
be a noticeable downside, I think.

BTW, if the btree index starts to use maintenance_work_mem for this
purpose, we also need to set amusemaintenanceworkmem to true, which is
taken into account by parallel vacuum.

>
> I've been testing this with BenchmarkSQL [1], which has several
> indexes that regularly need page deletions. There is also a realistic
> "life cycle" to the data in these indexes. I added custom
> instrumentation to display information about what's going on with page
> deletion when the benchmark is run. I wrote a quick-and-dirty patch
> that makes log_autovacuum show the same information that you see about
> index page deletion when VACUUM VERBOSE is run (including the new
> pages_newly_deleted field from my patch).
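To make the work_mem-limited array idea quoted earlier concrete, here
is a rough C sketch: each page deleted by the current btvacuumscan() is
remembered together with the XID stamped into btpo.xact, and whatever
has become safe is put into the FSM at the end of the same VACUUM. The
type and function names (BTPendingRecycle, _bt_pending_remember,
_bt_pending_recycle) and the caller-supplied horizon XID are invented
for illustration; they are not taken from the actual patch.

    /*
     * Illustrative only: a work_mem-limited array of pages deleted by
     * the current btvacuumscan(), remembered together with the XID
     * stamped into btpo.xact, so they can be placed in the FSM at the
     * end of the same VACUUM once that XID is safely in the past.
     */
    #include "postgres.h"

    #include "access/transam.h"
    #include "storage/block.h"
    #include "storage/indexfsm.h"
    #include "utils/rel.h"

    typedef struct BTPendingRecycle
    {
        BlockNumber blkno;          /* deleted page */
        TransactionId safexid;      /* btpo.xact value at deletion time */
    } BTPendingRecycle;

    typedef struct BTPendingRecycleArray
    {
        BTPendingRecycle *items;    /* allocated by caller */
        int         nitems;
        int         maxitems;       /* capped by a work_mem-sized budget */
    } BTPendingRecycleArray;

    /* Remember a just-deleted page, if the array still has room */
    static void
    _bt_pending_remember(BTPendingRecycleArray *arr,
                         BlockNumber blkno, TransactionId safexid)
    {
        if (arr->nitems >= arr->maxitems)
            return;                 /* fall back: leave it to a later VACUUM */
        arr->items[arr->nitems].blkno = blkno;
        arr->items[arr->nitems].safexid = safexid;
        arr->nitems++;
    }

    /*
     * At the end of btvacuumscan(): recycle every remembered page whose
     * XID is already older than the oldest transaction still running.
     */
    static void
    _bt_pending_recycle(Relation rel, BTPendingRecycleArray *arr,
                        TransactionId oldest_running_xid)
    {
        for (int i = 0; i < arr->nitems; i++)
        {
            if (TransactionIdPrecedes(arr->items[i].safexid,
                                      oldest_running_xid))
                RecordFreeIndexPage(rel, arr->items[i].blkno);
        }
    }

If the array fills up, the natural fallback is simply the old behavior:
leave the remaining pages for some future VACUUM to recycle.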
> With this particular TPC-C/BenchmarkSQL workload, VACUUM seems to
> consistently manage to go on to place every page that it deletes in
> the FSM without leaving anything to the next VACUUM. There are a very
> small number of exceptions where we "only" manage to recycle maybe 95%
> of the pages that were deleted.

Great!

> The race condition that nbtree avoids by deferring recycling was
> always a narrow one, outside of the extremes -- the way we defer has
> always been overkill. It's almost always unnecessary to delay placing
> deleted pages in the FSM until the *next* VACUUM. We only have to
> delay it until the end of the *same* VACUUM -- why wait until the next
> VACUUM if we don't have to? In general this deferred recycling
> business has nothing to do with MVCC/GC/whatever, and yet the code
> seems to suggest that it does. While it is convenient to use an XID
> for page deletion and recycling as a way of implementing what Lanin &
> Shasha call "the drain technique" [2], all we have to do is prevent
> certain race conditions. This is all about the index itself, the data
> structure, how it is maintained -- nothing more. It almost seems
> obvious to me.

Agreed.

> It's still possible to imagine extremes. Extremes that even the "try
> to recycle pages we ourselves deleted when we reach the end of
> btvacuumscan()" version of my patch cannot deal with. Maybe it really
> is true that it's inherently impossible to recycle a deleted page even
> at the end of a VACUUM -- maybe a long-running transaction (that could
> in principle have a stale link to our deleted page) starts before we
> VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
> that happens, we're back to having the original problem: we're relying
> on some *future* VACUUM operation to do that for us at some indefinite
> point in the future. It's fair to wonder: What are the implications of
> that? Are we not back to square one? Don't we have the same "what does
> 'eventually' really mean" problem once again?
>
> I think that that's okay, because this remaining case is a *truly*
> extreme case (especially with a large index, where index vacuuming
> will naturally take a long time).

Right.

> It will be rare. But more importantly, the fact that that scenario is
> now an extreme case justifies treating it as an extreme case. We can
> teach _bt_vacuum_needs_cleanup() to recognize it as an extreme case,
> too. In particular, I think that it will now be okay to increase the
> threshold applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan(). The number of pages
> that the metapage indicates are "deleted but not yet placed in the
> FSM" will be close to the theoretical minimum, because we're no longer
> naively throwing away information about which specific pages will be
> recyclable soon. Which is what the current approach does, really.

Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen in
practice, setting the threshold to 5% or 10% seems to avoid the problem
in nearly 100% of cases, I think.
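For illustration only, the threshold test being discussed might look
something like the sketch below; the function name, the way the page
counts are obtained, and the 5% constant are placeholders rather than
the committed code.

    /*
     * Illustrative only: skip a cleanup-only btvacuumscan() unless the
     * deleted-but-not-yet-recycled pages exceed some fraction of the
     * index.  The 5% figure and the argument names are placeholders.
     */
    #include "postgres.h"

    #include "storage/block.h"

    static bool
    deleted_pages_exceed_threshold(BlockNumber num_index_pages,
                                   BlockNumber num_deleted_not_recycled)
    {
        const double threshold = 0.05;  /* 2.5% in v3; 5% or 10% proposed for v4 */

        if (num_index_pages == 0)
            return false;

        return ((double) num_deleted_not_recycled /
                (double) num_index_pages) > threshold;
    }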
Another idea I came up with (maybe on top of your idea above) is to
change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages when the total number of deleted
pages exceeds 2% of the index. That way, we can be sure to recycle more
than 2% of the index once that XID becomes older than the global xmin.

Also, maybe we can record deleted pages in the FSM without deferring at
all, and check recyclability when reusing a page. That is, when we get
a free page from the FSM we check whether the page is really recyclable
(maybe _bt_getbuf() already does this?). IOW, a deleted page is
recycled only when it is requested for reuse. If btpo.xact is a 64-bit
XID we never need to worry about the case where a deleted page is never
requested for reuse.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
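As a rough sketch of the reuse-time check floated in the last paragraph
above, assuming the deleted page carries a 64-bit XID and that the
caller can supply a 64-bit "oldest running transaction" horizon; the
function and parameter names are invented for illustration.

    /*
     * Illustrative only: with a 64-bit XID stored in the deleted page,
     * deciding at reuse time whether a page handed back by the FSM is
     * really recyclable becomes a simple, wraparound-proof comparison.
     * Both arguments are assumed to be supplied by the caller.
     */
    #include "postgres.h"

    #include "access/transam.h"

    static bool
    deleted_page_reusable(FullTransactionId page_deleted_fxid,
                          FullTransactionId oldest_running_fxid)
    {
        /*
         * Safe only once no transaction that could still hold a stale
         * link to the deleted page can be running.
         */
        return FullTransactionIdPrecedes(page_deleted_fxid,
                                         oldest_running_fxid);
    }

The appeal of a 64-bit btpo.xact here is that the comparison can never
be confused by XID wraparound, so a deleted page that sits unreused for
a very long time remains safely recyclable.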