Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: 64-bit XIDs in deleted nbtree pages |
Date | |
Msg-id | CAH2-WzkP=9CUYmbCi7Yzopsx380kaPHi686Sy22Skhwv4nCfOg@mail.gmail.com |
In response to | Re: 64-bit XIDs in deleted nbtree pages (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: 64-bit XIDs in deleted nbtree pages; Re: 64-bit XIDs in deleted nbtree pages |
List | pgsql-hackers |
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yes. I think this would simplify the problem by resolving almost all
> problems related to indefinitely deferring page recycling.
>
> We will be able to recycle almost all just-deleted pages in practice,
> especially when btvacuumscan() took a long time. And there would not
> be a noticeable downside, I think.

Great!

> BTW if btree index starts to use maintenance_work_mem for this purpose,
> we also need to set amusemaintenanceworkmem to true, which is
> considered for parallel vacuum.

I was just going to use work_mem. This should be okay. Note that CREATE INDEX uses an additional work_mem allocation when building a unique index, for the second spool/tuplesort. That seems like a precedent that I can follow here.

Right now the BTPendingRecycle structs that the patch uses to store information about pages that the current VACUUM deleted (and may yet be able to place in the FSM) are each 16 bytes (including alignment overhead). I could probably make them smaller with a little work, but even now that's quite small. Even with the default 4MiB work_mem setting we can fit information about 262144 pages all at once. That's 2GiB worth of deleted index pages, which is generally much more than we'll need.

> Yeah, increasing the threshold would solve the problem in most cases.
> Given that nbtree index page deletion is unlikely to happen in
> practice, having the threshold at 5% or 10% seems to avoid the problem
> in nearly 100% of cases, I think.

Of course it all depends on workload/index characteristics, in the end. It is very rare to delete a percentage of the index that exceeds autovacuum_vacuum_scale_factor -- that's the important thing here IMV.

> Another idea I came up with (maybe on top of your idea above) is to
> change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
> btpo.xact XID among all deleted pages when the total amount of deleted
> pages exceeds 2% of the index. That way, we surely can recycle more than
> 2% of the index when the XID becomes older than the global xmin.

You could make my basic approach to recycling deleted pages earlier (ideally at the end of the same btvacuumscan() that deleted the pages in the first place) more sophisticated in a variety of ways. These are all subject to diminishing returns, though. I've already managed to recycle close to 100% of all B-Tree pages during the same VACUUM with a very simple approach -- at least if we assume BenchmarkSQL is representative. It is hard to know how much more effort can be justified.

To be clear, I'm not saying that an improved design cannot be justified now or in the future (BenchmarkSQL is not like many workloads that people use Postgres for). I'm just saying that I *don't know* where to draw the line. Any particular place that we draw the line feels a little arbitrary to me. This includes my own choice of the work_mem-limited BTPendingRecycle array. My patch currently works that way because it's simple -- no other reason.
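(As a concrete reference point for the work_mem arithmetic above -- the field layout below is made up, not the patch's actual BTPendingRecycle definition, but the sizes line up with the 16 bytes / 262144 entries / 2GiB numbers:)

```c
/*
 * Hypothetical layout, for illustration only -- the patch's real
 * BTPendingRecycle struct may differ.  What matters here is the size:
 * a 4-byte block number plus an 8-byte full transaction id comes to
 * 16 bytes per entry once alignment padding is included.
 */
#include "postgres.h"
#include "access/transam.h"		/* FullTransactionId */
#include "storage/block.h"		/* BlockNumber */

typedef struct BTPendingRecycleSketch
{
	BlockNumber		  blkno;	/* deleted page that may become recyclable */
	FullTransactionId safexid;	/* XID that must age past the global xmin */
} BTPendingRecycleSketch;		/* 16 bytes with alignment padding */

/*
 * With the default work_mem of 4MB:
 *   4MB / 16 bytes = 262144 entries, and
 *   262144 pages * 8KB per page = 2GB of deleted index pages tracked at once.
 */
static inline Size
pending_recycle_capacity(Size work_mem_bytes)
{
	return work_mem_bytes / sizeof(BTPendingRecycleSketch);
}
```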
Any scheme to further improve the "work_mem-limited BTPendingRecycle array" design from my patch boils down to this: a new approach that makes recycling of any remaining deleted pages take place "before too long" -- after the end of the btvacuumscan() BTPendingRecycle array stuff (presumably that didn't work out in cases where an improved approach matters), but before the next VACUUM takes place (since that will do the required recycling anyway, unless it's unable to do any work at all, in which case it hardly matters).

Here are two ideas of my own in this same class as your idea:

1. Remember to do some of the BTPendingRecycle array FSM processing stuff in btvacuumcleanup() -- defer some of the recycling of pages recorded in BTPendingRecycle entries (pages deleted during btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.

Right now btvacuumcleanup() will always do nothing when btbulkdelete() was called earlier. But that's just a current nbtree convention, and is no reason not to do this (we don't have to scan the index again at all).

The advantage of teaching btvacuumcleanup() to do this is that it delays the "BTPendingRecycle array FSM processing" stuff until the last moment that it is still easy to use the in-memory array (because we haven't freed it yet). In general, doing it later makes it more likely that we'll successfully recycle the pages. Though in general it might not make any difference -- so we're still hoping that the workload allows us to recycle everything we deleted, without making the design much more complicated than what I posted already.

(BTW, I see that you reviewed commit 4e514c61, so you must have thought about the trade-off between doing deferred recycling in amvacuumcleanup() vs ambulkdelete(), when to call IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot implement this idea while calling IndexFreeSpaceMapVacuum() during both btvacuumcleanup() and btbulkdelete(), so that we get the best of both worlds -- fast recycling *and* more delayed processing that is more likely to ultimately succeed.)

2. Remember/serialize the BTPendingRecycle array when we realize that we cannot put all recyclable pages in the FSM at the end of the current btvacuumscan(), and then use an autovacuum work item to process them before too long -- a call to AutoVacuumRequestWork() could even serialize the data on disk.

Idea 2 has the advantage of allowing retries -- eventually it will be safe to recycle the pages, if we just wait long enough.

Anyway, I'm probably not going to pursue either of the two ideas for Postgres 14. I'm mentioning these ideas now because the trade-offs show that there is no perfect design for this deferred recycling stuff. Whatever we do, we should accept that there is no perfect design.

Actually, there is one more reason why I bring up idea 1 now: I want to hear your thoughts on the index AM API questions now, which idea 1 touches on. Ideally all of the details around the index AM VACUUM APIs (i.e. when and where the extra work happens -- btvacuumcleanup() vs btbulkdelete()) won't need to change much in the future. I worry about getting this index AM API stuff right, at least a little.

> Also, maybe we can record deleted pages to the FSM even without
> deferring, and check it when re-using. That is, when we get a free page
> from the FSM we check if the page is really recyclable (maybe
> _bt_getbuf() already does this?). IOW, a deleted page can be recycled
> only when it's requested to be reused. If btpo.xact is a 64-bit XID we
> never need to worry about the case where a deleted page is never
> requested to be reused.

I've thought about that too (both now and in the past). You're right about _bt_getbuf() -- it checks the XID, at least on the master branch. I took that XID check out in v4 of the patch, but I am now starting to have my doubts about that particular choice. (I'm probably going to restore the XID check in _bt_getbuf in v5 of the patch.)
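(For anyone following along, the check in question is essentially _bt_page_recyclable(), which _bt_getbuf() applies to pages it pulls from the FSM. Here is a simplified sketch of the classic form of that test -- shown with the old RecentGlobalXmin comparison for clarity; current master expresses the same test through the newer GlobalVis* machinery, and with this patch the 32-bit btpo.xact becomes a 64-bit XID.)

```c
/*
 * Simplified sketch, close to the pre-v13 _bt_page_recyclable().  A deleted
 * page can only go back into circulation once no index scan still in flight
 * could land on it, approximated by comparing the deleting XID against a
 * global xmin horizon.
 */
#include "postgres.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "utils/snapmgr.h"		/* RecentGlobalXmin (pre-v14 branches) */

static bool
bt_page_recyclable_sketch(Page page)
{
	BTPageOpaque opaque;

	/* An all-zeroes page (e.g. crash during relation extension) is fair game */
	if (PageIsNew(page))
		return true;

	/* Otherwise it must be deleted, and deleted "long enough" ago */
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	if (P_ISDELETED(opaque) &&
		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
		return true;

	return false;
}
```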
I took the XID-is-recyclable check out in v4 of the patch because it might leak pages in rare cases -- which is not a new problem. _bt_getbuf() currently has a remarkably relaxed attitude about leaking pages from the FSM (it is more relaxed about it than I am, certainly) -- but why should we just accept leaking pages like that?

My new doubts about it are non-specific, though. We know that the FSM isn't crash safe -- but I think that that reduces to "practically speaking, we can never 100% trust the FSM". Which makes me nervous. I worry that the FSM can do something completely evil and crazy in rare cases. It's not just crash safety. The FSM's fsm_search_avail() function currently changes the fp_next_slot field with only a shared buffer lock held. It's an int, which is supposed to "be atomic on most platforms". But we should be using real atomic ops. So the FSM is generally... kind of wonky.

In an ideal world, nbtree page deletion + recycling would have crash safety built in. I don't think that free space management without crash safety makes sense in the case of index AMs -- it's just not worth it with whole-page units of free space (heapam is another story). A 100% crash-safe design would naturally shift the problem of nbtree page recycle safety from the producer/VACUUM side to the consumer/_bt_getbuf() side, which I agree would be a real improvement. But these long-standing FSM issues are not going to change for Postgres 14. And so changing _bt_getbuf() to do clever things with XIDs won't be possible for Postgres 14 IMV.
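(To make the fp_next_slot complaint above a little more concrete, here is a minimal sketch of what "real atomic ops" could look like for that hint field. Purely illustrative -- the real FSMPageData in fsm_internals.h keeps fp_next_slot as a plain int, and this is not a proposal for Postgres 14:)

```c
/*
 * Illustration only -- not the actual fsm_page.c code.  The idea is just
 * that a search hint updated without an exclusive lock would be better off
 * as a proper atomic than as a plain int.
 */
#include "postgres.h"
#include "port/atomics.h"

typedef struct FSMPageSketch
{
	pg_atomic_uint32 fp_next_slot;	/* hypothetical atomic next-slot hint */
	/* ... the avail-space vector would follow here ... */
} FSMPageSketch;

/* Publish a new hint; a plain atomic write is enough for advisory state. */
static inline void
fsm_set_next_slot_hint(FSMPageSketch *fsmpage, uint32 slot)
{
	pg_atomic_write_u32(&fsmpage->fp_next_slot, slot);
}

/* Read the hint, treating out-of-range values as "start from the top". */
static inline uint32
fsm_get_next_slot_hint(FSMPageSketch *fsmpage, uint32 nslots)
{
	uint32		slot = pg_atomic_read_u32(&fsmpage->fp_next_slot);

	return (slot < nslots) ? slot : 0;
}
```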
--
Peter Geoghegan