Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date:
Msg-id: CAD21AoA9g=+oE9xgzJVzS62oi8Rx8_zqyDQ3BkaTLt+Qd+_M1Q@mail.gmail.com
In response to: Re: 64-bit XIDs in deleted nbtree pages (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: 64-bit XIDs in deleted nbtree pages
List: pgsql-hackers
On Tue, Feb 16, 2021 at 12:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Yes. I think this would simplify the problem by resolving almost all
> > problems related to indefinitely deferring page recycling.
> >
> > We will be able to recycle almost all just-deleted pages in practice,
> > especially when btvacuumscan() took a long time. And there would not
> > be a noticeable downside, I think.
>
> Great!
>
> > BTW if the btree index starts to use maintenance_work_mem for this
> > purpose, we also need to set amusemaintenanceworkmem to true, which
> > is considered for parallel vacuum.
>
> I was just going to use work_mem. This should be okay. Note that
> CREATE INDEX uses an additional work_mem allocation when building a
> unique index, for the second spool/tuplesort. That seems like a
> precedent that I can follow here.
>
> Right now the BTPendingRecycle structs the patch uses to store
> information about a page that the current VACUUM deleted (and may yet
> be able to place in the FSM) are each 16 bytes (including alignment
> overhead). I could probably make them smaller with a little work, but
> even now that's quite small. Even with the default 4MiB work_mem
> setting we can fit information about 262144 pages all at once. That's
> 2GiB worth of deleted index pages, which is generally much more than
> we'll need.

Cool.

> > Yeah, increasing the threshold would solve the problem in most cases.
> > Given that nbtree index page deletion is unlikely to happen in
> > practice, having the threshold at 5% or 10% seems to avoid the
> > problem in nearly 100% of cases, I think.
>
> Of course it all depends on workload/index characteristics, in the
> end. It is very rare to delete a percentage of the index that exceeds
> autovacuum_vacuum_scale_factor -- that's the important thing here IMV.
>
> > Another idea I came up with (maybe on top of your idea above) is to
> > change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
> > btpo.xact XID among all deleted pages when the total amount of
> > deleted pages exceeds 2% of the index. That way, we can surely
> > recycle more than 2% of the index when that XID becomes older than
> > the global xmin.
>
> You could make my basic approach to recycling deleted pages earlier
> (ideally at the end of the same btvacuumscan() that deleted the pages
> in the first place) more sophisticated in a variety of ways. These are
> all subject to diminishing returns, though.
>
> I've already managed to recycle close to 100% of all B-Tree pages
> during the same VACUUM with a very simple approach -- at least if we
> assume BenchmarkSQL is representative. It is hard to know how much
> more effort can be justified. To be clear, I'm not saying that an
> improved design cannot be justified now or in the future (BenchmarkSQL
> is not like many workloads that people use Postgres for). I'm just
> saying that I *don't know* where to draw the line. Any particular
> place that we draw the line feels a little arbitrary to me. This
> includes my own choice of the work_mem-limited BTPendingRecycle array.
> My patch currently works that way because it's simple -- no other
> reason.
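To make the sizing discussion above concrete, here is a minimal
standalone C sketch of the arithmetic. The struct's field layout is an
assumption (the thread only says each entry is 16 bytes after
alignment); only the sizes and the work_mem figure come from the
discussion above.

/*
 * Hypothetical 16-byte per-page entry; not the patch's actual
 * BTPendingRecycle definition.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct BTPendingRecycleSketch
{
    uint32_t    blkno;      /* block number of the deleted page */
    uint64_t    safexid;    /* 64-bit XID that must be older than the
                             * global xmin before the page is recyclable */
} BTPendingRecycleSketch;   /* 4 + 4 (padding) + 8 = 16 bytes */

int
main(void)
{
    size_t      work_mem = 4UL * 1024 * 1024;   /* default 4MiB work_mem */
    size_t      nentries = work_mem / sizeof(BTPendingRecycleSketch);

    /* 262144 entries, each tracking one 8KiB page, i.e. 2GiB of pages */
    printf("entries: %zu, pages covered: %zu MiB\n",
           nentries, nentries * 8192 / (1024 * 1024));
    return 0;
}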
> Any scheme to further improve the "work_mem-limited BTPendingRecycle
> array" design from my patch boils down to this: A new approach that
> makes recycling of any remaining deleted pages take place "before too
> long": after the end of the btvacuumscan() BTPendingRecycle array
> stuff (presumably that didn't work out in cases where an improved
> approach matters), but before the next VACUUM takes place (since that
> will do the required recycling anyway, unless it's unable to do any
> work at all, in which case it hardly matters).

I agree with this direction.

> Here are two ideas of my own in this same class as your idea:
>
> 1. Remember to do some of the BTPendingRecycle array FSM processing
> stuff in btvacuumcleanup() -- defer some of the recycling of pages
> recorded in BTPendingRecycle entries (pages deleted during
> btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.
>
> Right now btvacuumcleanup() will always do nothing when btbulkdelete()
> was called earlier. But that's just a current nbtree convention, and
> is no reason not to do this (we don't have to scan the index again at
> all). The advantage of teaching btvacuumcleanup() to do this is that
> it delays the "BTPendingRecycle array FSM processing" stuff until the
> last moment that it is still easy to use the in-memory array (because
> we haven't freed it yet). In general, doing it later makes it more
> likely that we'll successfully recycle the pages. Though in general it
> might not make any difference -- so we're still hoping that the
> workload allows us to recycle everything we deleted, without making
> the design much more complicated than what I posted already.
>
> (BTW I see that you reviewed commit 4e514c61, so you must have thought
> about the trade-off between doing deferred recycling in
> amvacuumcleanup() vs ambulkdelete(), when to call
> IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
> implement this idea while calling IndexFreeSpaceMapVacuum() during
> both btvacuumcleanup() and btbulkdelete(), so that we get the best of
> both worlds -- fast recycling *and* more delayed processing that is
> more likely to ultimately succeed.)

I think idea 1 also requires serializing the BTPendingRecycle array
somewhere so that it can be passed to a parallel vacuum worker in the
parallel vacuum case. Delaying the "BTPendingRecycle array FSM
processing" stuff until btvacuumcleanup() is a good idea. But I think
it's relatively rare in practice for index vacuum to run more than once
(i.e., when maintenance_work_mem is used up). So, considering the
development cost of serializing the BTPendingRecycle array and the
index AM API changes, attempting to recycle the deleted pages at the
end of btvacuumscan() seems like a balanced strategy.

> 2. Remember/serialize the BTPendingRecycle array when we realize that
> we cannot put all recyclable pages in the FSM at the end of the
> current btvacuumscan(), and then use an autovacuum work item to
> process them before too long -- a call to AutoVacuumRequestWork()
> could even serialize the data on disk.
>
> Idea 2 has the advantage of allowing retries -- eventually it will be
> safe to recycle the pages, if we just wait long enough.

This is a good idea too. Perhaps autovacuum would need to end with an
error so that it retries later in the case where it could not recycle
all the deleted pages.
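As a rough illustration of what the "BTPendingRecycle array FSM
processing" step could look like (whether it runs at the end of
btvacuumscan() or is deferred to btvacuumcleanup() as in idea 1), here
is a hedged backend-side sketch, not the patch's code.
BTPendingRecycleSketch (from the sketch above) and XidIsNowRecyclable()
are hypothetical placeholders; RecordFreeIndexPage() and
IndexFreeSpaceMapVacuum() are existing PostgreSQL FSM routines.

/* assumed headers: postgres.h, storage/indexfsm.h, utils/rel.h */

static void
pending_recycle_fsm_process(Relation rel,
                            BTPendingRecycleSketch *entries,
                            int nentries)
{
    bool        placed_any = false;

    for (int i = 0; i < nentries; i++)
    {
        /*
         * Only pages whose deletion XID is already older than the
         * global xmin can safely go into the FSM.  The check itself is
         * a placeholder here.
         */
        if (!XidIsNowRecyclable(entries[i].safexid))
            continue;

        RecordFreeIndexPage(rel, entries[i].blkno);
        placed_any = true;
    }

    /* Make the newly recorded free pages visible to FSM searchers */
    if (placed_any)
        IndexFreeSpaceMapVacuum(rel);
}

Entries that fail the check are simply dropped in this sketch; idea 2
above is essentially about remembering them (e.g. via
AutoVacuumRequestWork()) instead of dropping them.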
I have also thought about the idea of storing pending-recycle pages
somewhere to avoid an index scan when we do the XID-is-recyclable
check. My idea was to store them in btree pages dedicated to this
purpose, linked from the meta page, but I prefer your idea.

> Anyway, I'm probably not going to pursue either of the 2 ideas for
> Postgres 14. I'm mentioning these ideas now because the trade-offs
> show that there is no perfect design for this deferred recycling
> stuff. Whatever we do, we should accept that there is no perfect
> design.
>
> Actually, there is one more reason why I bring up idea 1 now: I want
> to hear your thoughts on the index AM API questions now, which idea 1
> touches on. Ideally all of the details around the index AM VACUUM APIs
> (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> btbulkdelete()) won't need to change much in the future. I worry about
> getting this index AM API stuff right, at least a little.

Since the introduction of parallel vacuum, index AMs are not able to
pass arbitrary information taken in ambulkdelete() to
amvacuumcleanup(), as the old gist index code did. If there is a good
use case where an index AM needs to pass arbitrary information to
amvacuumcleanup(), I think it'd be a good idea to add an index AM API
so that parallel vacuum can serialize it and pass it to another
parallel vacuum worker. But, as I mentioned above, given that vacuum
calls ambulkdelete() only once in most cases, and that I think we'd
like to improve how TIDs are stored in the maintenance_work_mem space
(discussed a little on thread [1]), delaying the "BTPendingRecycle
array FSM processing" stuff until btvacuumcleanup() would not be a good
use case for it.

> > Also, maybe we can record deleted pages in the FSM even without
> > deferring, and check them when re-using. That is, when we get a free
> > page from the FSM we check if the page is really recyclable (maybe
> > _bt_getbuf() already does this?). IOW, a deleted page can be
> > recycled only when it's requested to be reused. If btpo.xact is a
> > 64-bit XID we never need to worry about the case where a deleted
> > page is never requested to be reused.
>
> I've thought about that too (both now and in the past). You're right
> about _bt_getbuf() -- it checks the XID, at least on the master
> branch. I took that XID check out in v4 of the patch, but I am now
> starting to have my doubts about that particular choice. (I'm probably
> going to restore the XID check in _bt_getbuf in v5 of the patch.)
>
> I took the XID-is-recyclable check out in v4 of the patch because it
> might leak pages in rare cases -- which is not a new problem.
> _bt_getbuf() currently has a remarkably relaxed attitude about leaking
> pages from the FSM (it is more relaxed about it than I am, certainly)
> -- but why should we just accept leaking pages like that? My new
> doubts about it are non-specific, though. We know that the FSM isn't
> crash safe -- but I think that that reduces to "practically speaking,
> we can never 100% trust the FSM". Which makes me nervous. I worry that
> the FSM can do something completely evil and crazy in rare cases.
>
> It's not just crash safety. The FSM's fsm_search_avail() function
> currently changes the fp_next_slot field with only a shared buffer
> lock held. It's an int, which is supposed to "be atomic on most
> platforms". But we should be using real atomic ops. So the FSM is
> generally...kind of wonky.
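For reference, the consumer-side approach quoted above (re-checking
recyclability only when a page actually comes back out of the FSM) has
roughly the following shape. This is a simplified sketch only: buffer
locking and the real recyclability test (what _bt_page_recyclable()
does on master) are elided, and page_is_safe_to_recycle() is a
hypothetical placeholder, not a PostgreSQL function.

/* assumed headers: postgres.h, storage/bufmgr.h, storage/indexfsm.h */

static Buffer
get_reusable_page_sketch(Relation rel)
{
    BlockNumber blkno;

    /* Loop until the FSM runs dry or hands us a truly recyclable page */
    while ((blkno = GetFreeIndexPage(rel)) != InvalidBlockNumber)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Page        page = BufferGetPage(buf);

        if (page_is_safe_to_recycle(page))
            return buf;         /* caller re-initializes and reuses it */

        /*
         * Not yet recyclable: some in-flight scan might still follow a
         * stale sibling link to this page.  Skipping it here
         * effectively "leaks" the page from the FSM until a later
         * VACUUM records it again, which is the relaxed behavior
         * described above.
         */
        ReleaseBuffer(buf);
    }

    return InvalidBuffer;       /* caller must extend the relation instead */
}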
> In an ideal world, nbtree page deletion + recycling would have crash
> safety built in. I don't think that it makes sense to have free space
> management without crash safety in the case of index AMs, because
> it's just not worth it with whole-page units of free space (heapam is
> another story). A 100% crash-safe design would naturally shift the
> problem of nbtree page recycle safety from the producer/VACUUM side
> to the consumer/_bt_getbuf() side, which I agree would be a real
> improvement. But these long-standing FSM issues are not going to
> change for Postgres 14. And so changing _bt_getbuf() to do clever
> things with XIDs won't be possible for Postgres 14 IMV.

Agreed. Thanks for your explanation.

Regards,

[1] https://www.postgresql.org/message-id/flat/CA%2Bfd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/