Re: xid wraparound danger due to INDEX_CLEANUP false - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: xid wraparound danger due to INDEX_CLEANUP false
Date:
Msg-id: CAH2-Wzn8heFbG=r+vjWeiuHcxMQ3i2kcLFqgK76ZMKZqR9tqig@mail.gmail.com
In response to: Re: xid wraparound danger due to INDEX_CLEANUP false (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Thu, Apr 16, 2020 at 3:49 PM Andres Freund <andres@anarazel.de> wrote:
> I think we really just stop being miserly and update the xid to be
> FrozenTransactionId or InvalidTransactionId when vacuum encounters one
> that's from before the the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid). That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisited those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not about the problems with the Postgres 11 and 12 features in this area -- you're talking about what happens to empty pages, not what happens to deleted pages.)

I'll say some more about the less ambitious goal of recycling pages more eagerly, as they are deleted:

An individual VACUUM operation cannot recycle a page right after _bt_pagedel() is called to delete it. VACUUM both sets a target leaf page half-dead and deletes it all at once in _bt_pagedel() (it does that much in the simple and common case). Again, this is because recycling very soon after the call to _bt_pagedel() would introduce races with concurrent index scans -- a scan could see the downlink in the parent before the child is even half-dead, and then land on the page only after VACUUM has concurrently marked it half-dead, then fully dead/deleted, and even gone as far as recycling it. So the current design makes a certain amount of sense -- we can't be super aggressive like that. (Actually, maybe it would make sense to just put the page in the FSM there and then after all -- see "Thinks some more" below.)

Even so, nothing stops the same VACUUM operation from (for example) remembering the pages it has deleted during the current scan, and then coming back at the end of the bulk scan of the index to reconsider whether it can recycle them now (2 minutes later instead of 2 months later), using a new RecentGlobalXmin (or something that's conceptually like a new RecentGlobalXmin). A rough sketch of this appears below.

Similarly, once VACUUM is taught to chain deleted pages together to optimize recycling, we could do limited VACUUMs that only visit previously deleted pages. We wouldn't have to make another pass over the entire index to recycle the pages, because of this special deleted-page linking. This is something we'd use when we have to recycle pages, but it's an "INDEX_CLEANUP = off" index VACUUM -- we don't really want to do most of the work that index vacuuming normally involves, but we must still visit the metapage to check btm_oldest_btpo_xact, and then maybe walk the deleted-page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has its own _bt_page_recyclable() check, probably because the FSM isn't crash safe.
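To make the "remember the pages deleted during this scan, and revisit them at the end" idea a bit more concrete, here is a rough sketch -- just an illustration, not a patch. The BTPendingRecycle struct and _bt_recycle_deleted_pages() are made-up names; only _bt_page_recyclable(), RecordFreeIndexPage(), and the buffer calls are existing facilities:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"
#include "storage/indexfsm.h"

/* Hypothetical: blocks deleted by the current btvacuumscan() */
typedef struct BTPendingRecycle
{
    BlockNumber *deleted;
    int          ndeleted;
} BTPendingRecycle;

/*
 * Hypothetical: at the end of the index scan, revisit each page that this
 * same VACUUM deleted.  With a newly computed horizon (conceptually a
 * "new RecentGlobalXmin"), many of the pages can go into the FSM right
 * away, instead of waiting for the next VACUUM of the same index.
 */
static void
_bt_recycle_deleted_pages(Relation rel, BTPendingRecycle *pending)
{
    for (int i = 0; i < pending->ndeleted; i++)
    {
        BlockNumber blkno = pending->deleted[i];
        Buffer      buf = ReadBuffer(rel, blkno);

        LockBuffer(buf, BT_READ);

        /* Same check that _bt_getbuf() applies before reusing a page */
        if (_bt_page_recyclable(BufferGetPage(buf)))
            RecordFreeIndexPage(rel, blkno);

        UnlockReleaseBuffer(buf);
    }
}

The details of how the newer horizon gets computed don't matter much for the sketch; the point is just that the second visit happens within the same VACUUM, rather than in the next one.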
Maybe we could improve matters by teaching _bt_pagedel() to put a page it deleted in the FSM immediately -- no need to wait until the next index VACUUM for the RecordFreeIndexPage() call. The fact that _bt_getbuf() distrusts the FSM still isn't quite enough on its own, though: we'd also have to teach _bt_getbuf() some heuristic that makes it understand that VACUUM is now designed to put stuff in the FSM immediately, so it doesn't have to wait for the next VACUUM operation to get to it. Maybe _bt_getbuf() should try the FSM a few times before giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it would make those pages reclaimable much sooner. Also, it wouldn't solve the wraparound problem, but that is a bug, not a performance/efficiency issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and have
> _bt_getbuf() do the unlinking.

That won't work, since you're now talking about pages that aren't deleted (or even half-dead) -- they're merely candidates for deletion because they're empty. So you'd have to do all the steps in _bt_pagedel() within a new _bt_getbuf() path, which would have many deadlock hazards. Unlinking the page from the tree itself (i.e. deleting it) is really complicated.

> I'm not sure if the current locking
> model assumes anywhere that there is only one process (vacuum) unlinking
> pages though?

I'm not sure, though _bt_unlink_halfdead_page() has comments supposing that there could be concurrent page deletions like that.

--
Peter Geoghegan
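PS: Here's a very rough sketch of the "try the FSM a few times before giving up and allocating a new page" heuristic, just to show the shape of it. The helper function and the retry count are made up; GetFreeIndexPage(), ConditionalLockBuffer(), and _bt_page_recyclable() are the existing calls that the P_NEW path in _bt_getbuf() uses today:

/*
 * Hypothetical helper for the P_NEW path of _bt_getbuf(): with VACUUM
 * putting deleted pages in the FSM immediately, it becomes worth being
 * a bit more persistent with the FSM before extending the relation.
 */
static Buffer
_bt_try_fsm_page(Relation rel)
{
    for (int tries = 0; tries < 3; tries++)     /* retry count is made up */
    {
        BlockNumber blkno = GetFreeIndexPage(rel);
        Buffer      buf;

        if (blkno == InvalidBlockNumber)
            break;              /* FSM has nothing to offer */

        buf = ReadBuffer(rel, blkno);
        if (ConditionalLockBuffer(buf))
        {
            /* _bt_getbuf() distrusts the FSM, so recheck recyclability */
            if (_bt_page_recyclable(BufferGetPage(buf)))
                return buf;     /* caller reinitializes the page */

            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        }
        ReleaseBuffer(buf);
    }

    return InvalidBuffer;       /* caller falls back to extending the index */
}

The existing P_NEW path already loops over FSM pages and rechecks them with _bt_page_recyclable(); the new part would just be the expectation that pages deleted by the most recent VACUUM show up in the FSM much sooner.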