Re: xid wraparound danger due to INDEX_CLEANUP false - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: xid wraparound danger due to INDEX_CLEANUP false |
Date | |
Msg-id | 20200429215559.yuzhunlyyqkmxhws@alap3.anarazel.de Whole thread Raw |
In response to | Re: xid wraparound danger due to INDEX_CLEANUP false (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: xid wraparound danger due to INDEX_CLEANUP false
|
List | pgsql-hackers |
Hi, On 2020-04-29 13:40:55 -0700, Peter Geoghegan wrote: > On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > > > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from > > > bpto.xact) when it recycles deleted pages. It simply puts them in the > > > FSM without changing anything about the page itself. This means > > > surprisingly little in the context of nbtree: the > > > _bt_page_recyclable() XID check that takes place in btvacuumpage() > > > also takes place in _bt_getbuf(), at the point where the page actually > > > gets recycled by the client. That's not great. > > > > I think it's quite foolish for btvacuumpage() to not freeze xids. If we > > only do so when necessary (i.e. older than a potential new relfrozenxid, > > and only when the vacuum didn't yet skip pages), the costs are pretty > > miniscule. > > I wonder if we should just bite the bullet and mark pages that are > recycled by VACUUM as explicitly recycled, with WAL-logging and > everything (this is like freezing, but stronger). That way, the > _bt_page_recyclable() call within _bt_getbuf() would only be required > to check that state (while btvacuumpage() would use something like a > _bt_page_eligible_for_recycling(), which would do the same thing as > the current _bt_page_recyclable()). I'm not sure I see the advantage. Only doing so in the freezing case seems unlikely to cause additional conflicts, but I'm less sure about doing it always. btpo.xact will often be quite recent, right? So it'd likely cause more conflicts. > I'm not sure how low the costs would be, but at least we'd only have > to do it once per already-deleted page (i.e. only the first time a > VACUUM operation found _bt_page_eligible_for_recycling() returned true > for the page and marked it recycled in a crash safe manner). That > design would be quite a lot simpler, because it expresses the problem > in terms that make sense to the nbtree code. _bt_getbuf() should not > have to make a distinction between "possibly recycled" versus > "definitely recycled". I don't really see the problem with the check in _bt_getbuf()? I'd actually like to be *more* aggressive about putting pages in the FSM (or whatever), and that'd probably require checks like this. E.g. whenever we unlink a page, we should put it into the FSM (or something different, see below). And then do all the necessary checks in _bt_getbuf(). It's pretty sad that one right now basically needs to vacuum twice to reuse pages in nbtree (once to delete the page, once to record it in the fsm). Usually the xmin horizon should advance much more quickly than that, allowing reuse earlier. As far as I can tell, even just adding them to the FSM when setting ISDELETED would be advantageous. There's obviously that we'll cause backends to visit a lot of pages that can't actually be reused... But if we did what I suggest below, that danger probably could largely be avoided. > It makes sense that the FSM is not crash safe, I suppose, but we're > talking about relatively large amounts of free space here. Can't we > just do it properly/reliably? What do you mean by that? To have the FSM be crash-safe? It could make sense to just not have the FSM, and have a linked-list style stack of pages reachable from the meta page. That'd be especially advantageous if we kept xids associated with the "about to be recyclable" pages in the metapage, so it'd be cheap to evaluate. There's possibly some not-entirely-trivial locking concerns around such a list. Adding entries seems easy enough, because we currently only delete pages from within vacuum. But popping entries could be more complicated, I'm not exactly sure if there are potential lock nesting issues (nor am I actually sure there aren't existing ones). Greetings, Andres Freund
pgsql-hackers by date: