Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: 64-bit XIDs in deleted nbtree pages |
Date | |
Msg-id | CAH2-Wzn4tT8pbMCa57XExzR=MOBK3ACBtEND5HyZed-05NodWQ@mail.gmail.com Whole thread Raw |
In response to | 64-bit XIDs in deleted nbtree pages (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: 64-bit XIDs in deleted nbtree pages
Re: 64-bit XIDs in deleted nbtree pages |
List | pgsql-hackers |
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > The first patch teaches nbtree to use 64-bit transaction IDs here, and > so makes it impossible to leak deleted nbtree pages. This patch is the > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs > due to exactly the same set of problems. There is an unresolved question for my deleted page XID patch: what should it do about the vacuum_cleanup_index_scale_factor feature, which added an XID to the metapage (its btm_oldest_btpo_xact field). I refer to the work done by commit 857f9c36cda for Postgres 11 by Masahiko. It would be good to get your opinion on this as the original author of that feature, Masahiko. To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among all deleted pages in the index, so clearly it needs to be carefully considered in my patch to make the XIDs 64-bit. Even still, v1 of my patch from today more or less ignores the issue -- it just gets a 32-bit version of the oldest value as determined by the oldestBtpoXact XID tracking stuff (which is largely unchanged, except that it works with 64-bit Full Transaction Ids now). Obviously it is still possible for the 32-bit btm_oldest_btpo_xact field to wrap around in v1 of my patch. The obvious thing to do here is to add a new epoch metapage field, effectively making btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good idea. The only reason that we have the btm_oldest_btpo_xact field in the first place is to ameliorate the problem that the patch comprehensively solves! We should stop storing *any* XIDs in the metapage. (Besides, adding a new "epoch" field to the metapage would be relatively messy.) Here is a plan that allows us to stop storing any kind of XID in the metapage in all cases: 1. Stop maintaining the oldest XID among all deleted pages in the entire nbtree index during VACUUM. So we can remove all of the BTVacState.oldestBtpoXact XID tracking stuff, which is currently something that even _bt_pagedel() needs special handling for. 2. Stop considering the btm_oldest_btpo_xact metapage field in _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only cares about maintaining reasonably accurate statistics for the index. Which is really how the vacuum_cleanup_index_scale_factor feature was intended to work all along, anyway -- ISTM that the oldestBtpoXact stuff was always just an afterthought to paper-over this annoying 32-bit XID issue. 3. We cannot actually remove the btm_oldest_btpo_xact XID field from the metapage, because of course that would change the BTMetaPageData struct layout, which breaks on-disk compatibility. But why not use it for something useful instead? _bt_update_meta_cleanup_info() can use the same field to store the number of "newly deleted" pages from the last btbulkdelete() instead. (See my email from earlier for the definition of "newly deleted".) 4. Now _bt_vacuum_needs_cleanup() can once again consider the btm_oldest_btpo_xact metapage field -- except in a totally different way, because now it means something totally different: "newly deleted pages during last btbulkdelete() call" (per item 3). If this # pages is very high then we probably should do a full call to btvacuumscan() -- _bt_vacuum_needs_cleanup() will return true to make that happen. It's unlikely but still possible that a high number of "newly deleted pages during the last btbulkdelete() call" is in itself a good enough reason to do a full btvacuumscan() call when the question of calling btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4 here conservatively covers that. Maybe the 32-bit-XID-in-metapage triggering condition had some non-obvious value due to a natural tendency for it to limit the number of deleted pages that go unrecycled for a long time. (Or maybe there never really was any such natural tendency -- still seems like a good idea to make the change described by item 4.) Even though we are conservative (at least in this sense I just described), we nevertheless don't actually care about very old deleted pages that we have not yet recycled -- provided there are not very many of them. I'm thinking of "~2% of index" as the new "newly deleted during last btbulkdelete() call" threshold applied within _bt_vacuum_needs_cleanup(). There is no good reason why older deleted-but-not-yet-recycled pages should be considered more valuable than any other page that can be used when there is a page split. Observations about on-disk compatibility with my patch + this 4 point scheme: A. It doesn't matter that pg_upgrade'd indexes will have an XID value in btm_oldest_btpo_xact that now gets incorrectly interpreted as "newly deleted pages during last btbulkdelete() call" under the 4 point scheme I just outlined. The spurious value will get cleaned up on the next VACUUM anyway (whether VACUUM goes through btbulkdelete() or through btvacuumcleanup()). Besides, most indexes always have a btm_oldest_btpo_xact value of 0. B. The patch I posted earlier doesn't actually care about the BTREE_VERSION of the index at all. And neither does any of the stuff I just described for a future v2 of my patch. All indexes can use the new format for deleted pages. On-disk compatibility is easy here because the contents of deleted pages only need to work as a tombstone. We can safely assume that old-format deleted pages (pre-Postgres 14 format deleted pages) must be safe to recycle, because the pg_upgrade itself restarts Postgres. There can be no backends that have dangling references to the old-format deleted page. C. All supported nbtree versions (all nbtree versions BTREE_MIN_VERSION+) get the same benefits under this scheme. Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to BTREE_NOVAC_VERSION/version 3 indexes via a call to _bt_upgrademetapage() -- that has been the case since BTREE_VERSION was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will have the btm_oldest_btpo_xact metapage field that I now propose to reuse to track "newly deleted pages during last btbulkdelete() call", per point 4. In summary: There are no special cases here. No BTREE_VERSION related difficulties. That seems like a huge advantage to me. -- Peter Geoghegan
pgsql-hackers by date: