Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: 64-bit XIDs in deleted nbtree pages
Date
Msg-id CAH2-Wzn4tT8pbMCa57XExzR=MOBK3ACBtEND5HyZed-05NodWQ@mail.gmail.com
Whole thread Raw
In response to 64-bit XIDs in deleted nbtree pages  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: 64-bit XIDs in deleted nbtree pages
Re: 64-bit XIDs in deleted nbtree pages
List pgsql-hackers
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems.

There is an unresolved question for my deleted page XID patch: what
should it do about the vacuum_cleanup_index_scale_factor feature,
which added an XID to the metapage (its btm_oldest_btpo_xact field). I
refer to the work done by commit 857f9c36cda for Postgres 11 by
Masahiko. It would be good to get your opinion on this as the original
author of that feature, Masahiko.

To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
all deleted pages in the index, so clearly it needs to be carefully
considered in my patch to make the XIDs 64-bit. Even still, v1 of my
patch from today more or less ignores the issue -- it just gets a
32-bit version of the oldest value as determined by the oldestBtpoXact
XID tracking stuff (which is largely unchanged, except that it works
with 64-bit Full Transaction Ids now).

Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
field to wrap around in v1 of my patch. The obvious thing to do here
is to add a new epoch metapage field, effectively making
btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
idea. The only reason that we have the btm_oldest_btpo_xact field in
the first place is to ameliorate the problem that the patch
comprehensively solves! We should stop storing *any* XIDs in the
metapage. (Besides, adding a new "epoch" field to the metapage would
be relatively messy.)

Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:

1. Stop maintaining the oldest XID among all deleted pages in the
entire nbtree index during VACUUM. So we can remove all of the
BTVacState.oldestBtpoXact XID tracking stuff, which is currently
something that even _bt_pagedel() needs special handling for.

2. Stop considering the btm_oldest_btpo_xact metapage field in
_bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
cares about maintaining reasonably accurate statistics for the index.
Which is really how the vacuum_cleanup_index_scale_factor feature was
intended to work all along, anyway -- ISTM that the oldestBtpoXact
stuff was always just an afterthought to paper-over this annoying
32-bit XID issue.

3. We cannot actually remove the btm_oldest_btpo_xact XID field from
the metapage, because of course that would change the BTMetaPageData
struct layout, which breaks on-disk compatibility. But why not use it
for something useful instead? _bt_update_meta_cleanup_info() can use
the same field to store the number of "newly deleted" pages from the
last btbulkdelete() instead. (See my email from earlier for the
definition of "newly deleted".)

4. Now _bt_vacuum_needs_cleanup() can once again consider the
btm_oldest_btpo_xact metapage field -- except in a totally different
way, because now it means something totally different: "newly deleted
pages during last btbulkdelete() call" (per item 3). If this # pages
is very high then we probably should do a full call to btvacuumscan()
-- _bt_vacuum_needs_cleanup() will return true to make that happen.

It's unlikely but still possible that a high number of "newly deleted
pages during the last btbulkdelete() call" is in itself a good enough
reason to do a full btvacuumscan() call when the question of calling
btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
here conservatively covers that. Maybe the 32-bit-XID-in-metapage
triggering condition had some non-obvious value due to a natural
tendency for it to limit the number of deleted pages that go
unrecycled for a long time. (Or maybe there never really was any such
natural tendency -- still seems like a good idea to make the change
described by item 4.)

Even though we are conservative (at least in this sense I just
described), we nevertheless don't actually care about very old deleted
pages that we have not yet recycled -- provided there are not very
many of them. I'm thinking of "~2% of index" as the new "newly deleted
during last btbulkdelete() call" threshold applied within
_bt_vacuum_needs_cleanup(). There is no good reason why older
deleted-but-not-yet-recycled pages should be considered more valuable
than any other page that can be used when there is a page split.

Observations about on-disk compatibility with my patch + this 4 point scheme:

A. It doesn't matter that pg_upgrade'd indexes will have an XID value
in btm_oldest_btpo_xact that now gets incorrectly interpreted as
"newly deleted pages during last btbulkdelete() call" under the 4
point scheme I just outlined.

The spurious value will get cleaned up on the next VACUUM anyway
(whether VACUUM goes through btbulkdelete() or through
btvacuumcleanup()). Besides, most indexes always have a
btm_oldest_btpo_xact value of 0.

B. The patch I posted earlier doesn't actually care about the
BTREE_VERSION of the index at all. And neither does any of the stuff I
just described for a future v2 of my patch.

All indexes can use the new format for deleted pages. On-disk
compatibility is easy here because the contents of deleted pages only
need to work as a tombstone. We can safely assume that old-format
deleted pages (pre-Postgres 14 format deleted pages) must be safe to
recycle, because the pg_upgrade itself restarts Postgres. There can be
no backends that have dangling references to the old-format deleted
page.

C. All supported nbtree versions (all nbtree versions
BTREE_MIN_VERSION+) get the same benefits under this scheme.

Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
BTREE_NOVAC_VERSION/version 3 indexes via a call to
_bt_upgrademetapage() -- that has been the case since BTREE_VERSION
was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
have the btm_oldest_btpo_xact metapage field that I now propose to
reuse to track "newly deleted pages during last btbulkdelete() call",
per point 4.

In summary: There are no special cases here. No BTREE_VERSION related
difficulties. That seems like a huge advantage to me.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: adding wait_start column to pg_locks
Next
From: Peter Smith
Date:
Subject: Re: Single transaction in the tablesync worker?