64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers

From Peter Geoghegan
Subject 64-bit XIDs in deleted nbtree pages
Date
Msg-id CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
Whole thread Raw
Responses Re: 64-bit XIDs in deleted nbtree pages
Re: 64-bit XIDs in deleted nbtree pages
List pgsql-hackers
There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].

Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.

The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.

Of course it will still be possible for the FSM to leak deleted nbtree
index pages with the patch -- in general the FSM isn't crash safe.
That isn't so bad with the patch, though, because a subsequent VACUUM
will eventually notice the really old deleted pages, and add them back
to the FSM once again. This will always happen because
VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become
confused about the age of deleted pages, even when they're really old.

The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.

Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:

"38 index pages have been deleted, 38 are currently reusable."

With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:

"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."

This means that out of the 38 of the pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.

In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).

Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.

[1] https://postgr.es/m/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Next
From: Tom Lane
Date:
Subject: Tightening up allowed custom GUC names