Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: 64-bit XIDs in deleted nbtree pages
Date
Msg-id 21b75775-3c41-7602-7267-79d2e7105223@iki.fi
Whole thread Raw
In response to 64-bit XIDs in deleted nbtree pages  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: 64-bit XIDs in deleted nbtree pages
List pgsql-hackers
On 10/02/2021 00:14, Peter Geoghegan wrote:
> There is a long standing problem with the way that nbtree page
> deletion places deleted pages in the FSM for recycling: The use of a
> 32-bit XID within the deleted page (in the special
> area's/BTPageOpaqueData struct's btpo.xact field) is not robust
> against XID wraparound, which can lead to permanently leaking pages in
> a variety of scenarios. The problems became worse with the addition of
> the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
> in this context creates risk for any further improvements in VACUUM
> that similarly involve skipping whole indexes. For example, Masahiko
> has been working on a patch that teaches VACUUM to skip indexes that
> are known to have very little garbage [2].
> 
> Attached patch series fixes the issue once and for all. This is
> something that I'm targeting for Postgres 14, since it's more or less
> a bug fix.

Thanks for picking this up!

> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems. The first patch also makes
> the level field stored in nbtree page's special area/BTPageOpaqueData
> reliably store the level, even in a deleted page. This allows me to
> consistently use the level field within amcheck, including even within
> deleted pages.

Is it really worth the trouble to maintain 'level' on deleted pages? All 
you currently do with it is check that the BTP_LEAF flag is set iff 
"level == 0", which seems pointless. I guess there could be some 
forensic value in keeping 'level', but meh.

> The second patch in the series adds new information to VACUUM VERBOSE.
> This makes it easy to understand what's going on here. Index page
> deletion related output becomes more useful. It might also help with
> debugging the first patch.
> 
> Currently, VACUUM VERBOSE output for an index that has some page
> deletions looks like this:
> 
> "38 index pages have been deleted, 38 are currently reusable."
> 
> With the second patch applied, we might see this output at the same
> point in VACUUM VERBOSE output instead:
> 
> "38 index pages have been deleted, 0 are newly deleted, 38 are
> currently reusable."
> 
> This means that out of the 38 of the pages that were found to be
> marked deleted in the index, 0 were deleted by the VACUUM operation
> whose output we see here. That is, there were 0 nbtree pages that were
> newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
> *this particular* VACUUM -- the VACUUM operation that we see
> instrumentation about here. It follows that the 38 deleted pages that
> we encountered must have been marked BTP_DELETED by some previous
> VACUUM operation.
> 
> In practice the "%u are currently reusable" output should never
> include newly deleted pages, since there is no way that a page marked
> BTP_DELETED can be put in the FSM during the same VACUUM operation --
> that's unsafe (we need all of this recycling/XID indirection precisely
> because we need to delay recycling until it is truly safe, of course).
> Note that the "%u index pages have been deleted" output includes both
> pages deleted by some previous VACUUM operation, and newly deleted
> pages (no change there).
> 
> Note that the new "newly deleted" output is instrumentation about this
> particular *VACUUM operation*. In contrast, the other two existing
> output numbers ("deleted" and "currently reusable") are actually
> instrumentation about the state of the *index as a whole* at a point
> in time (barring concurrent recycling of pages counted in VACUUM by
> some random _bt_getbuf() call in another backend). This fundamental
> distinction is important here. All 3 numbers/stats that we output can
> have different values, which can be used to debug the first patch. You
> can directly observe uncommon cases just from the VERBOSE output, like
> when a long running transaction holds up recycling of a deleted page
> that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
> And so if the first patch had any bugs, there'd be a pretty good
> chance that you could observe them using multiple VACUUM VERBOSE
> operations -- you might notice something inconsistent or contradictory
> just by examining the output over time, how things change, etc.

The full message on master is:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.

How about:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.

The idea is that the first DETAIL line now says what the VACUUM did this 
round, and the last line says what the state of the index is now. One 
concern with that phrasing is that it might not be clear what "686 pages 
were removed" means. We don't actually shrink the file. Then again, I'm 
not sure if the "have been deleted" was any better in that regard.

It's still a bit weird that the "what VACUUM did this round" information 
is sandwiched between the two other lines that talk about the state of 
the index after the operation. But I think the language now makes it 
more clear which is which. Or perhaps flip the INFO and first DETAIL 
lines around like this:

INFO: 250000 index row versions and 686 pages were removed from index 
"foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

For context, the more full message you get on master is:

postgres=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 250000 row versions
DETAIL:  CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
INFO:  "foo": removed 250000 row versions in 1107 pages
DETAIL:  CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "foo": found 250000 removable, 271 nonremovable row versions in 
1108 out of 4425 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1164
There were 87 unused item identifiers.
Skipped 0 pages due to buffer pins, 2212 frozen pages.
0 pages are entirely empty.
CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.28 s.
VACUUM

That's pretty confusing, it's a mix of basically progress indicators 
(vacuuming "public.foo"), CPU measurements, information about what was 
removed, and what the state is afterwards. Would be nice to make that 
more clear overall. But for now, for this particular INFO message, 
perhaps make it more consistent with the lines printed by heapam, like this:

INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

- Heikki



pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: pg_get_first_normal_oid()?
Next
From: "tsunakawa.takay@fujitsu.com"
Date:
Subject: RE: Parallel INSERT (INTO ... SELECT ...)