64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | 64-bit XIDs in deleted nbtree pages |
Date | |
Msg-id | CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com Whole thread Raw |
Responses |
Re: 64-bit XIDs in deleted nbtree pages
Re: 64-bit XIDs in deleted nbtree pages |
List | pgsql-hackers |
There is a long standing problem with the way that nbtree page deletion places deleted pages in the FSM for recycling: The use of a 32-bit XID within the deleted page (in the special area's/BTPageOpaqueData struct's btpo.xact field) is not robust against XID wraparound, which can lead to permanently leaking pages in a variety of scenarios. The problems became worse with the addition of the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID in this context creates risk for any further improvements in VACUUM that similarly involve skipping whole indexes. For example, Masahiko has been working on a patch that teaches VACUUM to skip indexes that are known to have very little garbage [2]. Attached patch series fixes the issue once and for all. This is something that I'm targeting for Postgres 14, since it's more or less a bug fix. The first patch teaches nbtree to use 64-bit transaction IDs here, and so makes it impossible to leak deleted nbtree pages. This patch is the nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs due to exactly the same set of problems. The first patch also makes the level field stored in nbtree page's special area/BTPageOpaqueData reliably store the level, even in a deleted page. This allows me to consistently use the level field within amcheck, including even within deleted pages. Of course it will still be possible for the FSM to leak deleted nbtree index pages with the patch -- in general the FSM isn't crash safe. That isn't so bad with the patch, though, because a subsequent VACUUM will eventually notice the really old deleted pages, and add them back to the FSM once again. This will always happen because VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become confused about the age of deleted pages, even when they're really old. The second patch in the series adds new information to VACUUM VERBOSE. This makes it easy to understand what's going on here. Index page deletion related output becomes more useful. It might also help with debugging the first patch. Currently, VACUUM VERBOSE output for an index that has some page deletions looks like this: "38 index pages have been deleted, 38 are currently reusable." With the second patch applied, we might see this output at the same point in VACUUM VERBOSE output instead: "38 index pages have been deleted, 0 are newly deleted, 38 are currently reusable." This means that out of the 38 of the pages that were found to be marked deleted in the index, 0 were deleted by the VACUUM operation whose output we see here. That is, there were 0 nbtree pages that were newly marked BTP_DELETED within _bt_unlink_halfdead_page() during *this particular* VACUUM -- the VACUUM operation that we see instrumentation about here. It follows that the 38 deleted pages that we encountered must have been marked BTP_DELETED by some previous VACUUM operation. In practice the "%u are currently reusable" output should never include newly deleted pages, since there is no way that a page marked BTP_DELETED can be put in the FSM during the same VACUUM operation -- that's unsafe (we need all of this recycling/XID indirection precisely because we need to delay recycling until it is truly safe, of course). Note that the "%u index pages have been deleted" output includes both pages deleted by some previous VACUUM operation, and newly deleted pages (no change there). Note that the new "newly deleted" output is instrumentation about this particular *VACUUM operation*. In contrast, the other two existing output numbers ("deleted" and "currently reusable") are actually instrumentation about the state of the *index as a whole* at a point in time (barring concurrent recycling of pages counted in VACUUM by some random _bt_getbuf() call in another backend). This fundamental distinction is important here. All 3 numbers/stats that we output can have different values, which can be used to debug the first patch. You can directly observe uncommon cases just from the VERBOSE output, like when a long running transaction holds up recycling of a deleted page that was actually marked BTP_DELETED in an *earlier* VACUUM operation. And so if the first patch had any bugs, there'd be a pretty good chance that you could observe them using multiple VACUUM VERBOSE operations -- you might notice something inconsistent or contradictory just by examining the output over time, how things change, etc. [1] https://postgr.es/m/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com [2] https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com -- Peter Geoghegan
Attachment
pgsql-hackers by date: