Thread: 64-bit XIDs in deleted nbtree pages
There is a long-standing problem with the way that nbtree page deletion places deleted pages in the FSM for recycling: the use of a 32-bit XID within the deleted page (in the special area's/BTPageOpaqueData struct's btpo.xact field) is not robust against XID wraparound, which can lead to permanently leaking pages in a variety of scenarios. The problems became worse with the addition of the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID in this context creates risk for any further improvements in VACUUM that similarly involve skipping whole indexes. For example, Masahiko has been working on a patch that teaches VACUUM to skip indexes that are known to have very little garbage [2].

The attached patch series fixes the issue once and for all. This is something that I'm targeting for Postgres 14, since it's more or less a bug fix.

The first patch teaches nbtree to use 64-bit transaction IDs here, and so makes it impossible to leak deleted nbtree pages. This patch is the nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs due to exactly the same set of problems. The first patch also makes the level field stored in an nbtree page's special area/BTPageOpaqueData reliably store the level, even in a deleted page. This allows me to consistently use the level field within amcheck, including even within deleted pages.

Of course it will still be possible for the FSM to leak deleted nbtree index pages with the patch -- in general the FSM isn't crash safe. That isn't so bad with the patch, though, because a subsequent VACUUM will eventually notice the really old deleted pages, and add them back to the FSM once again. This will always happen because VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become confused about the age of deleted pages, even when they're really old.

The second patch in the series adds new information to VACUUM VERBOSE. This makes it easy to understand what's going on here. Index page deletion related output becomes more useful. It might also help with debugging the first patch.

Currently, VACUUM VERBOSE output for an index that has some page deletions looks like this:

"38 index pages have been deleted, 38 are currently reusable."

With the second patch applied, we might see this output at the same point in VACUUM VERBOSE output instead:

"38 index pages have been deleted, 0 are newly deleted, 38 are currently reusable."

This means that out of the 38 pages that were found to be marked deleted in the index, 0 were deleted by the VACUUM operation whose output we see here. That is, there were 0 nbtree pages that were newly marked BTP_DELETED within _bt_unlink_halfdead_page() during *this particular* VACUUM -- the VACUUM operation that we see instrumentation about here. It follows that the 38 deleted pages that we encountered must have been marked BTP_DELETED by some previous VACUUM operation.

In practice the "%u are currently reusable" output should never include newly deleted pages, since there is no way that a page marked BTP_DELETED can be put in the FSM during the same VACUUM operation -- that's unsafe (we need all of this recycling/XID indirection precisely because we need to delay recycling until it is truly safe, of course). Note that the "%u index pages have been deleted" output includes both pages deleted by some previous VACUUM operation, and newly deleted pages (no change there).

Note that the new "newly deleted" output is instrumentation about this particular *VACUUM operation*. In contrast, the other two existing output numbers ("deleted" and "currently reusable") are actually instrumentation about the state of the *index as a whole* at a point in time (barring concurrent recycling of pages counted in VACUUM by some random _bt_getbuf() call in another backend). This fundamental distinction is important here. All 3 numbers/stats that we output can have different values, which can be used to debug the first patch. You can directly observe uncommon cases just from the VERBOSE output, like when a long-running transaction holds up recycling of a deleted page that was actually marked BTP_DELETED in an *earlier* VACUUM operation. And so if the first patch had any bugs, there'd be a pretty good chance that you could observe them using multiple VACUUM VERBOSE operations -- you might notice something inconsistent or contradictory just by examining the output over time, how things change, etc.

[1] https://postgr.es/m/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com

--
Peter Geoghegan
Attachment
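To make the wraparound hazard concrete, here is a standalone C sketch of the difference between the old 32-bit comparison and a 64-bit one. The types and helper names are hypothetical stand-ins, not the actual nbtree code; the point is only that a modular 32-bit comparison can start treating a very old deleted page's XID as being "in the future", while a plain 64-bit comparison cannot be fooled that way.

/*
 * Standalone sketch (not PostgreSQL code) of why a 32-bit XID stored in a
 * deleted page is fragile, and why a 64-bit "full" XID is not.  All names
 * here are hypothetical stand-ins for the real nbtree definitions.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;     /* 32-bit XID, wraps around */
typedef uint64_t FullTransactionId; /* epoch + XID, never wraps in practice */

/*
 * 32-bit variant: circular comparison.  If VACUUM does not revisit the page
 * before ~2 billion XIDs have been consumed, the stored XID appears to be
 * "in the future" again and the page looks unrecyclable forever.
 */
static bool
recyclable32(TransactionId pagexact, TransactionId oldest_running_xid)
{
    int32_t diff = (int32_t) (oldest_running_xid - pagexact);

    return diff > 0;        /* page XID precedes the oldest running XID? */
}

/* 64-bit variant: a plain comparison that cannot be fooled by wraparound */
static bool
recyclable64(FullTransactionId pagexact, FullTransactionId oldest_running_fxid)
{
    return pagexact < oldest_running_fxid;
}

int
main(void)
{
    /* Page deleted at XID 100, then ~3 billion XIDs are consumed */
    TransactionId     pagexact32 = 100;
    TransactionId     oldest32 = 100 + 3000000000U;   /* wrapped past pagexact */
    FullTransactionId pagexact64 = 100;
    FullTransactionId oldest64 = 100 + 3000000000ULL; /* no wraparound */

    printf("32-bit says recyclable: %d\n", recyclable32(pagexact32, oldest32));
    printf("64-bit says recyclable: %d\n", recyclable64(pagexact64, oldest64));
    return 0;
}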
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems.

There is an unresolved question for my deleted page XID patch: what should it do about the vacuum_cleanup_index_scale_factor feature, which added an XID to the metapage (its btm_oldest_btpo_xact field). I refer to the work done by commit 857f9c36cda for Postgres 11 by Masahiko. It would be good to get your opinion on this as the original author of that feature, Masahiko.

To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among all deleted pages in the index, so clearly it needs to be carefully considered in my patch to make the XIDs 64-bit. Even still, v1 of my patch from today more or less ignores the issue -- it just gets a 32-bit version of the oldest value as determined by the oldestBtpoXact XID tracking stuff (which is largely unchanged, except that it works with 64-bit Full Transaction Ids now).

Obviously it is still possible for the 32-bit btm_oldest_btpo_xact field to wrap around in v1 of my patch. The obvious thing to do here is to add a new epoch metapage field, effectively making btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good idea. The only reason that we have the btm_oldest_btpo_xact field in the first place is to ameliorate the problem that the patch comprehensively solves! We should stop storing *any* XIDs in the metapage. (Besides, adding a new "epoch" field to the metapage would be relatively messy.)

Here is a plan that allows us to stop storing any kind of XID in the metapage in all cases:

1. Stop maintaining the oldest XID among all deleted pages in the entire nbtree index during VACUUM. So we can remove all of the BTVacState.oldestBtpoXact XID tracking stuff, which is currently something that even _bt_pagedel() needs special handling for.

2. Stop considering the btm_oldest_btpo_xact metapage field in _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only cares about maintaining reasonably accurate statistics for the index. Which is really how the vacuum_cleanup_index_scale_factor feature was intended to work all along, anyway -- ISTM that the oldestBtpoXact stuff was always just an afterthought to paper-over this annoying 32-bit XID issue.

3. We cannot actually remove the btm_oldest_btpo_xact XID field from the metapage, because of course that would change the BTMetaPageData struct layout, which breaks on-disk compatibility. But why not use it for something useful instead? _bt_update_meta_cleanup_info() can use the same field to store the number of "newly deleted" pages from the last btbulkdelete() instead. (See my email from earlier for the definition of "newly deleted".)

4. Now _bt_vacuum_needs_cleanup() can once again consider the btm_oldest_btpo_xact metapage field -- except in a totally different way, because now it means something totally different: "newly deleted pages during last btbulkdelete() call" (per item 3). If this # pages is very high then we probably should do a full call to btvacuumscan() -- _bt_vacuum_needs_cleanup() will return true to make that happen.
It's unlikely but still possible that a high number of "newly deleted pages during the last btbulkdelete() call" is in itself a good enough reason to do a full btvacuumscan() call when the question of calling btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4 here conservatively covers that. Maybe the 32-bit-XID-in-metapage triggering condition had some non-obvious value due to a natural tendency for it to limit the number of deleted pages that go unrecycled for a long time. (Or maybe there never really was any such natural tendency -- still seems like a good idea to make the change described by item 4.)

Even though we are conservative (at least in this sense I just described), we nevertheless don't actually care about very old deleted pages that we have not yet recycled -- provided there are not very many of them. I'm thinking of "~2% of index" as the new "newly deleted during last btbulkdelete() call" threshold applied within _bt_vacuum_needs_cleanup(). There is no good reason why older deleted-but-not-yet-recycled pages should be considered more valuable than any other page that can be used when there is a page split.

Observations about on-disk compatibility with my patch + this 4 point scheme:

A. It doesn't matter that pg_upgrade'd indexes will have an XID value in btm_oldest_btpo_xact that now gets incorrectly interpreted as "newly deleted pages during last btbulkdelete() call" under the 4 point scheme I just outlined.

The spurious value will get cleaned up on the next VACUUM anyway (whether VACUUM goes through btbulkdelete() or through btvacuumcleanup()). Besides, most indexes always have a btm_oldest_btpo_xact value of 0.

B. The patch I posted earlier doesn't actually care about the BTREE_VERSION of the index at all. And neither does any of the stuff I just described for a future v2 of my patch.

All indexes can use the new format for deleted pages. On-disk compatibility is easy here because the contents of deleted pages only need to work as a tombstone. We can safely assume that old-format deleted pages (pre-Postgres 14 format deleted pages) must be safe to recycle, because the pg_upgrade itself restarts Postgres. There can be no backends that have dangling references to the old-format deleted page.

C. All supported nbtree versions (all nbtree versions BTREE_MIN_VERSION+) get the same benefits under this scheme.

Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to BTREE_NOVAC_VERSION/version 3 indexes via a call to _bt_upgrademetapage() -- that has been the case since BTREE_VERSION was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will have the btm_oldest_btpo_xact metapage field that I now propose to reuse to track "newly deleted pages during last btbulkdelete() call", per point 4.

In summary: There are no special cases here. No BTREE_VERSION related difficulties. That seems like a huge advantage to me.

--
Peter Geoghegan
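A rough standalone sketch of points 3 and 4 (made-up struct and function names, not the real BTMetaPageData or nbtree routines): the 32-bit slot that used to hold an XID holds a page count instead, and the "cleanup needed?" decision is driven by the ~2% threshold described above.

/*
 * Standalone sketch of the metapage-field reuse idea.  The struct below is
 * a simplified stand-in for the real nbtree metapage; only the repurposed
 * field and its companion statistic are shown.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

typedef struct SketchMetaPage
{
    /*
     * Physically the same 32-bit slot that used to hold the oldest
     * btpo.xact (an XID); now it holds a page count, so it cannot wrap.
     */
    uint32_t    last_cleanup_num_delpages;
    double      last_cleanup_num_heap_tuples;
} SketchMetaPage;

/* Written once at the end of each VACUUM's btvacuumscan()-style pass */
static void
update_cleanup_info(SketchMetaPage *meta, uint32_t pages_deleted_not_free,
                    double num_heap_tuples)
{
    meta->last_cleanup_num_delpages = pages_deleted_not_free;
    meta->last_cleanup_num_heap_tuples = num_heap_tuples;
}

/*
 * "Cleanup needed?" question for a cleanup-only VACUUM: scan the index when
 * deleted-but-not-recycled pages exceed ~2% of the index.
 */
static bool
vacuum_needs_cleanup(const SketchMetaPage *meta, BlockNumber num_index_pages)
{
    return meta->last_cleanup_num_delpages > 0.02 * num_index_pages;
}

int
main(void)
{
    SketchMetaPage meta = {0};

    update_cleanup_info(&meta, 250, 1000000.0); /* 250 deleted pages left over */
    printf("10000 page index needs cleanup: %d\n",
           vacuum_needs_cleanup(&meta, 10000)); /* 250 > 200 -> yes */
    printf("20000 page index needs cleanup: %d\n",
           vacuum_needs_cleanup(&meta, 20000)); /* 250 < 400 -> no */
    return 0;
}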
On Tue, Feb 9, 2021 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > Here is a plan that allows us to stop storing any kind of XID in the > metapage in all cases: Attached is v2, which deals with the metapage 32-bit XID/btm_oldest_btpo_xact issue using the approach I described earlier. We don't store an XID in the metapage anymore in v2. This seems to work well, as I expected it would. -- Peter Geoghegan
Attachment
On 10/02/2021 00:14, Peter Geoghegan wrote: > There is a long standing problem with the way that nbtree page > deletion places deleted pages in the FSM for recycling: The use of a > 32-bit XID within the deleted page (in the special > area's/BTPageOpaqueData struct's btpo.xact field) is not robust > against XID wraparound, which can lead to permanently leaking pages in > a variety of scenarios. The problems became worse with the addition of > the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID > in this context creates risk for any further improvements in VACUUM > that similarly involve skipping whole indexes. For example, Masahiko > has been working on a patch that teaches VACUUM to skip indexes that > are known to have very little garbage [2]. > > Attached patch series fixes the issue once and for all. This is > something that I'm targeting for Postgres 14, since it's more or less > a bug fix. Thanks for picking this up! > The first patch teaches nbtree to use 64-bit transaction IDs here, and > so makes it impossible to leak deleted nbtree pages. This patch is the > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs > due to exactly the same set of problems. The first patch also makes > the level field stored in nbtree page's special area/BTPageOpaqueData > reliably store the level, even in a deleted page. This allows me to > consistently use the level field within amcheck, including even within > deleted pages. Is it really worth the trouble to maintain 'level' on deleted pages? All you currently do with it is check that the BTP_LEAF flag is set iff "level == 0", which seems pointless. I guess there could be some forensic value in keeping 'level', but meh. > The second patch in the series adds new information to VACUUM VERBOSE. > This makes it easy to understand what's going on here. Index page > deletion related output becomes more useful. It might also help with > debugging the first patch. > > Currently, VACUUM VERBOSE output for an index that has some page > deletions looks like this: > > "38 index pages have been deleted, 38 are currently reusable." > > With the second patch applied, we might see this output at the same > point in VACUUM VERBOSE output instead: > > "38 index pages have been deleted, 0 are newly deleted, 38 are > currently reusable." > > This means that out of the 38 of the pages that were found to be > marked deleted in the index, 0 were deleted by the VACUUM operation > whose output we see here. That is, there were 0 nbtree pages that were > newly marked BTP_DELETED within _bt_unlink_halfdead_page() during > *this particular* VACUUM -- the VACUUM operation that we see > instrumentation about here. It follows that the 38 deleted pages that > we encountered must have been marked BTP_DELETED by some previous > VACUUM operation. > > In practice the "%u are currently reusable" output should never > include newly deleted pages, since there is no way that a page marked > BTP_DELETED can be put in the FSM during the same VACUUM operation -- > that's unsafe (we need all of this recycling/XID indirection precisely > because we need to delay recycling until it is truly safe, of course). > Note that the "%u index pages have been deleted" output includes both > pages deleted by some previous VACUUM operation, and newly deleted > pages (no change there). > > Note that the new "newly deleted" output is instrumentation about this > particular *VACUUM operation*. 
> In contrast, the other two existing output numbers ("deleted" and
> "currently reusable") are actually instrumentation about the state of
> the *index as a whole* at a point in time (barring concurrent recycling
> of pages counted in VACUUM by some random _bt_getbuf() call in another
> backend). This fundamental distinction is important here. All 3
> numbers/stats that we output can have different values, which can be
> used to debug the first patch. You can directly observe uncommon cases
> just from the VERBOSE output, like when a long running transaction
> holds up recycling of a deleted page that was actually marked
> BTP_DELETED in an *earlier* VACUUM operation. And so if the first patch
> had any bugs, there'd be a pretty good chance that you could observe
> them using multiple VACUUM VERBOSE operations -- you might notice
> something inconsistent or contradictory just by examining the output
> over time, how things change, etc.

The full message on master is:

INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.

How about:

INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.

The idea is that the first DETAIL line now says what the VACUUM did this round, and the last line says what the state of the index is now. One concern with that phrasing is that it might not be clear what "686 pages were removed" means. We don't actually shrink the file. Then again, I'm not sure if the "have been deleted" was any better in that regard.

It's still a bit weird that the "what VACUUM did this round" information is sandwiched between the two other lines that talk about the state of the index after the operation. But I think the language now makes it more clear which is which. Or perhaps flip the INFO and first DETAIL lines around like this:

INFO: 250000 index row versions and 686 pages were removed from index "foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

For context, the more full message you get on master is:

postgres=# vacuum verbose foo;
INFO: vacuuming "public.foo"
INFO: scanned index "foo_pkey" to remove 250000 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
INFO: "foo": removed 250000 row versions in 1107 pages
DETAIL: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "foo": found 250000 removable, 271 nonremovable row versions in 1108 out of 4425 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1164
There were 87 unused item identifiers.
Skipped 0 pages due to buffer pins, 2212 frozen pages.
0 pages are entirely empty.
CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.28 s.
VACUUM

That's pretty confusing, it's a mix of basically progress indicators (vacuuming "public.foo"), CPU measurements, information about what was removed, and what the state is afterwards. Would be nice to make that more clear overall.

But for now, for this particular INFO message, perhaps make it more consistent with the lines printed by heapam, like this:

INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

- Heikki
On Wed, Feb 10, 2021 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The first patch teaches nbtree to use 64-bit transaction IDs here, and > > so makes it impossible to leak deleted nbtree pages. This patch is the > > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs > > due to exactly the same set of problems. Thank you for working on this! > > There is an unresolved question for my deleted page XID patch: what > should it do about the vacuum_cleanup_index_scale_factor feature, > which added an XID to the metapage (its btm_oldest_btpo_xact field). I > refer to the work done by commit 857f9c36cda for Postgres 11 by > Masahiko. It would be good to get your opinion on this as the original > author of that feature, Masahiko. > > To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among > all deleted pages in the index, so clearly it needs to be carefully > considered in my patch to make the XIDs 64-bit. Even still, v1 of my > patch from today more or less ignores the issue -- it just gets a > 32-bit version of the oldest value as determined by the oldestBtpoXact > XID tracking stuff (which is largely unchanged, except that it works > with 64-bit Full Transaction Ids now). > > Obviously it is still possible for the 32-bit btm_oldest_btpo_xact > field to wrap around in v1 of my patch. The obvious thing to do here > is to add a new epoch metapage field, effectively making > btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good > idea. The only reason that we have the btm_oldest_btpo_xact field in > the first place is to ameliorate the problem that the patch > comprehensively solves! We should stop storing *any* XIDs in the > metapage. (Besides, adding a new "epoch" field to the metapage would > be relatively messy.) I agree that btm_oldest_btpo_xact will no longer be necessary in terms of recycling deleted pages. The purpose of btm_oldest_btpo_xact is to prevent deleted pages from being leaked. As you mentioned, it has the oldest btpo.xact in BTPageOpaqueData among all deleted pages in the index. Looking back to the time when we develop INDEX_CLEANUP option if we skip index cleanup (meaning both ambulkdelete and amvaucumcleanup), there was a problem in btree indexes that deleted pages could never be recycled if XID wraps around. So the idea behind btm_oldest_btpo_xact is, we remember the oldest btpo.xact among the all deleted pages and do btvacuumscan() if this value is older than global xmin (meaning there is at least one recyclable page). That way, we can recycle the deleted pages without leaking the pages (of course, unless INDEX_CLEANUP is disabled). Given that we can guarantee that deleted pages never be leaked by using 64-bit XID, I also think we don't need this value. We can do amvacuumcleanup only if the table receives enough insertions to update the statistics (i.g., vacuum_cleanup_index_scale_factor check). I think this is a more desirable behavior. Not skipping amvacuumcleanup if there is even one deleted page that we can recycle is very conservative. Considering your idea of keeping newly deleted pages in the meta page, I can see a little value that keeping btm_oldest_btpo_xact and making it 64-bit XID. I described below. > > Here is a plan that allows us to stop storing any kind of XID in the > metapage in all cases: > > 1. Stop maintaining the oldest XID among all deleted pages in the > entire nbtree index during VACUUM. 
So we can remove all of the > BTVacState.oldestBtpoXact XID tracking stuff, which is currently > something that even _bt_pagedel() needs special handling for. > > 2. Stop considering the btm_oldest_btpo_xact metapage field in > _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only > cares about maintaining reasonably accurate statistics for the index. > Which is really how the vacuum_cleanup_index_scale_factor feature was > intended to work all along, anyway -- ISTM that the oldestBtpoXact > stuff was always just an afterthought to paper-over this annoying > 32-bit XID issue. > > 3. We cannot actually remove the btm_oldest_btpo_xact XID field from > the metapage, because of course that would change the BTMetaPageData > struct layout, which breaks on-disk compatibility. But why not use it > for something useful instead? _bt_update_meta_cleanup_info() can use > the same field to store the number of "newly deleted" pages from the > last btbulkdelete() instead. (See my email from earlier for the > definition of "newly deleted".) > > 4. Now _bt_vacuum_needs_cleanup() can once again consider the > btm_oldest_btpo_xact metapage field -- except in a totally different > way, because now it means something totally different: "newly deleted > pages during last btbulkdelete() call" (per item 3). If this # pages > is very high then we probably should do a full call to btvacuumscan() > -- _bt_vacuum_needs_cleanup() will return true to make that happen. > > It's unlikely but still possible that a high number of "newly deleted > pages during the last btbulkdelete() call" is in itself a good enough > reason to do a full btvacuumscan() call when the question of calling > btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4 > here conservatively covers that. Maybe the 32-bit-XID-in-metapage > triggering condition had some non-obvious value due to a natural > tendency for it to limit the number of deleted pages that go > unrecycled for a long time. (Or maybe there never really was any such > natural tendency -- still seems like a good idea to make the change > described by item 4.) > > Even though we are conservative (at least in this sense I just > described), we nevertheless don't actually care about very old deleted > pages that we have not yet recycled -- provided there are not very > many of them. I'm thinking of "~2% of index" as the new "newly deleted > during last btbulkdelete() call" threshold applied within > _bt_vacuum_needs_cleanup(). There is no good reason why older > deleted-but-not-yet-recycled pages should be considered more valuable > than any other page that can be used when there is a page split. Interesting. I like this idea that triggers btvacuumscan() if there are many newly deleted pages. I think this would be helpful especially for the case of bulk-deletion on the table. But why we use the number of *newly* deleted pages but not the total number of deleted pages in the index? IIUC if several btbulkdelete executions deleted index pages less than 2% of the index and those deleted pages could not be recycled yet, then the number of recyclable pages would exceed 2% of the index in total but amvacuumcleanup() would not trigger btvacuumscan() because the last newly deleted pages are less than the 2% threshold. I might be missing something though. Also, we need to note that having newly deleted pages doesn't necessarily mean these always are recyclable at that time. 
If the global xmin is still older than deleted page's btpo.xact values, we still could not recycle them. I think btm_oldest_btpo_xact probably will help this case. That is, we store the oldest btpo.xact among those newly deleted pages to btm_oldest_btpo_xact and we trigger btvacuumscan() if there are many newly deleted pages (more than 2% of index) and the btm_oldest_btpo_xact is older than the global xmin (I suppose each newly deleted pages could have different btpo.xact). > > Observations about on-disk compatibility with my patch + this 4 point scheme: > > A. It doesn't matter that pg_upgrade'd indexes will have an XID value > in btm_oldest_btpo_xact that now gets incorrectly interpreted as > "newly deleted pages during last btbulkdelete() call" under the 4 > point scheme I just outlined. > > The spurious value will get cleaned up on the next VACUUM anyway > (whether VACUUM goes through btbulkdelete() or through > btvacuumcleanup()). Besides, most indexes always have a > btm_oldest_btpo_xact value of 0. > > B. The patch I posted earlier doesn't actually care about the > BTREE_VERSION of the index at all. And neither does any of the stuff I > just described for a future v2 of my patch. > > All indexes can use the new format for deleted pages. On-disk > compatibility is easy here because the contents of deleted pages only > need to work as a tombstone. We can safely assume that old-format > deleted pages (pre-Postgres 14 format deleted pages) must be safe to > recycle, because the pg_upgrade itself restarts Postgres. There can be > no backends that have dangling references to the old-format deleted > page. > > C. All supported nbtree versions (all nbtree versions > BTREE_MIN_VERSION+) get the same benefits under this scheme. > > Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to > BTREE_NOVAC_VERSION/version 3 indexes via a call to > _bt_upgrademetapage() -- that has been the case since BTREE_VERSION > was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's > vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will > have the btm_oldest_btpo_xact metapage field that I now propose to > reuse to track "newly deleted pages during last btbulkdelete() call", > per point 4. > > In summary: There are no special cases here. No BTREE_VERSION related > difficulties. That seems like a huge advantage to me. Great! I'll look at the v2 patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
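The combined trigger suggested here might look something like the following standalone sketch (illustrative names only; the 2% figure and the global-xmin check come from the preceding paragraph):

/*
 * Sketch of the suggested trigger: scan the index only when there are both
 * "many" newly deleted pages and at least the oldest of them has become
 * recyclable.  All names are illustrative, not PostgreSQL's.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
typedef uint64_t FullTransactionId;   /* 64-bit XID, per the first patch */

static bool
cleanup_worthwhile(BlockNumber newly_deleted_pages,
                   BlockNumber num_index_pages,
                   FullTransactionId oldest_btpo_xact,
                   FullTransactionId oldest_running_fxid)
{
    /* More than ~2% of the index was newly marked deleted last time? */
    bool        many_deleted = newly_deleted_pages > 0.02 * num_index_pages;

    /*
     * Is the oldest deleted page's XID now behind every running
     * transaction, i.e. is there at least one page we could recycle?
     */
    bool        some_recyclable = oldest_btpo_xact < oldest_running_fxid;

    return many_deleted && some_recyclable;
}

int
main(void)
{
    /* 300 newly deleted pages in a 10000 page index, oldest now recyclable */
    printf("worthwhile: %d\n", cleanup_worthwhile(300, 10000, 500, 1000));
    return 0;
}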
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Thanks for picking this up! I actually had a patch for this in 2019, albeit one that remained in rough shape until recently. Must have forgotten about it. > Is it really worth the trouble to maintain 'level' on deleted pages? All > you currently do with it is check that the BTP_LEAF flag is set iff > "level == 0", which seems pointless. I guess there could be some > forensic value in keeping 'level', but meh. What trouble is that? The only way in which it's inconvenient is that we have to include the level field in xl_btree_unlink_page WAL records for the first time. The structure of the relevant REDO routine (which is called btree_xlog_unlink_page()) ought to explicitly recreate the original page from scratch, without any special cases. This makes it possible to pretend that there never was such a thing as an nbtree page whose level field could not be relied on. I personally think that it's simpler when seen in the wider context of how the code works and is verified. Besides, there is also amcheck to consider. I am a big believer in amcheck, and see it as something that has enabled my work on the B-Tree code over the past few years. Preserving the level field in deleted pages increases our coverage just a little, and practically eliminates cases where we cannot rely on the level field. Of course it's still true that this detail (the deleted pages level field question) will probably never seem important to anybody else. To me it's one small detail of a broader strategy. No one detail of that broader strategy, taken in isolation, will ever be crucially important. Of course it's also true that we should not assume that a very high cost in performance/code/whatever can justify a much smaller benefit in amcheck. But you haven't really explained why the cost seems unacceptable to you. (Perhaps I missed something.) > How about: > > INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages > DETAIL: 250000 index row versions and 686 pages were removed. > 2056 index pages are now unused, 1370 are currently reusable. > > The idea is that the first DETAIL line now says what the VACUUM did this > round, and the last line says what the state of the index is now. One > concern with that phrasing is that it might not be clear what "686 pages > were removed" means. > It's still a bit weird that the "what VACUUM did this round" information > is sandwiched between the two other lines that talk about the state of > the index after the operation. But I think the language now makes it > more clear which is which. IMV our immediate goal for the new VACUUM VERBOSE output should be to make the output as accurate and descriptive as possible (while still using terminology that works for all index AMs, not just nbtree). I don't think that we should give too much weight to making the information easy to understand in isolation. Because that's basically impossible -- it just doesn't work that way IME. Confusion over the accounting of "deleted pages in indexes" vs "pages deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to see one vintage example. The relevant output of VACUUM VERBOSE produced inconsistent results for perhaps as long as 15 years before I noticed it and fixed it. I somehow didn't notice this despite using it for various tests for my own B-Tree projects a year or two before the fix. 
Tests that produced inconsistent results that I noticed pretty early on, and yet assumed were all down to some subtlety that I didn't yet understand. My point is this: I am quite prepared to admit that these details really are complicated. But that's not incidental to what's really going on, or anything (though I agree with your later remarks on the general tidiness of VACUUM VERBOSE -- it is a real dog's dinner). I'm not saying that we should assume that no DBA will find the relevant VACUUM VERBOSE output useful -- I don't think that at all. It will be kind of rare for a user to really comb through it. But that's mostly because big problems in this area are themselves kind of rare (most individual indexes never have any deleted pages IME). Any DBA consuming this output sensibly will consume it in a way that makes sense in the *context of the problem that they're experiencing*, whatever that might mean for them. They'll consider how it changes over time for the same index. They'll try to correlate it with other symptoms, or other problems, and make sense of it in a top-down fashion. We should try to make it as descriptive as possible so that DBAs will have the breadcrumbs they need to tie it back to whatever the core issue happens to be -- maybe they'll have to read the source code to get to the bottom of it. It's likely to be some rare issue in those cases where the DBA really cares about the details -- it's likely to be workload dependent. Good DBAs spend much of their time on exceptional problems -- all the easy problems will have been automated away already. Things like wait events are popular with DBAs for this reason. > Or perhaps flip the INFO and first DETAIL > lines around like this: > INFO: 250000 index row versions and 686 pages were removed from index > "foo_pkey" > DETAIL: index now contains 250001 row versions in 2745 pages. > 2056 index pages are now unused, of which 1370 are currently reusable. > > For context, the more full message you get on master is: > That's pretty confusing, it's a mix of basically progress indicators > (vacuuming "public.foo"), CPU measurements, information about what was > removed, and what the state is afterwards. I agree that the output of VACUUM VERBOSE is messy. It's probably a bunch of accretions that made sense in isolation, but added up to a big mess over time. So I agree: now would be a good time to do something about that. It would also be nice to find a way to get this information in the logs when log_autovacuum is enabled (perhaps only when the verbosity is increased). I've discussed this with Masahiko in the context of his recent work, actually. Even before we started talking about the XID page deletion problem that I'm fixing here. > INFO: "foo_pkey": removed 250000 index row versions and 686 pages > DETAIL: index now contains 250001 row versions in 2745 pages. > 2056 index pages are now unused, of which 1370 are currently reusable. I can see what you mean here, and maybe we should do roughly what you've outlined. Still, we should use terminology that isn't too far removed from what actually happens in nbtree. What's a "removed" page? The distinction between all of the different kinds of index pages that might be involved here is just subtle. Again, better to use a precise, descriptive term that nobody fully understands -- because hardly anybody will fully understand it anyway (even including advanced users that go on to find the VACUUM VERBOSE output very useful for whatever reason). -- Peter Geoghegan
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for working on this!

I'm glad that I finally found time for it! It seems like it'll make things easier elsewhere.

Attached is v3 of the index. I'll describe the changes I made in more detail in my response to your points below.

> I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> of recycling deleted pages.

Cool.

> Given that we can guarantee that deleted pages never be leaked by
> using 64-bit XID, I also think we don't need this value. We can do
> amvacuumcleanup only if the table receives enough insertions to update
> the statistics (i.g., vacuum_cleanup_index_scale_factor check). I
> think this is a more desirable behavior. Not skipping amvacuumcleanup
> if there is even one deleted page that we can recycle is very
> conservative.
>
> Considering your idea of keeping newly deleted pages in the meta page,
> I can see a little value that keeping btm_oldest_btpo_xact and making
> it 64-bit XID. I described below.

Interesting.

> I like this idea that triggers btvacuumscan() if there are many newly
> deleted pages. I think this would be helpful especially for the case
> of bulk-deletion on the table. But why we use the number of *newly*
> deleted pages but not the total number of deleted pages in the index?

I was unclear here -- I should not have said "newly deleted" pages at all. What I actually do when calling _bt_vacuum_needs_cleanup() is this (from v3, at the end of btvacuumscan()):

-       _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+       Assert(stats->pages_deleted >= stats->pages_free);
+       pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+       _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
                                     info->num_heap_tuples);

We're actually passing something I have called "pages_deleted_not_free" here, which is derived from the bulk delete stats in the obvious way that you see here (subtraction). I'm not using pages_newly_deleted at all now. Note also that the behavior inside _bt_update_meta_cleanup_info() no longer varies based on whether it is called during btvacuumcleanup() or during btbulkdelete() -- the same rules apply either way. We want to store pages_deleted_not_free in the metapage at the end of btvacuumscan(), no matter what.

This same pages_deleted_not_free information is now used by _bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too high (over 2.5%), then that will trigger a call to btbulkdelete() (we won't skip scanning the index). Though in practice it probably won't come up that often -- there just aren't ever that many deleted pages in most indexes.

> IIUC if several btbulkdelete executions deleted index pages less than
> 2% of the index and those deleted pages could not be recycled yet,
> then the number of recyclable pages would exceed 2% of the index in
> total but amvacuumcleanup() would not trigger btvacuumscan() because
> the last newly deleted pages are less than the 2% threshold. I might
> be missing something though.

I think you're right -- my idea of varying the behavior of _bt_update_meta_cleanup_info() based on whether it's being called during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW half the problem was that I explained the idea badly to begin with). But, as I said, it's fixed in v3: we simply pass "pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup() now.

Does that make sense? Does it address this concern?
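For reference, the arithmetic in the snippet above amounts to something like this standalone sketch (illustrative struct and field names that mirror the bulk delete counters, not the actual IndexBulkDeleteResult):

/*
 * Sketch of how pages_deleted_not_free is derived from the bulk delete
 * counters shown in the diff above.  Illustrative types/names only.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct SketchBulkDeleteStats
{
    uint32_t    pages_deleted;       /* all deleted pages seen by this VACUUM */
    uint32_t    pages_newly_deleted; /* subset marked deleted by this VACUUM */
    uint32_t    pages_free;          /* subset already safe to recycle */
} SketchBulkDeleteStats;

static uint32_t
pages_deleted_not_free(const SketchBulkDeleteStats *stats)
{
    /* every recyclable page is also counted as a deleted page */
    assert(stats->pages_deleted >= stats->pages_free);

    /* deleted pages that could not be put in the FSM yet */
    return stats->pages_deleted - stats->pages_free;
}

int
main(void)
{
    /* roughly the "38 deleted, 0 newly deleted, 38 reusable" example */
    SketchBulkDeleteStats stats = {38, 0, 38};

    printf("left for a later VACUUM to recycle: %u\n",
           (unsigned) pages_deleted_not_free(&stats));
    return 0;
}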
> Also, we need to note that having newly deleted pages doesn't > necessarily mean these always are recyclable at that time. If the > global xmin is still older than deleted page's btpo.xact values, we > still could not recycle them. I think btm_oldest_btpo_xact probably > will help this case. That is, we store the oldest btpo.xact among > those newly deleted pages to btm_oldest_btpo_xact and we trigger > btvacuumscan() if there are many newly deleted pages (more than 2% of > index) and the btm_oldest_btpo_xact is older than the global xmin (I > suppose each newly deleted pages could have different btpo.xact). I agree that having no XID in the metapage creates a new small problem. Specifically, there are certain narrow cases that can cause confusion in _bt_vacuum_needs_cleanup(). These cases didn't really exist before my patch (kind of). The simplest example is easy to run into when debugging the patch on your laptop. Because you're using your personal laptop, and not a real production server, there will be no concurrent sessions that might consume XIDs. You can run VACUUM VERBOSE manually several times, but that alone will never be enough to enable VACUUM to recycle any of the pages that the first VACUUM manages to delete (many to mark deleted, reporting the pages as "newly deleted" via the new instrumentation from the second patch). Note that the master branch is *also* unable to recycle these deleted pages, simply because the "safe xid" never gets old because there are no newly allocated XIDs to make it look old (there are no allocated XIDs just because nothing else happens). That in itself is not the new problem. The new problem is that _bt_vacuum_needs_cleanup() will no longer notice that the oldest XID among deleted-but-not-yet-recycled pages is so old that it will not be able to recycle the pages anyway -- at least not the oldest page, though in this specific case that will apply to all deleted pages equally. We might as well not bother trying yet, which the old code "gets right" -- but it doesn't get it right for any good reason. That is, the old code won't have VACUUM scan the index at all, so it "wins" in this specific scenario. I think that's okay, though -- it's not a real problem, and actually makes sense and has other advantages. This is why I believe it's okay: * We really should never VACUUM the same table before even one or two XIDs are allocated -- that's what happens in the simple laptop test scenario that I described. Surely we should not be too concerned about "doing the right thing" under this totally artificial set of conditions. (BTW, I've been using txid_current() for my own "laptop testing", as a way to work around this issue.) * More generally, if you really can't do recycling of pages that you deleted during the last VACUUM during this VACUUM (perhaps because of the presence of a long-running xact that holds open a snapshot), then you have lots of *huge* problems already, and this is the least of your concerns. Besides, at that point an affected VACUUM will be doing work for an affected index through a btbulkdelete() call, so the behavior of _bt_vacuum_needs_cleanup() becomes irrelevant. * As you pointed out already, the oldest XID/deleted page from the index may be significantly older than the newest. Why should we bucket them together? We could easily have a case where most of the deleted pages can be recycled -- even when all indexes were originally marked deleted by the same VACUUM operation. 
If there are lots of pages that actually can be recycled, it is probably a bad thing to assume that the oldest XID is representative of all of them. After all, with the patch we only go out of our way to recycle deleted pages when we are almost sure that the total number of recyclable pages (pages marked deleted during a previous VACUUM) exceeds 2.5% of the total size of the index. That broad constraint is important here -- if we do nothing unless there are lots of deleted pages anyway, we are highly unlikely to ever err on the side of being too eager (not eager enough seems more likely to me). I think that we're justified in making a general assumption inside _bt_vacuum_needs_cleanup() (which is documented at the point that we call it, inside btvacuumscan()): The assumption that however many index pages the metapage says we'll be able to recycle (whatever the field says) will in fact turn out to be recyclable if we decide that we need to. There are specific cases where that will be kind of wrong, as I've gone into, but the assumption/design has many more advantages than disadvantages. I have tried to capture this in v3 of the patch. Can you take a look? See the new comments inside _bt_vacuum_needs_cleanup(). Plus the comments when we call it inside btvacuumscan(). Do you think that those new comments are helpful? Does this address your concern? Thanks -- Peter Geoghegan
Attachment
On Wed, Feb 10, 2021 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v3 of the index. I'll describe the changes I made in more > detail in my response to your points below. I forget to mention that v3 adds several assertions like this one: Assert(!_bt_page_recyclable(BufferGetPage(buf))); These appear at a few key points inside generic routines like _bt_getbuf(). The overall effect is that every nbtree buffer access (with the exception of buffer accesses by VACUUM) will make sure that the page that they're about to access is not recyclable (a page that an index scan lands on might be half-dead or deleted, but it had better not be recyclable). This can probably catch problems with recycling pages too early, such as the problem fixed by commit d3abbbeb back in 2012. Any similar bugs in this area that may appear in the future can be expected to be very subtle, for a few reasons. For one, a page can be recyclable but not yet entered into the FSM by VACUUM for a long time. (I could go on.) The assertions dramatically improve our chances of catching problems like that early. -- Peter Geoghegan
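A standalone sketch of the kind of check these assertions perform (simplified page representation and helper names, not the real _bt_getbuf()/_bt_page_recyclable() code):

/*
 * Sketch of the "never land on a recyclable page" assertion.  The page
 * layout and helpers here are simplified stand-ins, not PostgreSQL's.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FullTransactionId;

typedef struct SketchPageOpaque
{
    bool              deleted;   /* page marked BTP_DELETED? */
    FullTransactionId safexid;   /* when it becomes safe to recycle */
} SketchPageOpaque;

/* True when no possible index scan can still hold a link to the page */
static bool
page_recyclable(const SketchPageOpaque *opaque,
                FullTransactionId oldest_running_fxid)
{
    return opaque->deleted && opaque->safexid < oldest_running_fxid;
}

/*
 * Every ordinary page access goes through a helper like this: a scan may
 * land on a half-dead or deleted page, but never on one that was already
 * eligible for recycling -- if it does, recycling happened too early.
 */
static const SketchPageOpaque *
access_page(const SketchPageOpaque *opaque,
            FullTransactionId oldest_running_fxid)
{
    assert(!page_recyclable(opaque, oldest_running_fxid));
    return opaque;
}

int
main(void)
{
    SketchPageOpaque live = {false, 0};

    access_page(&live, 1000);   /* fine: page is not even deleted */
    return 0;
}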
On Thu, Feb 11, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for working on this! > > I'm glad that I finally found time for it! It seems like it'll make > things easier elsewhere. > > Attached is v3 of the index. I'll describe the changes I made in more > detail in my response to your points below. > > > I agree that btm_oldest_btpo_xact will no longer be necessary in terms > > of recycling deleted pages. > > Cool. > > > Given that we can guarantee that deleted pages never be leaked by > > using 64-bit XID, I also think we don't need this value. We can do > > amvacuumcleanup only if the table receives enough insertions to update > > the statistics (i.g., vacuum_cleanup_index_scale_factor check). I > > think this is a more desirable behavior. Not skipping amvacuumcleanup > > if there is even one deleted page that we can recycle is very > > conservative. > > > > Considering your idea of keeping newly deleted pages in the meta page, > > I can see a little value that keeping btm_oldest_btpo_xact and making > > it 64-bit XID. I described below. > > > Interesting. > > > > I like this idea that triggers btvacuumscan() if there are many newly > > deleted pages. I think this would be helpful especially for the case > > of bulk-deletion on the table. But why we use the number of *newly* > > deleted pages but not the total number of deleted pages in the index? > > I was unclear here -- I should not have said "newly deleted" pages at > all. What I actually do when calling _bt_vacuum_needs_cleanup() is > this (from v3, at the end of btvacuumscan()): > > - _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact, > + Assert(stats->pages_deleted >= stats->pages_free); > + pages_deleted_not_free = stats->pages_deleted - stats->pages_free; > + _bt_update_meta_cleanup_info(rel, pages_deleted_not_free, > info->num_heap_tuples); > > We're actually passing something I have called > "pages_deleted_not_free" here, which is derived from the bulk delete > stats in the obvious way that you see here (subtraction). I'm not > using pages_newly_deleted at all now. Note also that the behavior > inside _bt_update_meta_cleanup_info() no longer varies based on > whether it is called during btvacuumcleanup() or during btbulkdelete() > -- the same rules apply either way. We want to store > pages_deleted_not_free in the metapage at the end of btvacuumscan(), > no matter what. > > This same pages_deleted_not_free information is now used by > _bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too > high (over 2.5%), then that will trigger a call to btbulkdelete() (we > won't skip scanning the index). Though in practice it probably won't > come up that often -- there just aren't ever that many deleted pages > in most indexes. Thanks for your explanation. That makes sense to me. > > > IIUC if several btbulkdelete executions deleted index pages less than > > 2% of the index and those deleted pages could not be recycled yet, > > then the number of recyclable pages would exceed 2% of the index in > > total but amvacuumcleanup() would not trigger btvacuumscan() because > > the last newly deleted pages are less than the 2% threshold. I might > > be missing something though. 
> > I think you're right -- my idea of varying the behavior of > _bt_update_meta_cleanup_info() based on whether it's being called > during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW > half the problem was that I explained the idea badly to begin with). > But, as I said, it's fixed in v3: we simply pass > "pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup() > now. > > Does that make sense? Does it address this concern? Yes! > > > Also, we need to note that having newly deleted pages doesn't > > necessarily mean these always are recyclable at that time. If the > > global xmin is still older than deleted page's btpo.xact values, we > > still could not recycle them. I think btm_oldest_btpo_xact probably > > will help this case. That is, we store the oldest btpo.xact among > > those newly deleted pages to btm_oldest_btpo_xact and we trigger > > btvacuumscan() if there are many newly deleted pages (more than 2% of > > index) and the btm_oldest_btpo_xact is older than the global xmin (I > > suppose each newly deleted pages could have different btpo.xact). > > I agree that having no XID in the metapage creates a new small > problem. Specifically, there are certain narrow cases that can cause > confusion in _bt_vacuum_needs_cleanup(). These cases didn't really > exist before my patch (kind of). > > The simplest example is easy to run into when debugging the patch on > your laptop. Because you're using your personal laptop, and not a real > production server, there will be no concurrent sessions that might > consume XIDs. You can run VACUUM VERBOSE manually several times, but > that alone will never be enough to enable VACUUM to recycle any of the > pages that the first VACUUM manages to delete (many to mark deleted, > reporting the pages as "newly deleted" via the new instrumentation > from the second patch). Note that the master branch is *also* unable > to recycle these deleted pages, simply because the "safe xid" never > gets old because there are no newly allocated XIDs to make it look old > (there are no allocated XIDs just because nothing else happens). That > in itself is not the new problem. > > The new problem is that _bt_vacuum_needs_cleanup() will no longer > notice that the oldest XID among deleted-but-not-yet-recycled pages is > so old that it will not be able to recycle the pages anyway -- at > least not the oldest page, though in this specific case that will > apply to all deleted pages equally. We might as well not bother trying > yet, which the old code "gets right" -- but it doesn't get it right > for any good reason. That is, the old code won't have VACUUM scan the > index at all, so it "wins" in this specific scenario. I'm on the same page. > > I think that's okay, though -- it's not a real problem, and actually > makes sense and has other advantages. This is why I believe it's okay: > > * We really should never VACUUM the same table before even one or two > XIDs are allocated -- that's what happens in the simple laptop test > scenario that I described. Surely we should not be too concerned about > "doing the right thing" under this totally artificial set of > conditions. Right. > > (BTW, I've been using txid_current() for my own "laptop testing", as a > way to work around this issue.) 
> > * More generally, if you really can't do recycling of pages that you > deleted during the last VACUUM during this VACUUM (perhaps because of > the presence of a long-running xact that holds open a snapshot), then > you have lots of *huge* problems already, and this is the least of > your concerns. Besides, at that point an affected VACUUM will be doing > work for an affected index through a btbulkdelete() call, so the > behavior of _bt_vacuum_needs_cleanup() becomes irrelevant. > I agree that there already are huge problems in that case. But I think we need to consider an append-only case as well; after bulk deletion on an append-only table, vacuum deletes heap tuples and index tuples, marking some index pages as dead and setting an XID into btpo.xact. Since we trigger autovacuums even by insertions based on autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on the table again. But if there is a long-running query a "wasted" cleanup scan could happen many times depending on the values of autovacuum_vacuum_insert_scale_factor/threshold and vacuum_cleanup_index_scale_factor. This should not happen in the old code. I agree this is DBA problem but it also means this could bring another new problem in a long-running query case. > * As you pointed out already, the oldest XID/deleted page from the > index may be significantly older than the newest. Why should we bucket > them together? I agree with this point. > > We could easily have a case where most of the deleted pages can be > recycled -- even when all indexes were originally marked deleted by > the same VACUUM operation. If there are lots of pages that actually > can be recycled, it is probably a bad thing to assume that the oldest > XID is representative of all of them. After all, with the patch we > only go out of our way to recycle deleted pages when we are almost > sure that the total number of recyclable pages (pages marked deleted > during a previous VACUUM) exceeds 2.5% of the total size of the index. > That broad constraint is important here -- if we do nothing unless > there are lots of deleted pages anyway, we are highly unlikely to ever > err on the side of being too eager (not eager enough seems more likely > to me). > > I think that we're justified in making a general assumption inside > _bt_vacuum_needs_cleanup() (which is documented at the point that we > call it, inside btvacuumscan()): The assumption that however many > index pages the metapage says we'll be able to recycle (whatever the > field says) will in fact turn out to be recyclable if we decide that > we need to. There are specific cases where that will be kind of wrong, > as I've gone into, but the assumption/design has many more advantages > than disadvantages. > > I have tried to capture this in v3 of the patch. Can you take a look? > See the new comments inside _bt_vacuum_needs_cleanup(). Plus the > comments when we call it inside btvacuumscan(). I basically agreed with the change made in v3 patch. But I think it's probably worth having a discussion on append-only table cases with autovacuums triggered by autovacuum_vacuum_insert_scale_factor/threshold. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I agree that there already are huge problems in that case. But I think > we need to consider an append-only case as well; after bulk deletion > on an append-only table, vacuum deletes heap tuples and index tuples, > marking some index pages as dead and setting an XID into btpo.xact. > Since we trigger autovacuums even by insertions based on > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on > the table again. But if there is a long-running query a "wasted" > cleanup scan could happen many times depending on the values of > autovacuum_vacuum_insert_scale_factor/threshold and > vacuum_cleanup_index_scale_factor. This should not happen in the old > code. I agree this is DBA problem but it also means this could bring > another new problem in a long-running query case. I see your point. This will only not be a problem with the old code because the oldest XID in the metapage happens to restrict VACUUM in what turns out to be exactly perfect. But why assume that? It's actually rather unlikely that we won't be able to free even one block, even in this scenario. The oldest XID isn't truly special -- at least not without the restrictions that go with 32-bit XIDs. The other thing is that vacuum_cleanup_index_scale_factor is mostly about limiting how long we'll go before having stale statistics, and so presumably the user gets the benefit of not having stale statistics (maybe that theory is a bit questionable in some cases, but that doesn't have all that much to do with page deletion -- in fact the problem exists without page deletion ever occuring). BTW, I am thinking about making recycling take place for pages that were deleted during the same VACUUM. We can just use a work_mem-limited array to remember a list of blocks that are deleted but not yet recyclable (plus the XID found in the block). At the end of the VACUUM, (just before calling IndexFreeSpaceMapVacuum() from within btvacuumscan()), we can then determine which blocks are now safe to recycle, and recycle them after all using some "late" calls to RecordFreeIndexPage() (and without revisiting the pages a second time). No need to wait for the next VACUUM to recycle pages this way, at least in many common cases. The reality is that it usually doesn't take very long for a deleted page to become recyclable -- why wait? This idea is enabled by commit c79f6df75dd from 2018. I think it's the next logical step. -- Peter Geoghegan
On Sat, Feb 13, 2021 at 05:39 Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> (BTW, I've been using txid_current() for my own "laptop testing", as a
> way to work around this issue.)
>
> * More generally, if you really can't do recycling of pages that you
> deleted during the last VACUUM during this VACUUM (perhaps because of
> the presence of a long-running xact that holds open a snapshot), then
> you have lots of *huge* problems already, and this is the least of
> your concerns. Besides, at that point an affected VACUUM will be doing
> work for an affected index through a btbulkdelete() call, so the
> behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
>
I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.
I'd like to outline one relevant case.
Quite often bulk deletes are done on a time series data (oldest) and effectively
removes a continuous chunk of data at the (physical) beginning of the table,
this is especially true for the append-only tables.
After the delete, planning queries takes a long time, due to MergeJoin estimates
are using IndexScans ( see https://postgr.es/m/17467.1426090533@sss.pgh.pa.us )
Right now we have to disable MergeJoins via the ALTER SYSTEM to mitigate this.
So I would, actually, like it very much for VACUUM to kick in sooner in such cases.
--
Victor Yegorov
On Fri, Feb 12, 2021 at 10:27 PM Victor Yegorov <vyegorov@gmail.com> wrote: > I'd like to outline one relevant case. > > Quite often bulk deletes are done on a time series data (oldest) and effectively > removes a continuous chunk of data at the (physical) beginning of the table, > this is especially true for the append-only tables. > After the delete, planning queries takes a long time, due to MergeJoin estimates > are using IndexScans ( see https://postgr.es/m/17467.1426090533@sss.pgh.pa.us ) > Right now we have to disable MergeJoins via the ALTER SYSTEM to mitigate this. > > So I would, actually, like it very much for VACUUM to kick in sooner in such cases. Masahiko was specifically concerned about workloads with bursty/uneven/mixed VACUUM triggering conditions -- he mentioned autovacuum_vacuum_insert_scale_factor/threshold as being applied to trigger a second VACUUM (which follows from an initial VACUUM that performs deletions following a bulk DELETE). A VACUUM that needs to delete index tuples will do its btvacuumscan() through the btbulkdelete() path, not through the btvacuumcleanup() "cleanup only" path. The btbulkdelete() path won't ever call _bt_vacuum_needs_cleanup() in the first place, and so there can be no risk that the relevant changes (changes that the patch makes to that function) will have some new bad effect. The problem that you have described seems very real, but it doesn't seem relevant to the specific scenario that Masahiko expressed concern about. Nor does it seem relevant to this patch more generally. -- Peter Geoghegan
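For readers following along, the distinction between the two paths can be sketched in a few lines of standalone C (the names mirror the nbtree entry points, but the bodies are stand-ins, not the real implementations; the real code signals "bulk delete already ran" through the stats pointer rather than a flag):

    #include <stdbool.h>
    #include <stdio.h>

    static bool stats_from_bulkdelete;  /* did this VACUUM call btbulkdelete()? */

    static void btvacuumscan(void)   { printf("full index scan\n"); }
    static bool _bt_vacuum_needs_cleanup(void) { return false; /* threshold check in real code */ }

    /* Called when the VACUUM found index tuples that must be deleted */
    static void
    btbulkdelete_sketch(void)
    {
        btvacuumscan();                 /* deletes tuples, may delete pages */
        stats_from_bulkdelete = true;
    }

    /* Always called at the end of VACUUM's index pass */
    static void
    btvacuumcleanup_sketch(void)
    {
        if (stats_from_bulkdelete)
            return;                     /* already scanned; _bt_vacuum_needs_cleanup() never consulted */
        if (_bt_vacuum_needs_cleanup())
            btvacuumscan();             /* "cleanup only" scan */
    }

    int
    main(void)
    {
        btbulkdelete_sketch();          /* bulk-delete path: cleanup below becomes a no-op */
        btvacuumcleanup_sketch();
        return 0;
    }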
On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I agree that there already are huge problems in that case. But I think > > we need to consider an append-only case as well; after bulk deletion > > on an append-only table, vacuum deletes heap tuples and index tuples, > > marking some index pages as dead and setting an XID into btpo.xact. > > Since we trigger autovacuums even by insertions based on > > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on > > the table again. But if there is a long-running query a "wasted" > > cleanup scan could happen many times depending on the values of > > autovacuum_vacuum_insert_scale_factor/threshold and > > vacuum_cleanup_index_scale_factor. This should not happen in the old > > code. I agree this is DBA problem but it also means this could bring > > another new problem in a long-running query case. > > I see your point. My guess is that this concern of yours is somehow related to how we do deletion and recycling *in general*. Currently (and even in v3 of the patch), we assume that recycling the pages that a VACUUM operation deletes will happen "eventually". This kind of makes sense when you have "typical vacuuming" -- deletes/updates, and no big bursts, rare bulk deletes, etc. But when you do have a mixture of different triggering positions, which is quite possible, it is difficult to understand what "eventually" actually means... > BTW, I am thinking about making recycling take place for pages that > were deleted during the same VACUUM. We can just use a > work_mem-limited array to remember a list of blocks that are deleted > but not yet recyclable (plus the XID found in the block). ...which brings me back to this idea. I've prototyped this. It works really well. In most cases the prototype makes VACUUM operations with nbtree index page deletions also recycle the pages that were deleted, at the end of the btvacuumscan(). We do very little or no "indefinite deferring" work here. This has obvious advantages, of course, but it also has a non-obvious advantage: the awkward question of concerning "what eventually actually means" with mixed triggering conditions over time mostly goes away. So perhaps this actually addresses your concern, Masahiko. I've been testing this with BenchmarkSQL [1], which has several indexes that regularly need page deletions. There is also a realistic "life cycle" to the data in these indexes. I added custom instrumentation to display information about what's going on with page deletion when the benchmark is run. I wrote a quick-and-dirty patch that makes log_autovacuum show the same information that you see about index page deletion when VACUUM VERBOSE is run (including the new pages_newly_deleted field from my patch). With this particular TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go on to place every page that it deletes in the FSM without leaving anything to the next VACUUM. There are a very small number of exceptions where we "only" manage to recycle maybe 95% of the pages that were deleted. The race condition that nbtree avoids by deferring recycling was always a narrow one, outside of the extremes -- the way we defer has always been overkill. It's almost always unnecessary to delay placing deleted pages in the FSM until the *next* VACUUM. We only have to delay it until the end of the *same* VACUUM -- why wait until the next VACUUM if we don't have to? 
In general this deferring recycling business has nothing to do with MVCC/GC/whatever, and yet the code seems to suggest that it does. While it is convenient to use an XID for page deletion and recycling as a way of implementing what Lanin & Shasha call "the drain technique" [2], all we have to do is prevent certain race conditions. This is all about the index itself, the data structure, how it is maintained -- nothing more. It almost seems obvious to me. It's still possible to imagine extremes. Extremes that even the "try to recycle pages we ourselves deleted when we reach the end of btvacuumscan()" version of my patch cannot deal with. Maybe it really is true that it's inherently impossible to recycle a deleted page even at the end of a VACUUM -- maybe a long-running transaction (that could in principle have a stale link to our deleted page) starts before we VACUUM, and lasts after VACUUM finishes. So it's just not safe. When that happens, we're back to having the original problem: we're relying on some *future* VACUUM operation to do that for us at some indefinite point in the future. It's fair to wonder: What are the implications of that? Are we not back to square one? Don't we have the same "what does 'eventually' really mean" problem once again? I think that that's okay, because this remaining case is a *truly* extreme case (especially with a large index, where index vacuuming will naturally take a long time). It will be rare. But more importantly, the fact that scenario is now an extreme case justifies treating it as an extreme case. We can teach _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In particular, I think that it will now be okay to increase the threshold applied when considering deleted pages inside _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the patch. But in v4, which has the new recycling enhancement, I think that it would be sensible to make it 5%, or maybe even 10%. This naturally makes Masahiko's problem scenario unlikely to actually result in a truly wasted call to btvacuumscan(). The number of pages that the metapage indicates are "deleted but not yet placed in the FSM" will be close to the theoretical minimum, because we're no longer naively throwing away information about which specific pages will be recyclable soon. Which is what the current approach does, really. [1] https://github.com/wieck/benchmarksql [2] https://archive.org/stream/symmetricconcurr00lani#page/8/mode/2up -- see "2.5 Freeing Empty Nodes" -- Peter Geoghegan
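A minimal sketch of the kind of threshold test being described for _bt_vacuum_needs_cleanup() (standalone C; the 5% figure and the parameter names are placeholders for whatever the final patch uses):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    /*
     * Trigger a cleanup-only scan when the metapage says that the number of
     * pages deleted by previous VACUUMs but not yet placed in the FSM exceeds
     * some small fraction of the whole index.
     */
    static bool
    needs_cleanup_for_deleted_pages(BlockNumber num_index_pages,
                                    BlockNumber deleted_not_yet_free)
    {
        double threshold = 0.05;        /* 2.5% in v3 of the patch, 5% proposed for v4 */

        return deleted_not_yet_free > (BlockNumber) (num_index_pages * threshold);
    }

    int
    main(void)
    {
        printf("%d\n", needs_cleanup_for_deleted_pages(10000, 300)); /* 0: below 5% */
        printf("%d\n", needs_cleanup_for_deleted_pages(10000, 700)); /* 1: above 5% */
        return 0;
    }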
On Sat, Feb 13, 2021 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > It will be rare. But more importantly, the fact that scenario is now > an extreme case justifies treating it as an extreme case. We can teach > _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In > particular, I think that it will now be okay to increase the threshold > applied when considering deleted pages inside > _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the > patch. But in v4, which has the new recycling enhancement, I think > that it would be sensible to make it 5%, or maybe even 10%. This > naturally makes Masahiko's problem scenario unlikely to actually > result in a truly wasted call to btvacuumscan(). Attached is v4, which has the "recycle pages that we ourselves deleted during this same VACUUM operation" enhancement. It also doubles the _bt_vacuum_needs_cleanup() threshold applied to deleted pages -- it goes from 2.5% to 5%. The new patch in the patch series (v4-0002-*) certainly needs more polishing. I'm posting what I have now because v3 has bitrot. Benchmarking has shown that the enhancement in v4-0002-* can significantly reduce the amount of index bloat in two of the BenchmarkSQL/TPC-C indexes. -- Peter Geoghegan
On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I agree that there already are huge problems in that case. But I think > > > we need to consider an append-only case as well; after bulk deletion > > > on an append-only table, vacuum deletes heap tuples and index tuples, > > > marking some index pages as dead and setting an XID into btpo.xact. > > > Since we trigger autovacuums even by insertions based on > > > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on > > > the table again. But if there is a long-running query a "wasted" > > > cleanup scan could happen many times depending on the values of > > > autovacuum_vacuum_insert_scale_factor/threshold and > > > vacuum_cleanup_index_scale_factor. This should not happen in the old > > > code. I agree this is DBA problem but it also means this could bring > > > another new problem in a long-running query case. > > > > I see your point. > > My guess is that this concern of yours is somehow related to how we do > deletion and recycling *in general*. Currently (and even in v3 of the > patch), we assume that recycling the pages that a VACUUM operation > deletes will happen "eventually". This kind of makes sense when you > have "typical vacuuming" -- deletes/updates, and no big bursts, rare > bulk deletes, etc. > > But when you do have a mixture of different triggering positions, > which is quite possible, it is difficult to understand what > "eventually" actually means... > > > BTW, I am thinking about making recycling take place for pages that > > were deleted during the same VACUUM. We can just use a > > work_mem-limited array to remember a list of blocks that are deleted > > but not yet recyclable (plus the XID found in the block). > > ...which brings me back to this idea. > > I've prototyped this. It works really well. In most cases the > prototype makes VACUUM operations with nbtree index page deletions > also recycle the pages that were deleted, at the end of the > btvacuumscan(). We do very little or no "indefinite deferring" work > here. This has obvious advantages, of course, but it also has a > non-obvious advantage: the awkward question of concerning "what > eventually actually means" with mixed triggering conditions over time > mostly goes away. So perhaps this actually addresses your concern, > Masahiko. Yes. I think this would simplify the problem by resolving almost all problems related to indefinite deferring page recycle. We will be able to recycle almost all just-deleted pages in practice especially when btvacuumscan() took a long time. And there would not be a noticeable downside, I think. BTW if btree index starts to use maintenan_work_mem for this purpose, we also need to set amusemaintenanceworkmem to true which is considered when parallel vacuum. > > I've been testing this with BenchmarkSQL [1], which has several > indexes that regularly need page deletions. There is also a realistic > "life cycle" to the data in these indexes. I added custom > instrumentation to display information about what's going on with page > deletion when the benchmark is run. I wrote a quick-and-dirty patch > that makes log_autovacuum show the same information that you see about > index page deletion when VACUUM VERBOSE is run (including the new > pages_newly_deleted field from my patch). 
With this particular > TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go > on to place every page that it deletes in the FSM without leaving > anything to the next VACUUM. There are a very small number of > exceptions where we "only" manage to recycle maybe 95% of the pages > that were deleted. Great! > > The race condition that nbtree avoids by deferring recycling was > always a narrow one, outside of the extremes -- the way we defer has > always been overkill. It's almost always unnecessary to delay placing > deleted pages in the FSM until the *next* VACUUM. We only have to > delay it until the end of the *same* VACUUM -- why wait until the next > VACUUM if we don't have to? In general this deferring recycling > business has nothing to do with MVCC/GC/whatever, and yet the code > seems to suggest that it does. While it is convenient to use an XID > for page deletion and recycling as a way of implementing what Lanin & > Shasha call "the drain technique" [2], all we have to do is prevent > certain race conditions. This is all about the index itself, the data > structure, how it is maintained -- nothing more. It almost seems > obvious to me. Agreed. > > It's still possible to imagine extremes. Extremes that even the "try > to recycle pages we ourselves deleted when we reach the end of > btvacuumscan()" version of my patch cannot deal with. Maybe it really > is true that it's inherently impossible to recycle a deleted page even > at the end of a VACUUM -- maybe a long-running transaction (that could > in principle have a stale link to our deleted page) starts before we > VACUUM, and lasts after VACUUM finishes. So it's just not safe. When > that happens, we're back to having the original problem: we're relying > on some *future* VACUUM operation to do that for us at some indefinite > point in the future. It's fair to wonder: What are the implications of > that? Are we not back to square one? Don't we have the same "what does > 'eventually' really mean" problem once again? > > I think that that's okay, because this remaining case is a *truly* > extreme case (especially with a large index, where index vacuuming > will naturally take a long time). Right. > > It will be rare. But more importantly, the fact that scenario is now > an extreme case justifies treating it as an extreme case. We can teach > _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In > particular, I think that it will now be okay to increase the threshold > applied when considering deleted pages inside > _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the > patch. But in v4, which has the new recycling enhancement, I think > that it would be sensible to make it 5%, or maybe even 10%. This > naturally makes Masahiko's problem scenario unlikely to actually > result in a truly wasted call to btvacuumscan(). The number of pages > that the metapage indicates are "deleted but not yet placed in the > FSM" will be close to the theoretical minimum, because we're no longer > naively throwing away information about which specific pages will be > recyclable soon. Which is what the current approach does, really. > Yeah, increasing the threshold would solve the problem in most cases. Given that nbtree index page deletion is unlikely to happen in practice, having the threshold 5% or 10% seems to avoid the problem in nearly 100% of cases, I think. 
Another idea I come up with (maybe on top of above your idea) is to change btm_oldest_btpo_xact to 64-bit XID and store the *newest* btpo.xact XID among all deleted pages when the total amount of deleted pages exceeds 2% of index. That way, we surely can recycle more than 2% of index when the XID becomes older than the global xmin. Also, maybe we can record deleted pages to FSM even without deferring and check it when re-using. That is, when we get a free page from FSM we check if the page is really recyclable (maybe _bt_getbuf() already does this?). IOW, a deleted page can be recycled only when it's requested to be reused. If btpo.xact is 64-bit XID we never need to worry about the case where a deleted page never be requested to be reused. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
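A simplified, standalone C sketch of the consumer-side check described in the last paragraph (stand-in types; the real code would read the deleted page's special area and apply a proper 64-bit XID visibility test rather than a bare comparison):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t FullTransactionId;

    typedef struct DeletedPage
    {
        bool              deleted;
        FullTransactionId safexid;      /* 64-bit XID stamped at deletion time */
    } DeletedPage;

    /*
     * When a free page comes back from the FSM, only reuse it if no scan that
     * started before the deletion can still hold a link to it.
     */
    static bool
    page_ok_to_reuse(const DeletedPage *page, FullTransactionId oldest_running)
    {
        if (!page->deleted)
            return true;                        /* ordinary free page */
        return page->safexid < oldest_running;  /* placeholder for the real check */
    }

    int
    main(void)
    {
        DeletedPage p = {true, 5000};

        printf("%d\n", page_ok_to_reuse(&p, 4000)); /* 0: not yet, try another page */
        printf("%d\n", page_ok_to_reuse(&p, 6000)); /* 1: safe to recycle now */
        return 0;
    }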
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Yes. I think this would simplify the problem by resolving almost all > problems related to indefinite deferring page recycle. > > We will be able to recycle almost all just-deleted pages in practice > especially when btvacuumscan() took a long time. And there would not > be a noticeable downside, I think. Great! > BTW if btree index starts to use maintenan_work_mem for this purpose, > we also need to set amusemaintenanceworkmem to true which is > considered when parallel vacuum. I was just going to use work_mem. This should be okay. Note that CREATE INDEX uses an additional work_mem allocation when building a unique index, for the second spool/tuplesort. That seems like a precedent that I can follow here. Right now the BTPendingRecycle struct the patch uses to store information about a page that the current VACUUM deleted (and may yet be able to place in the FSM) are each 16 bytes (including alignment overhead). I could probably make them smaller with a little work, but even now that's quite small. Even with the default 4MiB work_mem setting we can fit information about 262144 pages all at once. That's 2GiB worth of deleted index pages, which is generally much more than we'll need. > Yeah, increasing the threshold would solve the problem in most cases. > Given that nbtree index page deletion is unlikely to happen in > practice, having the threshold 5% or 10% seems to avoid the problem in > nearly 100% of cases, I think. Of course it all depends on workload/index characteristics, in the end. It is very rare to delete a percentage of the index that exceeds autovacuum_vacuum_scale_factor -- that's the important thing here IMV. > Another idea I come up with (maybe on top of above your idea) is to > change btm_oldest_btpo_xact to 64-bit XID and store the *newest* > btpo.xact XID among all deleted pages when the total amount of deleted > pages exceeds 2% of index. That way, we surely can recycle more than > 2% of index when the XID becomes older than the global xmin. You could make my basic approach to recycling deleted pages earlier (ideally at the end of the same btvacuumscan() that deleted the pages in the first place) more sophisticated in a variety of ways. These are all subject to diminishing returns, though. I've already managed to recycle close to 100% of all B-Tree pages during the same VACUUM with a very simple approach -- at least if we assume BenchmarkSQL is representative. It is hard to know how much more effort can be justified. To be clear, I'm not saying that an improved design cannot be justified now or in the future (BenchmarkSQL is not like many workloads that people use Postgres for). I'm just saying that I *don't know* where to draw the line. Any particular place that we draw the line feels a little arbitrary to me. This includes my own choice of the work_mem-limited BTPendingRecycle array. My patch currently works that way because it's simple -- no other reason. Any scheme to further improve the "work_mem-limited BTPendingRecycle array" design from my patch boils down to this: A new approach that makes recycling of any remaining deleted pages take place "before too long": After the end of the btvacuumscan() BTPendingRecycle array stuff (presumably that didn't work out in cases where an improved approach matters), but before the next VACUUM takes place (since that will do the required recycling anyway, unless it's unable to do any work at all, in which case it hardly matters). 
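Spelling out the work_mem sizing arithmetic from earlier in this message as a tiny standalone C check (it assumes 16-byte entries and the standard 8KiB block size):

    #include <stddef.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t work_mem_bytes = 4UL * 1024 * 1024;  /* default work_mem = 4MB */
        size_t entry_size     = 16;                 /* one BTPendingRecycle entry, with alignment */
        size_t max_entries    = work_mem_bytes / entry_size;

        printf("entries: %zu\n", max_entries);                      /* 262144 */
        printf("deleted pages covered: %zu GiB\n",
               max_entries * 8192UL / (1024UL * 1024 * 1024));      /* 2 */
        return 0;
    }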
Here are two ideas of my own in this same class as your idea: 1. Remember to do some of the BTPendingRecycle array FSM processing stuff in btvacuumcleanup() -- defer some of the recycling of pages recorded in BTPendingRecycle entries (paged deleted during btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called. Right now btvacuumcleanup() will always do nothing when btbulkdelete() was called earlier. But that's just a current nbtree convention, and is no reason to not do this (we don't have to scan the index again at all). The advantage of teaching btvacuumcleanup() to do this is that it delays the "BTPendingRecycle array FSM processing" stuff until the last moment that it is still easy to use the in-memory array (because we haven't freed it yet). In general, doing it later makes it more likely that we'll successfully recycle the pages. Though in general it might not make any difference -- so we're still hoping that the workload allows us to recycle everything we deleted, without making the design much more complicated than what I posted already. (BTW I see that you reviewed commit 4e514c61, so you must have thought about the trade-off between doing deferred recycling in amvacuumcleanup() vs ambulkdelete(), when to call IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot implement this idea while calling IndexFreeSpaceMapVacuum() during both btvacuumcleanup() and btbulkdelete(), so that we get the best of both worlds -- fast recycling *and* more delayed processing that is more likely to ultimately succeed.) 2. Remember/serialize the BTPendingRecycle array when we realize that we cannot put all recyclable pages in the FSM at the end of the current btvacuumscan(), and then use an autovacuum work item to process them before too long -- a call to AutoVacuumRequestWork() could even serialize the data on disk. Idea 2 has the advantage of allowing retries -- eventually it will be safe to recycle the pages, if we just wait long enough. Anyway, I'm probably not going to pursue either of the 2 ideas for Postgres 14. I'm mentioning these ideas now because the trade-offs show that there is no perfect design for this deferring recycling stuff. Whatever we do, we should accept that there is no perfect design. Actually, there is one more reason why I bring up idea 1 now: I want to hear your thoughts on the index AM API questions now, which idea 1 touches on. Ideally all of the details around the index AM VACUUM APIs (i.e. when and where the extra work happens -- btvacuumcleanup() vs btbulkdelete()) won't need to change much in the future. I worry about getting this index AM API stuff right, at least a little. > Also, maybe we can record deleted pages to FSM even without deferring > and check it when re-using. That is, when we get a free page from FSM > we check if the page is really recyclable (maybe _bt_getbuf() already > does this?). IOW, a deleted page can be recycled only when it's > requested to be reused. If btpo.xact is 64-bit XID we never need to > worry about the case where a deleted page never be requested to be > reused. I've thought about that too (both now and in the past). You're right about _bt_getbuf() -- it checks the XID, at least on the master branch. I took that XID check out in v4 of the patch, but I am now starting to have my doubts about that particular choice. (I'm probably going to restore the XID check in _bt_getbuf in v5 of the patch.) 
I took the XID-is-recyclable check out in v4 of the patch because it might leak pages in rare cases -- which is not a new problem. _bt_getbuf() currently has a remarkably relaxed attitude about leaking pages from the FSM (it is more relaxed about it than I am, certainly) -- but why should we just accept leaking pages like that? My new doubts about it are non-specific, though. We know that the FSM isn't crash safe -- but I think that that reduces to "practically speaking, we can never 100% trust the FSM". Which makes me nervous. I worry that the FSM can do something completely evil and crazy in rare cases. It's not just crash safety. The FSM's fsm_search_avail() function currently changes the fp_next_slot field with only a shared buffer lock held. It's an int, which is supposed to "be atomic on most platforms". But we should be using real atomic ops. So the FSM is generally...kind of wonky. In an ideal world, nbtree page deletion + recycling would have crash safety built in. I don't think that it makes sense to not have free space management without crash safety in the case of index AMs, because it's just not worth it with whole-page units of free space (heapam is another story). A 100% crash-safe design would naturally shift the problem of nbtree page recycle safety from the producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree would be a real improvement. But these long standing FSM issues are not going to change for Postgres 14. And so changing _bt_getbuf() to do clever things with XIDs won't be possible for Postgres 14 IMV. -- Peter Geoghegan
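To illustrate the fp_next_slot concern in isolation, here is a generic C11 sketch -- not the actual freespace.c code, and PostgreSQL itself would use its pg_atomic_* wrappers rather than <stdatomic.h> -- of what an explicitly atomic search hint looks like:

    #include <stdatomic.h>
    #include <stdio.h>

    /*
     * Several backends holding only a shared lock may update the "next slot to
     * try" hint concurrently.  A plain int store merely happens to be atomic on
     * most platforms; an explicit atomic makes the intent and guarantee clear.
     */
    static atomic_int fp_next_slot;

    static void
    advance_next_slot_hint(int found_slot)
    {
        /* relaxed ordering is enough for a pure search hint */
        atomic_store_explicit(&fp_next_slot, found_slot + 1, memory_order_relaxed);
    }

    static int
    read_next_slot_hint(void)
    {
        return atomic_load_explicit(&fp_next_slot, memory_order_relaxed);
    }

    int
    main(void)
    {
        advance_next_slot_hint(41);
        printf("next slot hint: %d\n", read_next_slot_hint());
        return 0;
    }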
On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > Actually, there is one more reason why I bring up idea 1 now: I want > to hear your thoughts on the index AM API questions now, which idea 1 > touches on. Ideally all of the details around the index AM VACUUM APIs > (i.e. when and where the extra work happens -- btvacuumcleanup() vs > btbulkdelete()) won't need to change much in the future. I worry about > getting this index AM API stuff right, at least a little. Speaking of problems like this, I think I spotted an old one: we call _bt_update_meta_cleanup_info() in either btbulkdelete() or btvacuumcleanup(). I think that we should always call it in btvacuumcleanup(), though -- even in cases where there is no call to btvacuumscan() inside btvacuumcleanup() (because btvacuumscan() happened earlier instead, during the btbulkdelete() call). This makes the value of IndexVacuumInfo.num_heap_tuples (which is what we store in the metapage) much more accurate -- right now it's always pg_class.reltuples from *before* the VACUUM started. And so the btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often kind of inaccurate. This "estimate during ambulkdelete" issue is documented here (kind of): /* * Struct for input arguments passed to ambulkdelete and amvacuumcleanup * * num_heap_tuples is accurate only when estimated_count is false; * otherwise it's just an estimate (currently, the estimate is the * prior value of the relation's pg_class.reltuples field, so it could * even be -1). It will always just be an estimate during ambulkdelete. */ typedef struct IndexVacuumInfo { ... } The name of the metapage field is already btm_last_cleanup_num_heap_tuples, which already suggests the approach that I propose now. So why don't we do it like that already? (Thinks some more...) I wonder: did this detail change at the last minute during the development of the feature (just before commit 857f9c36) back in early 2018? That change have made it easier to deal with oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to the patch -- so maybe it's truly an accident that the code doesn't work the way that I suggest it should already. (It's annoying to make state from btbulkdelete() appear in btvacuumcleanup(), unless it's from IndexVacuumInfo or something -- I can imagine this changing at the last minute, just for that reason.) Do you think that this needs to be treated as a bug in the backbranches, Masahiko? I'm not sure... In any case we should probably make this change as part of Postgres 14. Don't you think? It's certainly easy to do it this way now, since there will be no need to keep around a oldestBtpoXact value until btvacuumcleanup() (in the common case where btbulkdelete() is where we call btvacuumscan()). The new btm_last_cleanup_num_delpages field (which replaces btm_oldest_btpo_xact) has a value that just comes from the bulk stats, which is easy anyway. -- Peter Geoghegan
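A rough standalone C sketch of the calling convention being proposed (the struct and function names only mirror the real ones; the point is where the metapage write happens, not how):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct IndexVacuumInfo
    {
        double num_heap_tuples;   /* only an estimate during ambulkdelete */
        bool   estimated_count;   /* accurate by amvacuumcleanup time when false */
    } IndexVacuumInfo;

    static void
    update_meta_cleanup_info(double num_heap_tuples)
    {
        printf("metapage: btm_last_cleanup_num_heap_tuples = %.0f\n", num_heap_tuples);
    }

    /* ambulkdelete path: num_heap_tuples is still the pre-VACUUM reltuples, so don't store it */
    static void
    btbulkdelete_sketch(IndexVacuumInfo *info)
    {
        (void) info;                /* scan the index, delete tuples/pages ... */
    }

    /* amvacuumcleanup path: store the now-accurate value here instead */
    static void
    btvacuumcleanup_sketch(IndexVacuumInfo *info)
    {
        update_meta_cleanup_info(info->num_heap_tuples);
    }

    int
    main(void)
    {
        IndexVacuumInfo info = {12345.0, false};

        btbulkdelete_sketch(&info);
        btvacuumcleanup_sketch(&info);
        return 0;
    }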
On Tue, Feb 16, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Actually, there is one more reason why I bring up idea 1 now: I want > > to hear your thoughts on the index AM API questions now, which idea 1 > > touches on. Ideally all of the details around the index AM VACUUM APIs > > (i.e. when and where the extra work happens -- btvacuumcleanup() vs > > btbulkdelete()) won't need to change much in the future. I worry about > > getting this index AM API stuff right, at least a little. > > Speaking of problems like this, I think I spotted an old one: we call > _bt_update_meta_cleanup_info() in either btbulkdelete() or > btvacuumcleanup(). I think that we should always call it in > btvacuumcleanup(), though -- even in cases where there is no call to > btvacuumscan() inside btvacuumcleanup() (because btvacuumscan() > happened earlier instead, during the btbulkdelete() call). > > This makes the value of IndexVacuumInfo.num_heap_tuples (which is what > we store in the metapage) much more accurate -- right now it's always > pg_class.reltuples from *before* the VACUUM started. And so the > btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often > kind of inaccurate. > > This "estimate during ambulkdelete" issue is documented here (kind of): > > /* > * Struct for input arguments passed to ambulkdelete and amvacuumcleanup > * > * num_heap_tuples is accurate only when estimated_count is false; > * otherwise it's just an estimate (currently, the estimate is the > * prior value of the relation's pg_class.reltuples field, so it could > * even be -1). It will always just be an estimate during ambulkdelete. > */ > typedef struct IndexVacuumInfo > { > ... > } > > The name of the metapage field is already > btm_last_cleanup_num_heap_tuples, which already suggests the approach > that I propose now. So why don't we do it like that already? > > (Thinks some more...) > > I wonder: did this detail change at the last minute during the > development of the feature (just before commit 857f9c36) back in early > 2018? That change have made it easier to deal with > oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to > the patch -- so maybe it's truly an accident that the code doesn't > work the way that I suggest it should already. (It's annoying to make > state from btbulkdelete() appear in btvacuumcleanup(), unless it's > from IndexVacuumInfo or something -- I can imagine this changing at > the last minute, just for that reason.) > > Do you think that this needs to be treated as a bug in the > backbranches, Masahiko? I'm not sure... Ugh, yes, I think it's a bug. When developing this feature, in an old version patch, we used to set invalid values to both btm_oldest_btpo_xact and btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these values. But we decided to set valid values to both even in btbulkdelete(). I believe that decision was correct in terms of btm_oldest_btpo_xact because with the old version patch we will do an unnecessary index scan during btvacuumcleanup(). But it’s wrong in terms of btm_last_cleanup_num_heap_tuples, as you pointed out. This bug would make the check of vacuum_cleanup_index_scale_factor untrust. 
So I think it’s better to backpatch but I think we need to note that to fix this issue properly, in a case where a vacuum called btbulkdelete() earlier, probably we should update only btm_oldest_btpo_xact in btbulkdelete() and then update btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). In this case, we don’t know the oldest btpo.xact among the deleted pages in btvacuumcleanup(). This means that we would need to update the meta page twice, leading to WAL logging twice. Since we already could update the meta page more than once when a vacuum calls btbulkdelete() multiple times I think it would not be a problem, though. > > In any case we should probably make this change as part of Postgres > 14. Don't you think? It's certainly easy to do it this way now, since > there will be no need to keep around a oldestBtpoXact value until > btvacuumcleanup() (in the common case where btbulkdelete() is where we > call btvacuumscan()). The new btm_last_cleanup_num_delpages field > (which replaces btm_oldest_btpo_xact) has a value that just comes from > the bulk stats, which is easy anyway. Agreed. As I mentioned above, we might need to consider how btbulkdelete() can tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a vacuum called btbulkdelete earlier. During parallel vacuum, two different processes could do btbulkdelete() and btvacuumcleanup() respectively. Updating those values separately in those callbacks would be straightforward. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Feb 16, 2021 at 4:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Ugh, yes, I think it's a bug. I was actually thinking of a similar bug in nbtree deduplication when I spotted this one -- see commit 48e12913. The index AM API stuff is tricky. > When developing this feature, in an old version patch, we used to set > invalid values to both btm_oldest_btpo_xact and > btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these > values. But we decided to set valid values to both even in > btbulkdelete(). I believe that decision was correct in terms of > btm_oldest_btpo_xact because with the old version patch we will do an > unnecessary index scan during btvacuumcleanup(). But it’s wrong in > terms of btm_last_cleanup_num_heap_tuples, as you pointed out. Right. > This bug would make the check of vacuum_cleanup_index_scale_factor > untrust. So I think it’s better to backpatch but I think we need to > note that to fix this issue properly, in a case where a vacuum called > btbulkdelete() earlier, probably we should update only > btm_oldest_btpo_xact in btbulkdelete() and then update > btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). In this case, > we don’t know the oldest btpo.xact among the deleted pages in > btvacuumcleanup(). This means that we would need to update the meta > page twice, leading to WAL logging twice. Since we already could > update the meta page more than once when a vacuum calls btbulkdelete() > multiple times I think it would not be a problem, though. I agree that that approach is fine. Realistically, we won't even have to update the metapage twice in most cases. Because most indexes never have even one page deletion anyway. > As I mentioned above, we might need to consider how btbulkdelete() can > tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a > vacuum called btbulkdelete earlier. During parallel vacuum, two > different processes could do btbulkdelete() and btvacuumcleanup() > respectively. Updating those values separately in those callbacks > would be straightforward. I don't see why it should be a problem for my patch/Postgres 14, because we don't have the same btpo.xact/oldestBtpoXact issue that the original Postgres 11 commit dealt with. The patch determines a value for btm_last_cleanup_num_delpages (which I call pages_deleted_not_free) by subtracting fields from the bulk delete stats: we just use "stats->pages_deleted - stats->pages_free". Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine) entitled to rely on the bulk delete stats being set in the way I've described? I assumed that that was okay in general, but I haven't tested parallel VACUUM specifically. Will parallel VACUUM really fail to ensure that values in bulk stats fields (like pages_deleted and pages_free) get set correctly for amvacuumcleanup() callbacks? -- Peter Geoghegan
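The calculation in question is simple enough to show directly; a standalone sketch with stand-in fields mirroring the relevant parts of IndexBulkDeleteResult:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    /* Stand-in for the relevant fields of IndexBulkDeleteResult */
    typedef struct BulkDeleteStats
    {
        BlockNumber pages_deleted;  /* pages found marked deleted, old and new */
        BlockNumber pages_free;     /* deleted pages that could be put in the FSM */
    } BulkDeleteStats;

    /* Value stored as btm_last_cleanup_num_delpages in the patch */
    static BlockNumber
    pages_deleted_not_free(const BulkDeleteStats *stats)
    {
        return stats->pages_deleted - stats->pages_free;
    }

    int
    main(void)
    {
        BulkDeleteStats stats = {38, 35};

        printf("deleted but not yet recyclable: %u\n",
               (unsigned) pages_deleted_not_free(&stats));
        return 0;
    }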
On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote: > Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine) > entitled to rely on the bulk delete stats being set in the way I've > described? I assumed that that was okay in general, but I haven't > tested parallel VACUUM specifically. Will parallel VACUUM really fail > to ensure that values in bulk stats fields (like pages_deleted and > pages_free) get set correctly for amvacuumcleanup() callbacks? I tested the pages_deleted_not_free stuff with a version of my patch that consistently calls _bt_update_meta_cleanup_info() during btvacuumcleanup(), and never during btbulkdelete(). And it works just fine -- including with parallel VACUUM. Evidently my understanding of what btvacuumcleanup() (or any other amvacuumcleanup() routine) can expect from bulk delete stats was correct. It doesn't matter whether or not parallel VACUUM happens to be involved -- it works just as well. This is good news, since of course it means that it's okay to stick to the simple approach of calculating pages_deleted_not_free. Passing pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to _bt_update_meta_cleanup_info() during btvacuumcleanup() works just as well when combined with my fix for the the "IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()" bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug creates no new problems for my patch. There is still no need to think about when or how the relevant bulk delete fields (pages_deleted and pages_free) were set. And it doesn't matter whether or not parallel VACUUM is involved. (Of course it's also true that we can't do that on the backbranches. Purely because we must worry about btpo.xact/oldestBtpoXact on the backbranches. We'll probably have to teach the code in released versions to set btm_oldest_btpo_xact and btm_last_cleanup_num_heap_tuples in separate calls -- since there is no easy way to "send" the oldestBtpoXact value determined during a btbulkdelete() to a later corresponding btvacuumcleanup(). That's a bit of a kludge, but I'm not worried about it.) -- Peter Geoghegan
On Wed, Feb 17, 2021 at 5:41 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine) > > entitled to rely on the bulk delete stats being set in the way I've > > described? I assumed that that was okay in general, but I haven't > > tested parallel VACUUM specifically. Will parallel VACUUM really fail > > to ensure that values in bulk stats fields (like pages_deleted and > > pages_free) get set correctly for amvacuumcleanup() callbacks? > > I tested the pages_deleted_not_free stuff with a version of my patch > that consistently calls _bt_update_meta_cleanup_info() during > btvacuumcleanup(), and never during btbulkdelete(). And it works just > fine -- including with parallel VACUUM. > > Evidently my understanding of what btvacuumcleanup() (or any other > amvacuumcleanup() routine) can expect from bulk delete stats was > correct. It doesn't matter whether or not parallel VACUUM happens to > be involved -- it works just as well. Yes, you're right. I missed that pages_deleted_not_free is calculated by (stats->pages_deleted - stats->pages_free) where both are in IndexBulkDeleteResult. > > This is good news, since of course it means that it's okay to stick to > the simple approach of calculating pages_deleted_not_free. Passing > pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to > _bt_update_meta_cleanup_info() during btvacuumcleanup() works just as > well when combined with my fix for the the > "IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()" > bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug > creates no new problems for my patch. There is still no need to think > about when or how the relevant bulk delete fields (pages_deleted and > pages_free) were set. And it doesn't matter whether or not parallel > VACUUM is involved. Agreed. > > (Of course it's also true that we can't do that on the backbranches. > Purely because we must worry about btpo.xact/oldestBtpoXact on the > backbranches. We'll probably have to teach the code in released > versions to set btm_oldest_btpo_xact and > btm_last_cleanup_num_heap_tuples in separate calls -- since there is > no easy way to "send" the oldestBtpoXact value determined during a > btbulkdelete() to a later corresponding btvacuumcleanup(). That's a > bit of a kludge, but I'm not worried about it.) Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Feb 16, 2021 at 12:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Yes. I think this would simplify the problem by resolving almost all > > problems related to indefinite deferring page recycle. > > > > We will be able to recycle almost all just-deleted pages in practice > > especially when btvacuumscan() took a long time. And there would not > > be a noticeable downside, I think. > > Great! > > > BTW if btree index starts to use maintenan_work_mem for this purpose, > > we also need to set amusemaintenanceworkmem to true which is > > considered when parallel vacuum. > > I was just going to use work_mem. This should be okay. Note that > CREATE INDEX uses an additional work_mem allocation when building a > unique index, for the second spool/tuplesort. That seems like a > precedent that I can follow here. > > Right now the BTPendingRecycle struct the patch uses to store > information about a page that the current VACUUM deleted (and may yet > be able to place in the FSM) are each 16 bytes (including alignment > overhead). I could probably make them smaller with a little work, but > even now that's quite small. Even with the default 4MiB work_mem > setting we can fit information about 262144 pages all at once. That's > 2GiB worth of deleted index pages, which is generally much more than > we'll need. Cool. > > > Yeah, increasing the threshold would solve the problem in most cases. > > Given that nbtree index page deletion is unlikely to happen in > > practice, having the threshold 5% or 10% seems to avoid the problem in > > nearly 100% of cases, I think. > > Of course it all depends on workload/index characteristics, in the > end. It is very rare to delete a percentage of the index that exceeds > autovacuum_vacuum_scale_factor -- that's the important thing here IMV. > > > Another idea I come up with (maybe on top of above your idea) is to > > change btm_oldest_btpo_xact to 64-bit XID and store the *newest* > > btpo.xact XID among all deleted pages when the total amount of deleted > > pages exceeds 2% of index. That way, we surely can recycle more than > > 2% of index when the XID becomes older than the global xmin. > > You could make my basic approach to recycling deleted pages earlier > (ideally at the end of the same btvacuumscan() that deleted the pages > in the first place) more sophisticated in a variety of ways. These are > all subject to diminishing returns, though. > > I've already managed to recycle close to 100% of all B-Tree pages > during the same VACUUM with a very simple approach -- at least if we > assume BenchmarkSQL is representative. It is hard to know how much > more effort can be justified. To be clear, I'm not saying that an > improved design cannot be justified now or in the future (BenchmarkSQL > is not like many workloads that people use Postgres for). I'm just > saying that I *don't know* where to draw the line. Any particular > place that we draw the line feels a little arbitrary to me. This > includes my own choice of the work_mem-limited BTPendingRecycle array. > My patch currently works that way because it's simple -- no other > reason. 
> > Any scheme to further improve the "work_mem-limited BTPendingRecycle > array" design from my patch boils down to this: A new approach that > makes recycling of any remaining deleted pages take place "before too > long": After the end of the btvacuumscan() BTPendingRecycle array > stuff (presumably that didn't work out in cases where an improved > approach matters), but before the next VACUUM takes place (since that > will do the required recycling anyway, unless it's unable to do any > work at all, in which case it hardly matters). I agreed with this direction. > Here are two ideas of > my own in this same class as your idea: > > 1. Remember to do some of the BTPendingRecycle array FSM processing > stuff in btvacuumcleanup() -- defer some of the recycling of pages > recorded in BTPendingRecycle entries (paged deleted during > btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called. > > Right now btvacuumcleanup() will always do nothing when btbulkdelete() > was called earlier. But that's just a current nbtree convention, and > is no reason to not do this (we don't have to scan the index again at > all). The advantage of teaching btvacuumcleanup() to do this is that > it delays the "BTPendingRecycle array FSM processing" stuff until the > last moment that it is still easy to use the in-memory array (because > we haven't freed it yet). In general, doing it later makes it more > likely that we'll successfully recycle the pages. Though in general it > might not make any difference -- so we're still hoping that the > workload allows us to recycle everything we deleted, without making > the design much more complicated than what I posted already. > > (BTW I see that you reviewed commit 4e514c61, so you must have thought > about the trade-off between doing deferred recycling in > amvacuumcleanup() vs ambulkdelete(), when to call > IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot > implement this idea while calling IndexFreeSpaceMapVacuum() during > both btvacuumcleanup() and btbulkdelete(), so that we get the best of > both worlds -- fast recycling *and* more delayed processing that is > more likely to ultimately succeed.) I think this idea 1 also needs to serialize BTPendingRecycle array somewhere to pass it to a parallel vacuum worker in parallel vacuum case. Delaying the "BTPendingRecycle array FSM processing" stuff until btvacuumcleanup() is a good idea. But I think it's a relatively rare case in practice where index vacuum runs more than once (i.g., using up maintenance_work_mem). So considering the development cost of serializing BTPendingRecycle array and index AM API changes, attempting to recycle the deleted pages at the end of btvacuumscan() would be a balanced strategy. > > 2. Remember/serialize the BTPendingRecycle array when we realize that > we cannot put all recyclable pages in the FSM at the end of the > current btvacuumscan(), and then use an autovacuum work item to > process them before too long -- a call to AutoVacuumRequestWork() > could even serialize the data on disk. > > Idea 2 has the advantage of allowing retries -- eventually it will be > safe to recycle the pages, if we just wait long enough. This is a good idea too. Perhaps autovacuum needs to end up with an error to retry later again in case where it could not recycle all deleted pages. I have thought too about the idea to store pending-recycle pages somewhere to avoid index scan when we do the XID-is-recyclable check. 
My idea was to store them to btree pages dedicated for this purpose linked from the meta page but I prefer your idea. > > Anyway, I'm probably not going to pursue either of the 2 ideas for > Postgres 14. I'm mentioning these ideas now because the trade-offs > show that there is no perfect design for this deferring recycling > stuff. Whatever we do, we should accept that there is no perfect > design. > > Actually, there is one more reason why I bring up idea 1 now: I want > to hear your thoughts on the index AM API questions now, which idea 1 > touches on. Ideally all of the details around the index AM VACUUM APIs > (i.e. when and where the extra work happens -- btvacuumcleanup() vs > btbulkdelete()) won't need to change much in the future. I worry about > getting this index AM API stuff right, at least a little. After introducing parallel vacuum, index AMs are not able to pass arbitary information taken in ambulkdlete() to amvacuumcleanup(), like old gist index code does. If there is a good use case where needs to pass arbitary information to amvacuumcleanup(), I think it'd be a good idea to add an index AM API so that parallel vacuum serialize it and tells another parallel vacuum worker. But, as I mentinoed above, given that vacuum calls ambulkdelete() only once in most cases and I think we’d like to improve how to store TIDs in maintenance_work_mem space (discussed a little on thread[1]), delaying "the BTPendingRecycle array FSM processing stuff in btvacuumcleanup()” would not be a good usecase. > > > Also, maybe we can record deleted pages to FSM even without deferring > > and check it when re-using. That is, when we get a free page from FSM > > we check if the page is really recyclable (maybe _bt_getbuf() already > > does this?). IOW, a deleted page can be recycled only when it's > > requested to be reused. If btpo.xact is 64-bit XID we never need to > > worry about the case where a deleted page never be requested to be > > reused. > > I've thought about that too (both now and in the past). You're right > about _bt_getbuf() -- it checks the XID, at least on the master > branch. I took that XID check out in v4 of the patch, but I am now > starting to have my doubts about that particular choice. (I'm probably > going to restore the XID check in _bt_getbuf in v5 of the patch.) > > I took the XID-is-recyclable check out in v4 of the patch because it > might leak pages in rare cases -- which is not a new problem. > _bt_getbuf() currently has a remarkably relaxed attitude about leaking > pages from the FSM (it is more relaxed about it than I am, certainly) > -- but why should we just accept leaking pages like that? My new > doubts about it are non-specific, though. We know that the FSM isn't > crash safe -- but I think that that reduces to "practically speaking, > we can never 100% trust the FSM". Which makes me nervous. I worry that > the FSM can do something completely evil and crazy in rare cases. > > It's not just crash safety. The FSM's fsm_search_avail() function > currently changes the fp_next_slot field with only a shared buffer > lock held. It's an int, which is supposed to "be atomic on most > platforms". But we should be using real atomic ops. So the FSM is > generally...kind of wonky. > > In an ideal world, nbtree page deletion + recycling would have crash > safety built in. 
I don't think that it makes sense to not have free > space management without crash safety in the case of index AMs, > because it's just not worth it with whole-page units of free space > (heapam is another story). A 100% crash-safe design would naturally > shift the problem of nbtree page recycle safety from the > producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree > would be a real improvement. But these long standing FSM issues are > not going to change for Postgres 14. And so changing _bt_getbuf() to > do clever things with XIDs won't be possible for Postgres 14 IMV. Agreed. Thanks for your explanation. Regards, [1] https://www.postgresql.org/message-id/flat/CA%2Bfd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Agreed. Thanks for your explanation. Attached is v5, which has some of the changes I talked about. Changes from v4 include: * Now only updates metapage during btvacuumcleanup() in the first patch, which is enough to fix the existing IndexVacuumInfo.num_heap_tuples issue. * Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia. * The second patch in the series now respects work_mem when sizing the BTPendingRecycle array. * New enhancement to the XID GlobalVisCheckRemovableFullXid() test used in the second patch, to allow it to recycle even more pages. (Still unsure of some of the details here.) I would like to commit the first patch in a few days -- I refer to the big patch that makes deleted page XIDs 64-bit/full. Can you take a look at that one, Masahiko? That would be helpful. I can produce a bug fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I think that that should be written after the first patch is finalized and committed. The second patch (the new recycling optimization) will require more work and testing. Thanks! -- Peter Geoghegan
On Fri, Feb 19, 2021 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Agreed. Thanks for your explanation. > > Attached is v5, which has some of the changes I talked about. Changes > from v4 include: > > * Now only updates metapage during btvacuumcleanup() in the first > patch, which is enough to fix the existing > IndexVacuumInfo.num_heap_tuples issue. > > * Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia. > > * The second patch in the series now respects work_mem when sizing the > BTPendingRecycle array. > > * New enhancement to the XID GlobalVisCheckRemovableFullXid() test > used in the second patch, to allow it to recycle even more pages. > (Still unsure of some of the details here.) Thank you for updating the patch! > > I would like to commit the first patch in a few days -- I refer to the > big patch that makes deleted page XIDs 64-bit/full. Can you take a > look at that one, Masahiko? That would be helpful. I can produce a bug > fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I > think that that should be written after the first patch is finalized > and committed. I'll look at the first patch first. > > The second patch (the new recycling optimization) will require more > work and testing. Then also look at those patches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > Here is another bitrot-fix-only revision, v9. Just the recycling patch again. I committed the final nbtree page deletion patch just now -- the one that attempts to make recycling happen for newly deleted pages. Thanks for all your work on patch review, Masahiko! I'll close out the CF item for this patch series now. -- Peter Geoghegan
On Mon, Mar 22, 2021 at 7:27 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Here is another bitrot-fix-only revision, v9. Just the recycling patch again. > > I committed the final nbtree page deletion patch just now -- the one > that attempts to make recycling happen for newly deleted pages. Thanks > for all your work on patch review, Masahiko! You're welcome! Those are really good improvements. By this patch series, btree indexes became like hash indexes in terms of amvacuumcleanup. We do an index scan at btvacuumcleanup() in the two cases: metapage upgrading and more than 5% deleted-but-not-yet-recycled pages. Both cases seem rare cases. So do we want to disable parallel index cleanup for btree indexes like hash indexes? That is, remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from amparallelvacuumoptions. IMO we can live with the current configuration just in case where the user runs into such rare situations (especially for the latter case). In most cases, parallel vacuum workers for index cleanup might exit with no-op but the side-effect (wasting resources and overhead etc) would not be big. If we want to enable it only in particular cases, we would need to have another way for index AM to tell lazy vacuum whether or not to allow a parallel worker to process the index at that time. What do you think? I’m not sure we need changes but I think it’s worth discussing here. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
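For context, the change being weighed here is essentially an adjustment to what nbtree advertises through its handler. A hedged, standalone sketch (the flag macros are redefined locally purely for illustration; the real ones live in vacuum.h and are assigned to amparallelvacuumoptions in the AM handler):

    #include <stdio.h>

    /* Local stand-ins for the flag bits declared in src/include/commands/vacuum.h */
    #define VACUUM_OPTION_PARALLEL_BULKDEL      (1 << 0)
    #define VACUUM_OPTION_PARALLEL_COND_CLEANUP (1 << 1)

    int
    main(void)
    {
        /* What nbtree advertises today: parallel bulk delete, cleanup only if needed */
        int current = VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;

        /* The change being discussed: drop conditional parallel cleanup */
        int proposed = VACUUM_OPTION_PARALLEL_BULKDEL;

        printf("current=%d proposed=%d\n", current, proposed);
        return 0;
    }

Advertising only the bulk-delete flag would mirror what is described above for hash indexes.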
On Tue, Mar 23, 2021 at 8:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With this patch series, btree indexes have become like hash indexes in terms of amvacuumcleanup: we only do an index scan in btvacuumcleanup() in two cases -- metapage upgrading, and more than 5% of pages being deleted but not yet recycled. Both seem like rare cases. So do we want to disable parallel index cleanup for btree indexes, as we do for hash indexes? That is, remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from amparallelvacuumoptions.

My recent "Recycle nbtree pages deleted during same VACUUM" commit improved the efficiency of recycling, but I still think that it was a bit of a hack. Or at least it didn't go far enough in fixing the old design, which is itself a bit of a hack.

As I said back on February 15, a truly good design for nbtree page deletion + recycling would have crash safety built in. If page deletion itself is crash safe, it really makes sense to make everything crash safe (especially because we're managing large chunks of equisized free space, unlike in heapam). And as I also said back then, a 100% crash-safe design could naturally shift the problem of nbtree page recycle safety from the producer/VACUUM side to the consumer/_bt_getbuf() side. It should be completely separated from when VACUUM runs, and from whatever VACUUM can discover about recycle safety in passing at the end.

That approach would completely eliminate the need to do any work in btvacuumcleanup(), which would make it natural to remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from nbtree -- the implementation of btvacuumcleanup() would just look like hashvacuumcleanup() does now. It could do practically nothing, making this 100% okay.

For now I have my doubts that it is appropriate to make this change. It seems as if the question of whether or not VACUUM_OPTION_PARALLEL_COND_CLEANUP should be used is basically the same question as "Does the vacuumcleanup() callback for this index AM look exactly like hashvacuumcleanup()?".

> IMO we can live with the current configuration, just in case a user runs into one of these rare situations (especially the latter one). In most cases parallel vacuum workers launched for index cleanup will effectively be no-ops, but the side effects (wasted resources, overhead, etc.) would not be big. If we wanted to enable it only in particular cases, we would need another way for the index AM to tell lazy vacuum whether or not to allow a parallel worker to process the index at that time. What do you think?

I am concerned about unintended consequences, like never noticing that we should really recycle known deleted pages not yet placed in the FSM (it's hard to think through very rare cases like this with confidence). Is it really so bad if we launch parallel workers that we don't really need for a parallel VACUUM?

--
Peter Geoghegan
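To illustrate the comparison with hashvacuumcleanup(): a near-no-op cleanup callback along the lines Peter describes might look like the sketch below. The function name is hypothetical, and the premise -- recycle safety decided entirely on the _bt_getbuf() side -- is the imagined future design, not code from any patch in this thread.

#include "postgres.h"

#include "access/genam.h"		/* IndexVacuumInfo, IndexBulkDeleteResult */
#include "storage/bufmgr.h"		/* RelationGetNumberOfBlocks() */

/*
 * Sketch of a "does practically nothing" cleanup callback, modeled on
 * hashvacuumcleanup().  Assumes VACUUM has nothing left to decide about
 * recycle safety here.
 */
IndexBulkDeleteResult *
btvacuumcleanup_sketch(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
	/* If btbulkdelete() was never called, there is nothing to report */
	if (stats == NULL)
		return NULL;

	/* Refresh the page count; no index scan, no recycling work */
	stats->num_pages = RelationGetNumberOfBlocks(info->index);

	return stats;
}

A cleanup callback this cheap is exactly the case where conditionally launching parallel workers for cleanup buys nothing, which is the link Peter draws between the two questions.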
On Wed, Mar 24, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am concerned about unintended consequences, like never noticing that we should really recycle known deleted pages not yet placed in the FSM (it's hard to think through very rare cases like this with confidence). Is it really so bad if we launch parallel workers that we don't really need for a parallel VACUUM?

I don't think it's too bad even if we launch parallel workers for indexes that don't really need to be processed in parallel. Parallel workers exit immediately after all indexes are vacuumed, so it would not affect other parallel operations. And nothing changes in terms of DSM usage, since btree indexes already support parallel bulkdelete.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/