Thread: 64-bit XIDs in deleted nbtree pages

64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].

Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.

The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.
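
For illustration, the general shape of the change might look something like
this: today the special area's btpo union holds either the level or a 32-bit
xact, whereas a sketch of the new arrangement (the BTDeletedPageData/safexid
names here are my own, not necessarily what the patch uses) keeps the level as
a plain field and stores a 64-bit XID in the deleted page itself:

typedef struct BTPageOpaqueData          /* sketch of a possible new layout */
{
    BlockNumber btpo_prev;               /* left sibling, or P_NONE */
    BlockNumber btpo_next;               /* right sibling, or P_NONE */
    uint32      btpo_level;              /* tree level, valid even when deleted */
    uint16      btpo_flags;              /* flag bits (BTP_DELETED, BTP_LEAF, ...) */
    BTCycleId   btpo_cycleid;            /* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef struct BTDeletedPageData         /* hypothetical; stored in a deleted
                                          * page's main content area */
{
    FullTransactionId safexid;           /* OK to recycle once this XID is
                                          * older than every running xact */
} BTDeletedPageData;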

Of course it will still be possible for the FSM to leak deleted nbtree
index pages with the patch -- in general the FSM isn't crash safe.
That isn't so bad with the patch, though, because a subsequent VACUUM
will eventually notice the really old deleted pages, and add them back
to the FSM once again. This will always happen because
VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become
confused about the age of deleted pages, even when they're really old.
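
The reason the code can no longer become confused is that the recyclability
test reduces to a plain comparison of 64-bit values. A minimal sketch,
assuming the hypothetical BTDeletedPageData layout above and a caller-supplied
64-bit horizon (the real check in the patch may be structured differently):

static bool
deleted_page_safe_to_recycle(BTDeletedPageData *deleted,
                             FullTransactionId safe_horizon)
{
    /*
     * With full 64-bit transaction IDs this comparison can never be fooled
     * by wraparound, no matter how long ago the page was deleted.
     */
    return FullTransactionIdPrecedes(deleted->safexid, safe_horizon);
}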

The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.

Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:

"38 index pages have been deleted, 38 are currently reusable."

With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:

"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."

This means that out of the 38 pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.

In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).
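
Putting the three numbers side by side, the relationships described above can
be summarized like this (a sketch only; the counter names are illustrative
rather than the patch's actual field names):

/*
 *   pages_newly_deleted - pages marked BTP_DELETED by THIS VACUUM operation
 *   pages_deleted       - all BTP_DELETED pages encountered, old and new
 *   pages_free          - deleted pages that are already safe to recycle
 */
static void
check_deletion_counters(BlockNumber pages_newly_deleted,
                        BlockNumber pages_deleted,
                        BlockNumber pages_free)
{
    /* every newly deleted page is also counted as deleted */
    Assert(pages_newly_deleted <= pages_deleted);

    /* a page can never become reusable within the VACUUM that deleted it */
    Assert(pages_free <= pages_deleted - pages_newly_deleted);
}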

Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.

[1] https://postgr.es/m/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems.

There is an unresolved question for my deleted page XID patch: what
should it do about the vacuum_cleanup_index_scale_factor feature,
which added an XID to the metapage (its btm_oldest_btpo_xact field). I
refer to the work done by commit 857f9c36cda for Postgres 11 by
Masahiko. It would be good to get your opinion on this as the original
author of that feature, Masahiko.

To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
all deleted pages in the index, so clearly it needs to be carefully
considered in my patch to make the XIDs 64-bit. Even still, v1 of my
patch from today more or less ignores the issue -- it just gets a
32-bit version of the oldest value as determined by the oldestBtpoXact
XID tracking stuff (which is largely unchanged, except that it works
with 64-bit Full Transaction Ids now).

Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
field to wrap around in v1 of my patch. The obvious thing to do here
is to add a new epoch metapage field, effectively making
btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
idea. The only reason that we have the btm_oldest_btpo_xact field in
the first place is to ameliorate the problem that the patch
comprehensively solves! We should stop storing *any* XIDs in the
metapage. (Besides, adding a new "epoch" field to the metapage would
be relatively messy.)

Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:

1. Stop maintaining the oldest XID among all deleted pages in the
entire nbtree index during VACUUM. So we can remove all of the
BTVacState.oldestBtpoXact XID tracking stuff, which is currently
something that even _bt_pagedel() needs special handling for.

2. Stop considering the btm_oldest_btpo_xact metapage field in
_bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
cares about maintaining reasonably accurate statistics for the index.
Which is really how the vacuum_cleanup_index_scale_factor feature was
intended to work all along, anyway -- ISTM that the oldestBtpoXact
stuff was always just an afterthought to paper-over this annoying
32-bit XID issue.

3. We cannot actually remove the btm_oldest_btpo_xact XID field from
the metapage, because of course that would change the BTMetaPageData
struct layout, which breaks on-disk compatibility. But why not use it
for something useful instead? _bt_update_meta_cleanup_info() can use
the same field to store the number of "newly deleted" pages from the
last btbulkdelete() instead. (See my email from earlier for the
definition of "newly deleted".)

4. Now _bt_vacuum_needs_cleanup() can once again consider the
btm_oldest_btpo_xact metapage field -- except in a totally different
way, because now it means something totally different: "newly deleted
pages during last btbulkdelete() call" (per item 3). If this # pages
is very high then we probably should do a full call to btvacuumscan()
-- _bt_vacuum_needs_cleanup() will return true to make that happen.

It's unlikely but still possible that a high number of "newly deleted
pages during the last btbulkdelete() call" is in itself a good enough
reason to do a full btvacuumscan() call when the question of calling
btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
here conservatively covers that. Maybe the 32-bit-XID-in-metapage
triggering condition had some non-obvious value due to a natural
tendency for it to limit the number of deleted pages that go
unrecycled for a long time. (Or maybe there never really was any such
natural tendency -- still seems like a good idea to make the change
described by item 4.)

Even though we are conservative (at least in this sense I just
described), we nevertheless don't actually care about very old deleted
pages that we have not yet recycled -- provided there are not very
many of them. I'm thinking of "~2% of index" as the new "newly deleted
during last btbulkdelete() call" threshold applied within
_bt_vacuum_needs_cleanup(). There is no good reason why older
deleted-but-not-yet-recycled pages should be considered more valuable
than any other page that can be used when there is a page split.
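
As a very rough sketch of how points 3 and 4 might fit together with that
threshold (the names and the hard-coded 2% are placeholders; the real logic
in _bt_vacuum_needs_cleanup() would also keep its existing statistics-based
checks):

static bool
needs_cleanup_for_deleted_pages(BlockNumber prev_pages_newly_deleted,
                                BlockNumber num_index_pages)
{
    /*
     * prev_pages_newly_deleted is read from the repurposed
     * btm_oldest_btpo_xact metapage field: the number of pages newly
     * deleted by the last btbulkdelete() call.
     */
    if ((double) prev_pages_newly_deleted >
        (double) num_index_pages * 0.02)
        return true;            /* worth a btvacuumscan() just to recycle */

    return false;               /* defer to the usual statistics checks */
}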

Observations about on-disk compatibility with my patch + this 4 point scheme:

A. It doesn't matter that pg_upgrade'd indexes will have an XID value
in btm_oldest_btpo_xact that now gets incorrectly interpreted as
"newly deleted pages during last btbulkdelete() call" under the 4
point scheme I just outlined.

The spurious value will get cleaned up on the next VACUUM anyway
(whether VACUUM goes through btbulkdelete() or through
btvacuumcleanup()). Besides, most indexes always have a
btm_oldest_btpo_xact value of 0.

B. The patch I posted earlier doesn't actually care about the
BTREE_VERSION of the index at all. And neither does any of the stuff I
just described for a future v2 of my patch.

All indexes can use the new format for deleted pages. On-disk
compatibility is easy here because the contents of deleted pages only
need to work as a tombstone. We can safely assume that old-format
deleted pages (pre-Postgres 14 format deleted pages) must be safe to
recycle, because the pg_upgrade itself restarts Postgres. There can be
no backends that have dangling references to the old-format deleted
page.

C. All supported nbtree versions (all nbtree versions
BTREE_MIN_VERSION+) get the same benefits under this scheme.

Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
BTREE_NOVAC_VERSION/version 3 indexes via a call to
_bt_upgrademetapage() -- that has been the case since BTREE_VERSION
was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
have the btm_oldest_btpo_xact metapage field that I now propose to
reuse to track "newly deleted pages during last btbulkdelete() call",
per point 4.

In summary: There are no special cases here. No BTREE_VERSION related
difficulties. That seems like a huge advantage to me.

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Tue, Feb 9, 2021 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Here is a plan that allows us to stop storing any kind of XID in the
> metapage in all cases:

Attached is v2, which deals with the metapage 32-bit
XID/btm_oldest_btpo_xact issue using the approach I described earlier.
We don't store an XID in the metapage anymore in v2. This seems to
work well, as I expected it would.

-- 
Peter Geoghegan

Attachment

Re: 64-bit XIDs in deleted nbtree pages

From: Heikki Linnakangas
On 10/02/2021 00:14, Peter Geoghegan wrote:
> There is a long standing problem with the way that nbtree page
> deletion places deleted pages in the FSM for recycling: The use of a
> 32-bit XID within the deleted page (in the special
> area's/BTPageOpaqueData struct's btpo.xact field) is not robust
> against XID wraparound, which can lead to permanently leaking pages in
> a variety of scenarios. The problems became worse with the addition of
> the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
> in this context creates risk for any further improvements in VACUUM
> that similarly involve skipping whole indexes. For example, Masahiko
> has been working on a patch that teaches VACUUM to skip indexes that
> are known to have very little garbage [2].
> 
> Attached patch series fixes the issue once and for all. This is
> something that I'm targeting for Postgres 14, since it's more or less
> a bug fix.

Thanks for picking this up!

> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems. The first patch also makes
> the level field stored in nbtree page's special area/BTPageOpaqueData
> reliably store the level, even in a deleted page. This allows me to
> consistently use the level field within amcheck, including even within
> deleted pages.

Is it really worth the trouble to maintain 'level' on deleted pages? All 
you currently do with it is check that the BTP_LEAF flag is set iff 
"level == 0", which seems pointless. I guess there could be some 
forensic value in keeping 'level', but meh.

> The second patch in the series adds new information to VACUUM VERBOSE.
> This makes it easy to understand what's going on here. Index page
> deletion related output becomes more useful. It might also help with
> debugging the first patch.
> 
> Currently, VACUUM VERBOSE output for an index that has some page
> deletions looks like this:
> 
> "38 index pages have been deleted, 38 are currently reusable."
> 
> With the second patch applied, we might see this output at the same
> point in VACUUM VERBOSE output instead:
> 
> "38 index pages have been deleted, 0 are newly deleted, 38 are
> currently reusable."
> 
> This means that out of the 38 pages that were found to be
> marked deleted in the index, 0 were deleted by the VACUUM operation
> whose output we see here. That is, there were 0 nbtree pages that were
> newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
> *this particular* VACUUM -- the VACUUM operation that we see
> instrumentation about here. It follows that the 38 deleted pages that
> we encountered must have been marked BTP_DELETED by some previous
> VACUUM operation.
> 
> In practice the "%u are currently reusable" output should never
> include newly deleted pages, since there is no way that a page marked
> BTP_DELETED can be put in the FSM during the same VACUUM operation --
> that's unsafe (we need all of this recycling/XID indirection precisely
> because we need to delay recycling until it is truly safe, of course).
> Note that the "%u index pages have been deleted" output includes both
> pages deleted by some previous VACUUM operation, and newly deleted
> pages (no change there).
> 
> Note that the new "newly deleted" output is instrumentation about this
> particular *VACUUM operation*. In contrast, the other two existing
> output numbers ("deleted" and "currently reusable") are actually
> instrumentation about the state of the *index as a whole* at a point
> in time (barring concurrent recycling of pages counted in VACUUM by
> some random _bt_getbuf() call in another backend). This fundamental
> distinction is important here. All 3 numbers/stats that we output can
> have different values, which can be used to debug the first patch. You
> can directly observe uncommon cases just from the VERBOSE output, like
> when a long running transaction holds up recycling of a deleted page
> that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
> And so if the first patch had any bugs, there'd be a pretty good
> chance that you could observe them using multiple VACUUM VERBOSE
> operations -- you might notice something inconsistent or contradictory
> just by examining the output over time, how things change, etc.

The full message on master is:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.

How about:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.

The idea is that the first DETAIL line now says what the VACUUM did this 
round, and the last line says what the state of the index is now. One 
concern with that phrasing is that it might not be clear what "686 pages 
were removed" means. We don't actually shrink the file. Then again, I'm 
not sure if the "have been deleted" was any better in that regard.

It's still a bit weird that the "what VACUUM did this round" information 
is sandwiched between the two other lines that talk about the state of 
the index after the operation. But I think the language now makes it 
more clear which is which. Or perhaps flip the INFO and first DETAIL 
lines around like this:

INFO: 250000 index row versions and 686 pages were removed from index 
"foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

For context, the more full message you get on master is:

postgres=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 250000 row versions
DETAIL:  CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
INFO:  "foo": removed 250000 row versions in 1107 pages
DETAIL:  CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "foo": found 250000 removable, 271 nonremovable row versions in 
1108 out of 4425 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1164
There were 87 unused item identifiers.
Skipped 0 pages due to buffer pins, 2212 frozen pages.
0 pages are entirely empty.
CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.28 s.
VACUUM

That's pretty confusing: it's a mix of basically progress indicators
(vacuuming "public.foo"), CPU measurements, information about what was 
removed, and what the state is afterwards. Would be nice to make that 
more clear overall. But for now, for this particular INFO message, 
perhaps make it more consistent with the lines printed by heapam, like this:

INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.

- Heikki



Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada
On Wed, Feb 10, 2021 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The first patch teaches nbtree to use 64-bit transaction IDs here, and
> > so makes it impossible to leak deleted nbtree pages. This patch is the
> > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> > due to exactly the same set of problems.

Thank you for working on this!

>
> There is an unresolved question for my deleted page XID patch: what
> should it do about the vacuum_cleanup_index_scale_factor feature,
> which added an XID to the metapage (its btm_oldest_btpo_xact field). I
> refer to the work done by commit 857f9c36cda for Postgres 11 by
> Masahiko. It would be good to get your opinion on this as the original
> author of that feature, Masahiko.
>
> To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
> all deleted pages in the index, so clearly it needs to be carefully
> considered in my patch to make the XIDs 64-bit. Even still, v1 of my
> patch from today more or less ignores the issue -- it just gets a
> 32-bit version of the oldest value as determined by the oldestBtpoXact
> XID tracking stuff (which is largely unchanged, except that it works
> with 64-bit Full Transaction Ids now).
>
> Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
> field to wrap around in v1 of my patch. The obvious thing to do here
> is to add a new epoch metapage field, effectively making
> btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
> idea. The only reason that we have the btm_oldest_btpo_xact field in
> the first place is to ameliorate the problem that the patch
> comprehensively solves! We should stop storing *any* XIDs in the
> metapage. (Besides, adding a new "epoch" field to the metapage would
> be relatively messy.)

I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.

The purpose of btm_oldest_btpo_xact is to prevent deleted pages from
being leaked. As you mentioned, it holds the oldest btpo.xact in
BTPageOpaqueData among all deleted pages in the index. Looking back to
the time when we developed the INDEX_CLEANUP option: if we skip index
cleanup (meaning both ambulkdelete and amvacuumcleanup), there was a
problem in btree indexes where deleted pages could never be recycled if
the XID wrapped around. So the idea behind btm_oldest_btpo_xact is that
we remember the oldest btpo.xact among all deleted pages and do
btvacuumscan() if this value is older than the global xmin (meaning
there is at least one recyclable page). That way, we can recycle the
deleted pages without leaking them (of course, unless INDEX_CLEANUP is
disabled).

Given that we can guarantee that deleted pages are never leaked by
using 64-bit XIDs, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to update
the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
think this is a more desirable behavior. Not skipping amvacuumcleanup
if there is even one deleted page that we can recycle is very
conservative.

Considering your idea of keeping the count of newly deleted pages in
the meta page, I can see a little value in keeping btm_oldest_btpo_xact
and making it a 64-bit XID. I describe this below.

>
> Here is a plan that allows us to stop storing any kind of XID in the
> metapage in all cases:
>
> 1. Stop maintaining the oldest XID among all deleted pages in the
> entire nbtree index during VACUUM. So we can remove all of the
> BTVacState.oldestBtpoXact XID tracking stuff, which is currently
> something that even _bt_pagedel() needs special handling for.
>
> 2. Stop considering the btm_oldest_btpo_xact metapage field in
> _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
> cares about maintaining reasonably accurate statistics for the index.
> Which is really how the vacuum_cleanup_index_scale_factor feature was
> intended to work all along, anyway -- ISTM that the oldestBtpoXact
> stuff was always just an afterthought to paper-over this annoying
> 32-bit XID issue.
>
> 3. We cannot actually remove the btm_oldest_btpo_xact XID field from
> the metapage, because of course that would change the BTMetaPageData
> struct layout, which breaks on-disk compatibility. But why not use it
> for something useful instead? _bt_update_meta_cleanup_info() can use
> the same field to store the number of "newly deleted" pages from the
> last btbulkdelete() instead. (See my email from earlier for the
> definition of "newly deleted".)
>
> 4. Now _bt_vacuum_needs_cleanup() can once again consider the
> btm_oldest_btpo_xact metapage field -- except in a totally different
> way, because now it means something totally different: "newly deleted
> pages during last btbulkdelete() call" (per item 3). If this # pages
> is very high then we probably should do a full call to btvacuumscan()
> -- _bt_vacuum_needs_cleanup() will return true to make that happen.
>
> It's unlikely but still possible that a high number of "newly deleted
> pages during the last btbulkdelete() call" is in itself a good enough
> reason to do a full btvacuumscan() call when the question of calling
> btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
> here conservatively covers that. Maybe the 32-bit-XID-in-metapage
> triggering condition had some non-obvious value due to a natural
> tendency for it to limit the number of deleted pages that go
> unrecycled for a long time. (Or maybe there never really was any such
> natural tendency -- still seems like a good idea to make the change
> described by item 4.)
>
> Even though we are conservative (at least in this sense I just
> described), we nevertheless don't actually care about very old deleted
> pages that we have not yet recycled -- provided there are not very
> many of them. I'm thinking of "~2% of index" as the new "newly deleted
> during last btbulkdelete() call" threshold applied within
> _bt_vacuum_needs_cleanup(). There is no good reason why older
> deleted-but-not-yet-recycled pages should be considered more valuable
> than any other page that can be used when there is a page split.

Interesting.

I like this idea of triggering btvacuumscan() if there are many newly
deleted pages. I think this would be helpful especially in the case of
a bulk deletion on the table. But why do we use the number of *newly*
deleted pages rather than the total number of deleted pages in the index?
IIUC, if each of several btbulkdelete() executions deleted fewer pages
than 2% of the index, and those deleted pages could not be recycled yet,
then the number of recyclable pages would exceed 2% of the index in
total, but amvacuumcleanup() would not trigger btvacuumscan() because
the pages newly deleted by the last btbulkdelete() are below the 2%
threshold. I might be missing something, though.

Also, we need to note that having newly deleted pages doesn't
necessarily mean they are recyclable at that time. If the global xmin
is still older than the deleted pages' btpo.xact values, we still
cannot recycle them. I think btm_oldest_btpo_xact will probably help
in this case. That is, we store the oldest btpo.xact among those newly
deleted pages in btm_oldest_btpo_xact, and we trigger btvacuumscan()
if there are many newly deleted pages (more than 2% of the index) and
btm_oldest_btpo_xact is older than the global xmin (I suppose each
newly deleted page could have a different btpo.xact).

>
> Observations about on-disk compatibility with my patch + this 4 point scheme:
>
> A. It doesn't matter that pg_upgrade'd indexes will have an XID value
> in btm_oldest_btpo_xact that now gets incorrectly interpreted as
> "newly deleted pages during last btbulkdelete() call" under the 4
> point scheme I just outlined.
>
> The spurious value will get cleaned up on the next VACUUM anyway
> (whether VACUUM goes through btbulkdelete() or through
> btvacuumcleanup()). Besides, most indexes always have a
> btm_oldest_btpo_xact value of 0.
>
> B. The patch I posted earlier doesn't actually care about the
> BTREE_VERSION of the index at all. And neither does any of the stuff I
> just described for a future v2 of my patch.
>
> All indexes can use the new format for deleted pages. On-disk
> compatibility is easy here because the contents of deleted pages only
> need to work as a tombstone. We can safely assume that old-format
> deleted pages (pre-Postgres 14 format deleted pages) must be safe to
> recycle, because the pg_upgrade itself restarts Postgres. There can be
> no backends that have dangling references to the old-format deleted
> page.
>
> C. All supported nbtree versions (all nbtree versions
> BTREE_MIN_VERSION+) get the same benefits under this scheme.
>
> Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
> BTREE_NOVAC_VERSION/version 3 indexes via a call to
> _bt_upgrademetapage() -- that has been the case since BTREE_VERSION
> was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
> vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
> have the btm_oldest_btpo_xact metapage field that I now propose to
> reuse to track "newly deleted pages during last btbulkdelete() call",
> per point 4.
>
> In summary: There are no special cases here. No BTREE_VERSION related
> difficulties. That seems like a huge advantage to me.

Great! I'll look at the v2 patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks for picking this up!

I actually had a patch for this in 2019, albeit one that remained in
rough shape until recently. Must have forgotten about it.

> Is it really worth the trouble to maintain 'level' on deleted pages? All
> you currently do with it is check that the BTP_LEAF flag is set iff
> "level == 0", which seems pointless. I guess there could be some
> forensic value in keeping 'level', but meh.

What trouble is that? The only way in which it's inconvenient is that
we have to include the level field in xl_btree_unlink_page WAL records
for the first time. The structure of the relevant REDO routine (which
is called btree_xlog_unlink_page()) ought to explicitly recreate the
original page from scratch, without any special cases. This makes it
possible to pretend that there never was such a thing as an nbtree
page whose level field could not be relied on. I personally think that
it's simpler when seen in the wider context of how the code works and
is verified.

Besides, there is also amcheck to consider. I am a big believer in
amcheck, and see it as something that has enabled my work on the
B-Tree code over the past few years. Preserving the level field in
deleted pages increases our coverage just a little, and practically
eliminates cases where we cannot rely on the level field.

Of course it's still true that this detail (the deleted pages level
field question) will probably never seem important to anybody else. To
me it's one small detail of a broader strategy. No one detail of that
broader strategy, taken in isolation, will ever be crucially
important.

Of course it's also true that we should not assume that a very high
cost in performance/code/whatever can justify a much smaller benefit
in amcheck. But you haven't really explained why the cost seems
unacceptable to you. (Perhaps I missed something.)

> How about:
>
> INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
> DETAIL:  250000 index row versions and 686 pages were removed.
> 2056 index pages are now unused, 1370 are currently reusable.
>
> The idea is that the first DETAIL line now says what the VACUUM did this
> round, and the last line says what the state of the index is now. One
> concern with that phrasing is that it might not be clear what "686 pages
> were removed" means.

> It's still a bit weird that the "what VACUUM did this round" information
> is sandwiched between the two other lines that talk about the state of
> the index after the operation. But I think the language now makes it
> more clear which is which.

IMV our immediate goal for the new VACUUM VERBOSE output should be to
make the output as accurate and descriptive as possible (while still
using terminology that works for all index AMs, not just nbtree). I
don't think that we should give too much weight to making the
information easy to understand in isolation. Because that's basically
impossible -- it just doesn't work that way IME.

Confusion over the accounting of "deleted pages in indexes" vs "pages
deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to
see one vintage example. The relevant output of VACUUM VERBOSE
produced inconsistent results for perhaps as long as 15 years before I
noticed it and fixed it. I somehow didn't notice this despite using it
for various tests for my own B-Tree projects a year or two before the
fix. Tests that produced inconsistent results that I noticed pretty
early on, and yet assumed were all down to some subtlety that I didn't
yet understand.

My point is this: I am quite prepared to admit that these details
really are complicated. But that's not incidental to what's really
going on, or anything (though I agree with your later remarks on the
general tidiness of VACUUM VERBOSE -- it is a real dog's dinner).

I'm not saying that we should assume that no DBA will find the
relevant VACUUM VERBOSE output useful -- I don't think that at all. It
will be kind of rare for a user to really comb through it. But that's
mostly because big problems in this area are themselves kind of rare
(most individual indexes never have any deleted pages IME).

Any DBA consuming this output sensibly will consume it in a way that
makes sense in the *context of the problem that they're experiencing*,
whatever that might mean for them. They'll consider how it changes
over time for the same index. They'll try to correlate it with other
symptoms, or other problems, and make sense of it in a top-down
fashion. We should try to make it as descriptive as possible so that
DBAs will have the breadcrumbs they need to tie it back to whatever
the core issue happens to be -- maybe they'll have to read the source
code to get to the bottom of it. It's likely to be some rare issue in
those cases where the DBA really cares about the details -- it's
likely to be workload dependent.

Good DBAs spend much of their time on exceptional problems -- all the
easy problems will have been automated away already. Things like wait
events are popular with DBAs for this reason.

> Or perhaps flip the INFO and first DETAIL
> lines around like this:

> INFO: 250000 index row versions and 686 pages were removed from index
> "foo_pkey"
> DETAIL: index now contains 250001 row versions in 2745 pages.
> 2056 index pages are now unused, of which 1370 are currently reusable.
>
> For context, the more full message you get on master is:

> That's pretty confusing, it's a mix of basically progress indicators
> (vacuuming "public.foo"), CPU measurements, information about what was
> removed, and what the state is afterwards.

I agree that the output of VACUUM VERBOSE is messy. It's probably a
bunch of accretions that made sense in isolation, but added up to a
big mess over time. So I agree: now would be a good time to do
something about that.

It would also be nice to find a way to get this information in the
logs when log_autovacuum is enabled (perhaps only when the verbosity
is increased). I've discussed this with Masahiko in the context of his
recent work, actually. Even before we started talking about the XID
page deletion problem that I'm fixing here.

> INFO: "foo_pkey": removed 250000 index row versions and 686 pages
> DETAIL: index now contains 250001 row versions in 2745 pages.
> 2056 index pages are now unused, of which 1370 are currently reusable.

I can see what you mean here, and maybe we should do roughly what
you've outlined. Still, we should use terminology that isn't too far
removed from what actually happens in nbtree. What's a "removed" page?
The distinction between all of the different kinds of index pages that
might be involved here is just subtle. Again, better to use a precise,
descriptive term that nobody fully understands -- because hardly
anybody will fully understand it anyway (even including advanced users
that go on to find the VACUUM VERBOSE output very useful for whatever
reason).

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for working on this!

I'm glad that I finally found time for it! It seems like it'll make
things easier elsewhere.

Attached is v3 of the patch. I'll describe the changes I made in more
detail in my response to your points below.

> I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> of recycling deleted pages.

Cool.

> Given that we can guarantee that deleted pages are never leaked by
> using 64-bit XIDs, I also think we don't need this value. We can do
> amvacuumcleanup only if the table receives enough insertions to update
> the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
> think this is a more desirable behavior. Not skipping amvacuumcleanup
> if there is even one deleted page that we can recycle is very
> conservative.
>
> Considering your idea of keeping the count of newly deleted pages in
> the meta page, I can see a little value in keeping btm_oldest_btpo_xact
> and making it a 64-bit XID. I describe this below.

> Interesting.
>
> I like this idea of triggering btvacuumscan() if there are many newly
> deleted pages. I think this would be helpful especially in the case of
> a bulk deletion on the table. But why do we use the number of *newly*
> deleted pages rather than the total number of deleted pages in the index?

I was unclear here -- I should not have said "newly deleted" pages at
all. What I actually do when calling _bt_vacuum_needs_cleanup() is
this (from v3, at the end of btvacuumscan()):

-   _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+   Assert(stats->pages_deleted >= stats->pages_free);
+   pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+   _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
                                 info->num_heap_tuples);

We're actually passing something I have called
"pages_deleted_not_free" here, which is derived from the bulk delete
stats in the obvious way that you see here (subtraction). I'm not
using pages_newly_deleted at all now. Note also that the behavior
inside _bt_update_meta_cleanup_info() no longer varies based on
whether it is called during btvacuumcleanup() or during btbulkdelete()
-- the same rules apply either way. We want to store
pages_deleted_not_free in the metapage at the end of btvacuumscan(),
no matter what.

This same pages_deleted_not_free information is now used by
_bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
high (over 2.5%), then that will trigger a call to btbulkdelete() (we
won't skip scanning the index). Though in practice it probably won't
come up that often -- there just aren't ever that many deleted pages
in most indexes.

> IIUC, if each of several btbulkdelete() executions deleted fewer pages
> than 2% of the index, and those deleted pages could not be recycled yet,
> then the number of recyclable pages would exceed 2% of the index in
> total, but amvacuumcleanup() would not trigger btvacuumscan() because
> the pages newly deleted by the last btbulkdelete() are below the 2%
> threshold. I might be missing something, though.

I think you're right -- my idea of varying the behavior of
_bt_update_meta_cleanup_info() based on whether it's being called
during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
half the problem was that I explained the idea badly to begin with).
But, as I said, it's fixed in v3: we simply pass
"pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
now.

Does that make sense? Does it address this concern?

> Also, we need to note that having newly deleted pages doesn't
> necessarily mean they are recyclable at that time. If the global xmin
> is still older than the deleted pages' btpo.xact values, we still
> cannot recycle them. I think btm_oldest_btpo_xact will probably help
> in this case. That is, we store the oldest btpo.xact among those newly
> deleted pages in btm_oldest_btpo_xact, and we trigger btvacuumscan()
> if there are many newly deleted pages (more than 2% of the index) and
> btm_oldest_btpo_xact is older than the global xmin (I suppose each
> newly deleted page could have a different btpo.xact).

I agree that having no XID in the metapage creates a new small
problem. Specifically, there are certain narrow cases that can cause
confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
exist before my patch (kind of).

The simplest example is easy to run into when debugging the patch on
your laptop. Because you're using your personal laptop, and not a real
production server, there will be no concurrent sessions that might
consume XIDs. You can run VACUUM VERBOSE manually several times, but
that alone will never be enough to enable VACUUM to recycle any of the
pages that the first VACUUM manages to delete (that is, mark them deleted,
reporting the pages as "newly deleted" via the new instrumentation
from the second patch). Note that the master branch is *also* unable
to recycle these deleted pages, simply because the "safe xid" never
gets old because there are no newly allocated XIDs to make it look old
(there are no allocated XIDs just because nothing else happens). That
in itself is not the new problem.

The new problem is that _bt_vacuum_needs_cleanup() will no longer
notice that the oldest XID among deleted-but-not-yet-recycled pages is
so old that it will not be able to recycle the pages anyway -- at
least not the oldest page, though in this specific case that will
apply to all deleted pages equally. We might as well not bother trying
yet, which the old code "gets right" -- but it doesn't get it right
for any good reason. That is, the old code won't have VACUUM scan the
index at all, so it "wins" in this specific scenario.

I think that's okay, though -- it's not a real problem, and actually
makes sense and has other advantages. This is why I believe it's okay:

* We really should never VACUUM the same table before even one or two
XIDs are allocated -- that's what happens in the simple laptop test
scenario that I described. Surely we should not be too concerned about
"doing the right thing" under this totally artificial set of
conditions.

(BTW, I've been using txid_current() for my own "laptop testing", as a
way to work around this issue.)

* More generally, if you really can't do recycling of pages that you
deleted during the last VACUUM during this VACUUM (perhaps because of
the presence of a long-running xact that holds open a snapshot), then
you have lots of *huge* problems already, and this is the least of
your concerns. Besides, at that point an affected VACUUM will be doing
work for an affected index through a btbulkdelete() call, so the
behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.

* As you pointed out already, the oldest XID/deleted page from the
index may be significantly older than the newest. Why should we bucket
them together?

We could easily have a case where most of the deleted pages can be
recycled -- even when all of the pages were originally marked deleted by
the same VACUUM operation. If there are lots of pages that actually
can be recycled, it is probably a bad thing to assume that the oldest
XID is representative of all of them. After all, with the patch we
only go out of our way to recycle deleted pages when we are almost
sure that the total number of recyclable pages (pages marked deleted
during a previous VACUUM) exceeds 2.5% of the total size of the index.
That broad constraint is important here -- if we do nothing unless
there are lots of deleted pages anyway, we are highly unlikely to ever
err on the side of being too eager (not eager enough seems more likely
to me).

I think that we're justified in making a general assumption inside
_bt_vacuum_needs_cleanup() (which is documented at the point that we
call it, inside btvacuumscan()): The assumption that however many
index pages the metapage says we'll be able to recycle (whatever the
field says) will in fact turn out to be recyclable if we decide that
we need to. There are specific cases where that will be kind of wrong,
as I've gone into, but the assumption/design has many more advantages
than disadvantages.

I have tried to capture this in v3 of the patch. Can you take a look?
See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
comments when we call it inside btvacuumscan().

Do you think that those new comments are helpful? Does this address
your concern?

Thanks
-- 
Peter Geoghegan

Attachment

Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Wed, Feb 10, 2021 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v3 of the patch. I'll describe the changes I made in more
> detail in my response to your points below.

I forget to mention that v3 adds several assertions like this one:

Assert(!_bt_page_recyclable(BufferGetPage(buf)));

These appear at a few key points inside generic routines like
_bt_getbuf(). The overall effect is that every nbtree buffer access
(with the exception of buffer accesses by VACUUM) will make sure that
the page that they're about to access is not recyclable (a page that
an index scan lands on might be half-dead or deleted, but it had
better not be recyclable).
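
For instance, the check might sit in the buffer-access path roughly like
this (a sketch only; the real _bt_getbuf() does considerably more):

static Buffer
bt_getbuf_read_sketch(Relation rel, BlockNumber blkno)
{
    Buffer      buf = ReadBuffer(rel, blkno);

    LockBuffer(buf, BT_READ);
    _bt_checkpage(rel, buf);

    /*
     * Nothing except VACUUM should ever land on a page that is already
     * eligible for recycling; if this fires, a page was recycled too early.
     */
    Assert(!_bt_page_recyclable(BufferGetPage(buf)));

    return buf;
}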

This can probably catch problems with recycling pages too early, such
as the problem fixed by commit d3abbbeb back in 2012. Any similar bugs
in this area that may appear in the future can be expected to be very
subtle, for a few reasons. For one, a page can be recyclable but not
yet entered into the FSM by VACUUM for a long time. (I could go on.)

The assertions dramatically improve our chances of catching problems
like that early.

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada
On Thu, Feb 11, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Thank you for working on this!
>
> I'm glad that I finally found time for it! It seems like it'll make
> things easier elsewhere.
>
> > Attached is v3 of the patch. I'll describe the changes I made in more
> detail in my response to your points below.
>
> > I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> > of recycling deleted pages.
>
> Cool.
>
> > Given that we can guarantee that deleted pages are never leaked by
> > using 64-bit XIDs, I also think we don't need this value. We can do
> > amvacuumcleanup only if the table receives enough insertions to update
> > the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
> > think this is a more desirable behavior. Not skipping amvacuumcleanup
> > if there is even one deleted page that we can recycle is very
> > conservative.
> >
> > Considering your idea of keeping the count of newly deleted pages in
> > the meta page, I can see a little value in keeping btm_oldest_btpo_xact
> > and making it a 64-bit XID. I describe this below.
>
> > Interesting.
> >
> > I like this idea of triggering btvacuumscan() if there are many newly
> > deleted pages. I think this would be helpful especially in the case of
> > a bulk deletion on the table. But why do we use the number of *newly*
> > deleted pages rather than the total number of deleted pages in the index?
>
> I was unclear here -- I should not have said "newly deleted" pages at
> all. What I actually do when calling _bt_vacuum_needs_cleanup() is
> this (from v3, at the end of btvacuumscan()):
>
> -   _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
> +   Assert(stats->pages_deleted >= stats->pages_free);
> +   pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
> +   _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
>                                  info->num_heap_tuples);
>
> We're actually passing something I have called
> "pages_deleted_not_free" here, which is derived from the bulk delete
> stats in the obvious way that you see here (subtraction). I'm not
> using pages_newly_deleted at all now. Note also that the behavior
> inside _bt_update_meta_cleanup_info() no longer varies based on
> whether it is called during btvacuumcleanup() or during btbulkdelete()
> -- the same rules apply either way. We want to store
> pages_deleted_not_free in the metapage at the end of btvacuumscan(),
> no matter what.
>
> This same pages_deleted_not_free information is now used by
> _bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
> high (over 2.5%), then that will trigger a call to btbulkdelete() (we
> won't skip scanning the index). Though in practice it probably won't
> come up that often -- there just aren't ever that many deleted pages
> in most indexes.

Thanks for your explanation. That makes sense to me.

>
> > IIUC, if each of several btbulkdelete() executions deleted fewer pages
> > than 2% of the index, and those deleted pages could not be recycled yet,
> > then the number of recyclable pages would exceed 2% of the index in
> > total, but amvacuumcleanup() would not trigger btvacuumscan() because
> > the pages newly deleted by the last btbulkdelete() are below the 2%
> > threshold. I might be missing something, though.
>
> I think you're right -- my idea of varying the behavior of
> _bt_update_meta_cleanup_info() based on whether it's being called
> during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
> half the problem was that I explained the idea badly to begin with).
> But, as I said, it's fixed in v3: we simply pass
> "pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
> now.
>
> Does that make sense? Does it address this concern?

Yes!

>
> > Also, we need to note that having newly deleted pages doesn't
> > necessarily mean they are recyclable at that time. If the global xmin
> > is still older than the deleted pages' btpo.xact values, we still
> > cannot recycle them. I think btm_oldest_btpo_xact will probably help
> > in this case. That is, we store the oldest btpo.xact among those newly
> > deleted pages in btm_oldest_btpo_xact, and we trigger btvacuumscan()
> > if there are many newly deleted pages (more than 2% of the index) and
> > btm_oldest_btpo_xact is older than the global xmin (I suppose each
> > newly deleted page could have a different btpo.xact).
>
> I agree that having no XID in the metapage creates a new small
> problem. Specifically, there are certain narrow cases that can cause
> confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
> exist before my patch (kind of).
>
> The simplest example is easy to run into when debugging the patch on
> your laptop. Because you're using your personal laptop, and not a real
> production server, there will be no concurrent sessions that might
> consume XIDs. You can run VACUUM VERBOSE manually several times, but
> that alone will never be enough to enable VACUUM to recycle any of the
> pages that the first VACUUM manages to delete (that is, mark them deleted,
> reporting the pages as "newly deleted" via the new instrumentation
> from the second patch). Note that the master branch is *also* unable
> to recycle these deleted pages, simply because the "safe xid" never
> gets old because there are no newly allocated XIDs to make it look old
> (there are no allocated XIDs just because nothing else happens). That
> in itself is not the new problem.
>
> The new problem is that _bt_vacuum_needs_cleanup() will no longer
> notice that the oldest XID among deleted-but-not-yet-recycled pages is
> so old that it will not be able to recycle the pages anyway -- at
> least not the oldest page, though in this specific case that will
> apply to all deleted pages equally. We might as well not bother trying
> yet, which the old code "gets right" -- but it doesn't get it right
> for any good reason. That is, the old code won't have VACUUM scan the
> index at all, so it "wins" in this specific scenario.

I'm on the same page.

>
> I think that's okay, though -- it's not a real problem, and actually
> makes sense and has other advantages. This is why I believe it's okay:
>
> * We really should never VACUUM the same table before even one or two
> XIDs are allocated -- that's what happens in the simple laptop test
> scenario that I described. Surely we should not be too concerned about
> "doing the right thing" under this totally artificial set of
> conditions.

Right.

>
> (BTW, I've been using txid_current() for my own "laptop testing", as a
> way to work around this issue.)
>
> * More generally, if you really can't do recycling of pages that you
> deleted during the last VACUUM during this VACUUM (perhaps because of
> the presence of a long-running xact that holds open a snapshot), then
> you have lots of *huge* problems already, and this is the least of
> your concerns. Besides, at that point an affected VACUUM will be doing
> work for an affected index through a btbulkdelete() call, so the
> behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
>

I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well: after a bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions, based on
autovacuum_vacuum_insert_scale_factor/threshold, autovacuum will run on
the table again. But if there is a long-running query, a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This would not happen with the old
code. I agree this is a DBA problem, but it also means this could bring
another new problem in the long-running query case.

> * As you pointed out already, the oldest XID/deleted page from the
> index may be significantly older than the newest. Why should we bucket
> them together?

I agree with this point.

>
> We could easily have a case where most of the deleted pages can be
> recycled -- even when all of the pages were originally marked deleted by
> the same VACUUM operation. If there are lots of pages that actually
> can be recycled, it is probably a bad thing to assume that the oldest
> XID is representative of all of them. After all, with the patch we
> only go out of our way to recycle deleted pages when we are almost
> sure that the total number of recyclable pages (pages marked deleted
> during a previous VACUUM) exceeds 2.5% of the total size of the index.
> That broad constraint is important here -- if we do nothing unless
> there are lots of deleted pages anyway, we are highly unlikely to ever
> err on the side of being too eager (not eager enough seems more likely
> to me).
>
> I think that we're justified in making a general assumption inside
> _bt_vacuum_needs_cleanup() (which is documented at the point that we
> call it, inside btvacuumscan()): The assumption that however many
> index pages the metapage says we'll be able to recycle (whatever the
> field says) will in fact turn out to be recyclable if we decide that
> we need to. There are specific cases where that will be kind of wrong,
> as I've gone into, but the assumption/design has many more advantages
> than disadvantages.
>
> I have tried to capture this in v3 of the patch. Can you take a look?
> See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
> comments when we call it inside btvacuumscan().

I basically agree with the change made in the v3 patch. But I think it's
probably worth having a discussion of the append-only table case, with
autovacuums triggered by
autovacuum_vacuum_insert_scale_factor/threshold.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I agree that there already are huge problems in that case. But I think
> we need to consider an append-only case as well: after a bulk deletion
> on an append-only table, vacuum deletes heap tuples and index tuples,
> marking some index pages as dead and setting an XID into btpo.xact.
> Since we trigger autovacuums even by insertions, based on
> autovacuum_vacuum_insert_scale_factor/threshold, autovacuum will run on
> the table again. But if there is a long-running query, a "wasted"
> cleanup scan could happen many times depending on the values of
> autovacuum_vacuum_insert_scale_factor/threshold and
> vacuum_cleanup_index_scale_factor. This would not happen with the old
> code. I agree this is a DBA problem, but it also means this could bring
> another new problem in the long-running query case.

I see your point.

This is not a problem with the old code only because the oldest XID in
the metapage happens to restrict VACUUM in a way that turns out to be
exactly right. But why assume that? It's actually rather unlikely
that we won't be able to free even one block, even in this scenario.
The oldest XID isn't truly special -- at least not without the
restrictions that go with 32-bit XIDs.

The other thing is that vacuum_cleanup_index_scale_factor is mostly
about limiting how long we'll go before having stale statistics, and
so presumably the user gets the benefit of not having stale statistics
(maybe that theory is a bit questionable in some cases, but that
doesn't have all that much to do with page deletion -- in fact the
problem exists without page deletion ever occurring).

BTW, I am thinking about making recycling take place for pages that
were deleted during the same VACUUM. We can just use a
work_mem-limited array to remember a list of blocks that are deleted
but not yet recyclable (plus the XID found in the block). At the end
of the VACUUM, (just before calling IndexFreeSpaceMapVacuum() from
within btvacuumscan()), we can then determine which blocks are now
safe to recycle, and recycle them after all using some "late" calls to
RecordFreeIndexPage() (and without revisiting the pages a second
time). No need to wait for the next VACUUM to recycle pages this way,
at least in many common cases. The reality is that it usually doesn't
take very long for a deleted page to become recyclable -- why wait?

This idea is enabled by commit c79f6df75dd from 2018. I think it's the
next logical step.
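
To make the bookkeeping concrete, here is a minimal standalone sketch of
the idea (BTPendingRecycle, the block/XID pairing, and the
end-of-btvacuumscan() pass come from the description above; the typedefs,
the visibility test, and the FSM call are simplified stand-ins for the
real nbtree and FSM code, not the actual patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;          /* stand-in for the Postgres typedef */
typedef uint64_t SafeXid;              /* stand-in for a full 64-bit XID */

typedef struct BTPendingRecycle
{
    BlockNumber blkno;                 /* page we marked BTP_DELETED */
    SafeXid     safexid;               /* XID stamped on the deleted page */
} BTPendingRecycle;

typedef struct BTPendingRecycleArray
{
    BTPendingRecycle *entries;
    int               nentries;
    int               maxentries;      /* bounded by work_mem */
} BTPendingRecycleArray;

/* stand-in for a GlobalVisCheckRemovableFullXid()-style test */
bool
safexid_is_recyclable(SafeXid safexid, SafeXid oldest_running)
{
    return safexid < oldest_running;
}

/* stand-in for RecordFreeIndexPage() */
void
record_free_index_page(BlockNumber blkno)
{
    (void) blkno;                      /* would put the page in the FSM */
}

/*
 * Run once at the end of btvacuumscan(), just before
 * IndexFreeSpaceMapVacuum(): recycle whatever has become safe, and simply
 * leave the rest for a future VACUUM.
 */
void
recycle_pending_pages(const BTPendingRecycleArray *arr, SafeXid oldest_running)
{
    for (int i = 0; i < arr->nentries; i++)
    {
        if (safexid_is_recyclable(arr->entries[i].safexid, oldest_running))
            record_free_index_page(arr->entries[i].blkno);
    }
}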

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Victor Yegorov
Date:
On Sat, Feb 13, 2021 at 5:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > (BTW, I've been using txid_current() for my own "laptop testing", as a
> > way to work around this issue.)
> >
> > * More generally, if you really can't do recycling of pages that you
> > deleted during the last VACUUM during this VACUUM (perhaps because of
> > the presence of a long-running xact that holds open a snapshot), then
> > you have lots of *huge* problems already, and this is the least of
> > your concerns. Besides, at that point an affected VACUUM will be doing
> > work for an affected index through a btbulkdelete() call, so the
> > behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
>

> I agree that there already are huge problems in that case. But I think
> we need to consider an append-only case as well; after bulk deletion
> on an append-only table, vacuum deletes heap tuples and index tuples,
> marking some index pages as dead and setting an XID into btpo.xact.
> Since we trigger autovacuums even by insertions based on
> autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> the table again. But if there is a long-running query a "wasted"
> cleanup scan could happen many times depending on the values of
> autovacuum_vacuum_insert_scale_factor/threshold and
> vacuum_cleanup_index_scale_factor. This should not happen in the old
> code. I agree this is DBA problem but it also means this could bring
> another new problem in a long-running query case.

I'd like to outline one relevant case.

Quite often bulk deletes are done on time series data (the oldest data) and effectively
remove a continuous chunk of data at the (physical) beginning of the table;
this is especially true for append-only tables.
After the delete, planning queries takes a long time, because MergeJoin estimates
use IndexScans ( see https://postgr.es/m/17467.1426090533@sss.pgh.pa.us ).
Right now we have to disable MergeJoins via ALTER SYSTEM to mitigate this.

So I would, actually, like it very much for VACUUM to kick in sooner in such cases.

--
Victor Yegorov

Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Fri, Feb 12, 2021 at 10:27 PM Victor Yegorov <vyegorov@gmail.com> wrote:
> I'd like to outline one relevant case.
>
> Quite often bulk deletes are done on time series data (the oldest data) and effectively
> remove a continuous chunk of data at the (physical) beginning of the table;
> this is especially true for append-only tables.
> After the delete, planning queries takes a long time, because MergeJoin estimates
> use IndexScans ( see https://postgr.es/m/17467.1426090533@sss.pgh.pa.us ).
> Right now we have to disable MergeJoins via ALTER SYSTEM to mitigate this.
>
> So I would, actually, like it very much for VACUUM to kick in sooner in such cases.

Masahiko was specifically concerned about workloads with
bursty/uneven/mixed VACUUM triggering conditions -- he mentioned
autovacuum_vacuum_insert_scale_factor/threshold as being applied to
trigger a second VACUUM (which follows from an initial VACUUM that
performs deletions following a bulk DELETE).

A VACUUM that needs to delete index tuples will do its btvacuumscan()
through the btbulkdelete() path, not through the btvacuumcleanup()
"cleanup only" path. The btbulkdelete() path won't ever call
_bt_vacuum_needs_cleanup() in the first place, and so there can be no
risk that the relevant changes (changes that the patch makes to that
function) will have some new bad effect. The problem that you have
described seems very real, but it doesn't seem relevant to the
specific scenario that Masahiko expressed concern about. Nor does it
seem relevant to this patch more generally.
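
For reference, here is a simplified standalone sketch of that control
flow (the names are stand-ins for btvacuumcleanup(),
_bt_vacuum_needs_cleanup() and btvacuumscan(); only the structure is
meant to match):

#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    unsigned long pages_deleted;       /* pages marked deleted, old and new */
    unsigned long pages_free;          /* deleted pages placed in the FSM */
} BulkStats;

/* stand-in for the _bt_vacuum_needs_cleanup() threshold test */
bool
cleanup_scan_is_needed(void)
{
    return false;
}

/* stand-in for the full btvacuumscan() pass over the index */
void
scan_whole_index(BulkStats *stats)
{
    (void) stats;
}

/* stand-in for btvacuumcleanup(): stats is NULL iff btbulkdelete() never ran */
BulkStats *
vacuumcleanup_sketch(BulkStats *stats)
{
    if (stats == NULL)
    {
        /* "cleanup only" VACUUM: no index tuples had to be deleted */
        if (!cleanup_scan_is_needed())
            return NULL;               /* skip scanning the index entirely */
        stats = calloc(1, sizeof(*stats));
        scan_whole_index(stats);
    }

    /*
     * When btbulkdelete() already ran, the index was scanned through that
     * path instead, and the threshold test above is never reached.
     */
    return stats;
}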

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I agree that there already are huge problems in that case. But I think
> > we need to consider an append-only case as well; after bulk deletion
> > on an append-only table, vacuum deletes heap tuples and index tuples,
> > marking some index pages as dead and setting an XID into btpo.xact.
> > Since we trigger autovacuums even by insertions based on
> > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> > the table again. But if there is a long-running query a "wasted"
> > cleanup scan could happen many times depending on the values of
> > autovacuum_vacuum_insert_scale_factor/threshold and
> > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > code. I agree this is DBA problem but it also means this could bring
> > another new problem in a long-running query case.
>
> I see your point.

My guess is that this concern of yours is somehow related to how we do
deletion and recycling *in general*. Currently (and even in v3 of the
patch), we assume that recycling the pages that a VACUUM operation
deletes will happen "eventually". This kind of makes sense when you
have "typical vacuuming" -- deletes/updates, and no big bursts, rare
bulk deletes, etc.

But when you do have a mixture of different triggering conditions,
which is quite possible, it is difficult to understand what
"eventually" actually means...

> BTW, I am thinking about making recycling take place for pages that
> were deleted during the same VACUUM. We can just use a
> work_mem-limited array to remember a list of blocks that are deleted
> but not yet recyclable (plus the XID found in the block).

...which brings me back to this idea.

I've prototyped this. It works really well. In most cases the
prototype makes VACUUM operations with nbtree index page deletions
also recycle the pages that were deleted, at the end of the
btvacuumscan(). We do very little or no "indefinite deferring" work
here. This has obvious advantages, of course, but it also has a
non-obvious advantage: the awkward question concerning "what
eventually actually means" with mixed triggering conditions over time
mostly goes away. So perhaps this actually addresses your concern,
Masahiko.

I've been testing this with BenchmarkSQL [1], which has several
indexes that regularly need page deletions. There is also a realistic
"life cycle" to the data in these indexes. I added custom
instrumentation to display information about what's going on with page
deletion when the benchmark is run. I wrote a quick-and-dirty patch
that makes log_autovacuum show the same information that you see about
index page deletion when VACUUM VERBOSE is run (including the new
pages_newly_deleted field from my patch). With this particular
TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
on to place every page that it deletes in the FSM without leaving
anything to the next VACUUM. There are a very small number of
exceptions where we "only" manage to recycle maybe 95% of the pages
that were deleted.

The race condition that nbtree avoids by deferring recycling was
always a narrow one, outside of the extremes -- the way we defer has
always been overkill. It's almost always unnecessary to delay placing
deleted pages in the FSM until the *next* VACUUM. We only have to
delay it until the end of the *same* VACUUM -- why wait until the next
VACUUM if we don't have to? In general this deferring recycling
business has nothing to do with MVCC/GC/whatever, and yet the code
seems to suggest that it does. While it is convenient to use an XID
for page deletion and recycling as a way of implementing what Lanin &
Shasha call "the drain technique" [2], all we have to do is prevent
certain race conditions. This is all about the index itself, the data
structure, how it is maintained -- nothing more. It almost seems
obvious to me.

It's still possible to imagine extremes. Extremes that even the "try
to recycle pages we ourselves deleted when we reach the end of
btvacuumscan()" version of my patch cannot deal with. Maybe it really
is true that it's inherently impossible to recycle a deleted page even
at the end of a VACUUM -- maybe a long-running transaction (that could
in principle have a stale link to our deleted page) starts before we
VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
that happens, we're back to having the original problem: we're relying
on some *future* VACUUM operation to do that for us at some indefinite
point in the future. It's fair to wonder: What are the implications of
that? Are we not back to square one? Don't we have the same "what does
'eventually' really mean" problem once again?

I think that that's okay, because this remaining case is a *truly*
extreme case (especially with a large index, where index vacuuming
will naturally take a long time).

It will be rare. But more importantly, the fact that this scenario is now
an extreme case justifies treating it as an extreme case. We can teach
_bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
particular, I think that it will now be okay to increase the threshold
applied when considering deleted pages inside
_bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
patch. But in v4, which has the new recycling enhancement, I think
that it would be sensible to make it 5%, or maybe even 10%. This
naturally makes Masahiko's problem scenario unlikely to actually
result in a truly wasted call to btvacuumscan(). The number of pages
that the metapage indicates are "deleted but not yet placed in the
FSM" will be close to the theoretical minimum, because we're no longer
naively throwing away information about which specific pages will be
recyclable soon. Which is what the current approach does, really.
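
Expressed as code, the trigger being discussed amounts to something like
the following sketch (the 5% figure and the "deleted but not yet placed
in the FSM" count come from the text above; the names are illustrative
only):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* 0.025 (2.5%) in v3 of the patch; 0.05 (5%) with the recycling enhancement */
#define DELETED_PAGES_THRESHOLD 0.05

bool
deleted_pages_trigger_cleanup(BlockNumber num_index_pages,
                              BlockNumber prev_deleted_not_in_fsm)
{
    /*
     * Only force a cleanup-only index scan when the pages deleted by
     * previous VACUUMs (and not yet placed in the FSM) amount to a
     * noticeable fraction of the whole index.
     */
    return prev_deleted_not_in_fsm >
        (BlockNumber) (DELETED_PAGES_THRESHOLD * num_index_pages);
}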

[1] https://github.com/wieck/benchmarksql
[2] https://archive.org/stream/symmetricconcurr00lani#page/8/mode/2up
-- see "2.5 Freeing Empty Nodes"
--
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Sat, Feb 13, 2021 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It will be rare. But more importantly, the fact that this scenario is now
> an extreme case justifies treating it as an extreme case. We can teach
> _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
> particular, I think that it will now be okay to increase the threshold
> applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan().

Attached is v4, which has the "recycle pages that we ourselves deleted
during this same VACUUM operation" enhancement. It also doubles the
_bt_vacuum_needs_cleanup() threshold applied to deleted pages -- it
goes from 2.5% to 5%. The new patch in the series (v4-0002-*)
certainly needs more polishing. I'm posting what I have now because v3
has bitrot.

Benchmarking has shown that the enhancement in v4-0002-* can
significantly reduce the amount of index bloat in two of the
BenchmarkSQL/TPC-C indexes.

-- 
Peter Geoghegan

Attachment

Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I agree that there already are huge problems in that case. But I think
> > > we need to consider an append-only case as well; after bulk deletion
> > > on an append-only table, vacuum deletes heap tuples and index tuples,
> > > marking some index pages as dead and setting an XID into btpo.xact.
> > > Since we trigger autovacuums even by insertions based on
> > > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> > > the table again. But if there is a long-running query a "wasted"
> > > cleanup scan could happen many times depending on the values of
> > > autovacuum_vacuum_insert_scale_factor/threshold and
> > > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > > code. I agree this is DBA problem but it also means this could bring
> > > another new problem in a long-running query case.
> >
> > I see your point.
>
> My guess is that this concern of yours is somehow related to how we do
> deletion and recycling *in general*. Currently (and even in v3 of the
> patch), we assume that recycling the pages that a VACUUM operation
> deletes will happen "eventually". This kind of makes sense when you
> have "typical vacuuming" -- deletes/updates, and no big bursts, rare
> bulk deletes, etc.
>
> But when you do have a mixture of different triggering conditions,
> which is quite possible, it is difficult to understand what
> "eventually" actually means...
>
> > BTW, I am thinking about making recycling take place for pages that
> > were deleted during the same VACUUM. We can just use a
> > work_mem-limited array to remember a list of blocks that are deleted
> > but not yet recyclable (plus the XID found in the block).
>
> ...which brings me back to this idea.
>
> I've prototyped this. It works really well. In most cases the
> prototype makes VACUUM operations with nbtree index page deletions
> also recycle the pages that were deleted, at the end of the
> btvacuumscan(). We do very little or no "indefinite deferring" work
> here. This has obvious advantages, of course, but it also has a
> non-obvious advantage: the awkward question concerning "what
> eventually actually means" with mixed triggering conditions over time
> mostly goes away. So perhaps this actually addresses your concern,
> Masahiko.

Yes. I think this would simplify the problem by resolving almost all
of the problems related to indefinitely deferring page recycling.

We will be able to recycle almost all just-deleted pages in practice,
especially when btvacuumscan() takes a long time. And there would not
be a noticeable downside, I think.

BTW, if the btree index starts to use maintenance_work_mem for this
purpose, we also need to set amusemaintenanceworkmem to true, which is
considered during parallel vacuum.

>
> I've been testing this with BenchmarkSQL [1], which has several
> indexes that regularly need page deletions. There is also a realistic
> "life cycle" to the data in these indexes. I added custom
> instrumentation to display information about what's going on with page
> deletion when the benchmark is run. I wrote a quick-and-dirty patch
> that makes log_autovacuum show the same information that you see about
> index page deletion when VACUUM VERBOSE is run (including the new
> pages_newly_deleted field from my patch). With this particular
> TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
> on to place every page that it deletes in the FSM without leaving
> anything to the next VACUUM. There are a very small number of
> exceptions where we "only" manage to recycle maybe 95% of the pages
> that were deleted.

Great!

>
> The race condition that nbtree avoids by deferring recycling was
> always a narrow one, outside of the extremes -- the way we defer has
> always been overkill. It's almost always unnecessary to delay placing
> deleted pages in the FSM until the *next* VACUUM. We only have to
> delay it until the end of the *same* VACUUM -- why wait until the next
> VACUUM if we don't have to? In general this deferring recycling
> business has nothing to do with MVCC/GC/whatever, and yet the code
> seems to suggest that it does. While it is convenient to use an XID
> for page deletion and recycling as a way of implementing what Lanin &
> Shasha call "the drain technique" [2], all we have to do is prevent
> certain race conditions. This is all about the index itself, the data
> structure, how it is maintained -- nothing more. It almost seems
> obvious to me.

Agreed.

>
> It's still possible to imagine extremes. Extremes that even the "try
> to recycle pages we ourselves deleted when we reach the end of
> btvacuumscan()" version of my patch cannot deal with. Maybe it really
> is true that it's inherently impossible to recycle a deleted page even
> at the end of a VACUUM -- maybe a long-running transaction (that could
> in principle have a stale link to our deleted page) starts before we
> VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
> that happens, we're back to having the original problem: we're relying
> on some *future* VACUUM operation to do that for us at some indefinite
> point in the future. It's fair to wonder: What are the implications of
> that? Are we not back to square one? Don't we have the same "what does
> 'eventually' really mean" problem once again?
>
> I think that that's okay, because this remaining case is a *truly*
> extreme case (especially with a large index, where index vacuuming
> will naturally take a long time).

Right.

>
> It will be rare. But more importantly, the fact that this scenario is now
> an extreme case justifies treating it as an extreme case. We can teach
> _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
> particular, I think that it will now be okay to increase the threshold
> applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan(). The number of pages
> that the metapage indicates are "deleted but not yet placed in the
> FSM" will be close to the theoretical minimum, because we're no longer
> naively throwing away information about which specific pages will be
> recyclable soon. Which is what the current approach does, really.
>

Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen in
practice, having the threshold 5% or 10% seems to avoid the problem in
nearly 100% of cases, I think.

Another idea I came up with (maybe on top of your idea above) is to
change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages when the total amount of deleted
pages exceeds 2% of the index. That way, we surely can recycle more than
2% of the index when the XID becomes older than the global xmin.

Also, maybe we can record deleted pages in the FSM even without
deferring, and check them when re-using. That is, when we get a free
page from the FSM we check if the page is really recyclable (maybe
_bt_getbuf() already does this?). IOW, a deleted page can be recycled
only when it's requested to be reused. If btpo.xact is a 64-bit XID we
never need to worry about the case where a deleted page is never
requested to be reused.
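
As a rough standalone illustration of the kind of consumer-side check
being discussed (the typedefs and the visibility test are stand-ins; in
the real code this corresponds to the recyclability check that
_bt_getbuf() applies to pages it gets from the FSM):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SafeXid;              /* stand-in for a full 64-bit XID */

typedef struct
{
    bool    deleted;                   /* BTP_DELETED set on the page? */
    SafeXid safexid;                   /* XID stamped at deletion time */
} DeletedPageInfo;

/* stand-in for the "no scan can still reach this page" visibility test */
bool
safexid_old_enough(SafeXid safexid, SafeXid oldest_visible)
{
    return safexid < oldest_visible;
}

/*
 * Decide whether a block handed out by the FSM may really be reused,
 * rather than trusting the FSM unconditionally.
 */
bool
page_from_fsm_is_reusable(const DeletedPageInfo *page, SafeXid oldest_visible)
{
    if (!page->deleted)
        return false;                  /* not a deleted page: don't reuse it */
    return safexid_old_enough(page->safexid, oldest_visible);
}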

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yes. I think this would simplify the problem by resolving almost all
> of the problems related to indefinitely deferring page recycling.
>
> We will be able to recycle almost all just-deleted pages in practice,
> especially when btvacuumscan() takes a long time. And there would not
> be a noticeable downside, I think.

Great!

> BTW, if the btree index starts to use maintenance_work_mem for this
> purpose, we also need to set amusemaintenanceworkmem to true, which is
> considered during parallel vacuum.

I was just going to use work_mem. This should be okay. Note that
CREATE INDEX uses an additional work_mem allocation when building a
unique index, for the second spool/tuplesort. That seems like a
precedent that I can follow here.

Right now the BTPendingRecycle structs the patch uses to store
information about a page that the current VACUUM deleted (and may yet
be able to place in the FSM) are each 16 bytes (including alignment
overhead). I could probably make them smaller with a little work, but
even now that's quite small. Even with the default 4MiB work_mem
setting we can fit information about 262144 pages all at once. That's
2GiB worth of deleted index pages, which is generally much more than
we'll need.
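
Spelling out the arithmetic (assuming the 16-byte entry size mentioned
above, the default 4MiB work_mem, and an 8KiB block size):

#include <stdio.h>

int
main(void)
{
    const long work_mem_bytes = 4L * 1024 * 1024;  /* default work_mem: 4MiB */
    const long entry_bytes = 16;                   /* one BTPendingRecycle entry */
    const long block_bytes = 8L * 1024;            /* BLCKSZ */

    long max_entries = work_mem_bytes / entry_bytes;       /* 262144 */
    long index_bytes_covered = max_entries * block_bytes;  /* 2GiB */

    printf("%ld entries, covering %ld bytes of deleted index pages\n",
           max_entries, index_bytes_covered);
    return 0;
}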

> Yeah, increasing the threshold would solve the problem in most cases.
> Given that nbtree index page deletion is unlikely to happen in
> practice, having the threshold 5% or 10% seems to avoid the problem in
> nearly 100% of cases, I think.

Of course it all depends on workload/index characteristics, in the
end. It is very rare to delete a percentage of the index that exceeds
autovacuum_vacuum_scale_factor -- that's the important thing here IMV.

> Another idea I came up with (maybe on top of your idea above) is to
> change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
> btpo.xact XID among all deleted pages when the total amount of deleted
> pages exceeds 2% of the index. That way, we surely can recycle more than
> 2% of the index when the XID becomes older than the global xmin.

You could make my basic approach to recycling deleted pages earlier
(ideally at the end of the same btvacuumscan() that deleted the pages
in the first place) more sophisticated in a variety of ways. These are
all subject to diminishing returns, though.

I've already managed to recycle close to 100% of all B-Tree pages
during the same VACUUM with a very simple approach -- at least if we
assume BenchmarkSQL is representative. It is hard to know how much
more effort can be justified. To be clear, I'm not saying that an
improved design cannot be justified now or in the future (BenchmarkSQL
is not like many workloads that people use Postgres for). I'm just
saying that I *don't know* where to draw the line. Any particular
place that we draw the line feels a little arbitrary to me. This
includes my own choice of the work_mem-limited BTPendingRecycle array.
My patch currently works that way because it's simple -- no other
reason.

Any scheme to further improve the "work_mem-limited BTPendingRecycle
array" design from my patch boils down to this: A new approach that
makes recycling of any remaining deleted pages take place "before too
long": After the end of the btvacuumscan() BTPendingRecycle array
stuff (presumably that didn't work out in cases where an improved
approach matters), but before the next VACUUM takes place (since that
will do the required recycling anyway, unless it's unable to do any
work at all, in which case it hardly matters). Here are two ideas of
my own in this same class as your idea:

1. Remember to do some of the BTPendingRecycle array FSM processing
stuff in btvacuumcleanup() -- defer some of the recycling of pages
recorded in BTPendingRecycle entries (pages deleted during
btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.

Right now btvacuumcleanup() will always do nothing when btbulkdelete()
was called earlier. But that's just a current nbtree convention, and
is no reason to not do this (we don't have to scan the index again at
all). The advantage of teaching btvacuumcleanup() to do this is that
it delays the "BTPendingRecycle array FSM processing" stuff until the
last moment that it is still easy to use the in-memory array (because
we haven't freed it yet). In general, doing it later makes it more
likely that we'll successfully recycle the pages. Though in general it
might not make any difference -- so we're still hoping that the
workload allows us to recycle everything we deleted, without making
the design much more complicated than what I posted already.

(BTW I see that you reviewed commit 4e514c61, so you must have thought
about the trade-off between doing deferred recycling in
amvacuumcleanup() vs ambulkdelete(), when to call
IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
implement this idea while calling IndexFreeSpaceMapVacuum() during
both btvacuumcleanup() and btbulkdelete(), so that we get the best of
both worlds -- fast recycling *and* more delayed processing that is
more likely to ultimately succeed.)

2. Remember/serialize the BTPendingRecycle array when we realize that
we cannot put all recyclable pages in the FSM at the end of the
current btvacuumscan(), and then use an autovacuum work item to
process them before too long -- a call to AutoVacuumRequestWork()
could even serialize the data on disk.

Idea 2 has the advantage of allowing retries -- eventually it will be
safe to recycle the pages, if we just wait long enough.

Anyway, I'm probably not going to pursue either of the 2 ideas for
Postgres 14. I'm mentioning these ideas now because the trade-offs
show that there is no perfect design for this deferring recycling
stuff. Whatever we do, we should accept that there is no perfect
design.

Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions now, which idea 1
touches on. Ideally all of the details around the index AM VACUUM APIs
(i.e. when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.

> Also, maybe we can record deleted pages to FSM even without deferring
> and check it when re-using. That is, when we get a free page from FSM
> we check if the page is really recyclable (maybe _bt_getbuf() already
> does this?). IOW, a deleted page can be recycled only when it's
> requested to be reused. If btpo.xact is 64-bit XID we never need to
> worry about the case where a deleted page never be requested to be
> reused.

I've thought about that too (both now and in the past). You're right
about _bt_getbuf() -- it checks the XID, at least on the master
branch. I took that XID check out in v4 of the patch, but I am now
starting to have my doubts about that particular choice. (I'm probably
going to restore the XID check in _bt_getbuf in v5 of the patch.)

I took the XID-is-recyclable check out in v4 of the patch because it
might leak pages in rare cases -- which is not a new problem.
_bt_getbuf() currently has a remarkably relaxed attitude about leaking
pages from the FSM (it is more relaxed about it than I am, certainly)
-- but why should we just accept leaking pages like that? My new
doubts about it are non-specific, though. We know that the FSM isn't
crash safe -- but I think that that reduces to "practically speaking,
we can never 100% trust the FSM". Which makes me nervous. I worry that
the FSM can do something completely evil and crazy in rare cases.

It's not just crash safety. The FSM's fsm_search_avail() function
currently changes the fp_next_slot field with only a shared buffer
lock held. It's an int, which is supposed to "be atomic on most
platforms". But we should be using real atomic ops. So the FSM is
generally...kind of wonky.
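
"Real atomic ops" here could look something like this sketch, which uses
C11 atomics purely as a stand-in for the pg_atomic_* facilities (the
real fsm_search_avail() currently does a plain int store under a shared
buffer lock):

#include <stdatomic.h>

typedef struct
{
    _Atomic int fp_next_slot;          /* search hint: next slot to hand out */
} FSMPageHint;

void
advance_next_slot(FSMPageHint *hint, int slot_just_returned)
{
    /* a relaxed atomic store is enough for a pure search hint */
    atomic_store_explicit(&hint->fp_next_slot, slot_just_returned + 1,
                          memory_order_relaxed);
}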

In an ideal world, nbtree page deletion + recycling would have crash
safety built in. I don't think that it makes sense to have free
space management without crash safety in the case of index AMs,
because it's just not worth it with whole-page units of free space
(heapam is another story). A 100% crash-safe design would naturally
shift the problem of nbtree page recycle safety from the
producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree
would be a real improvement. But these long standing FSM issues are
not going to change for Postgres 14. And so changing _bt_getbuf() to
do clever things with XIDs won't be possible for Postgres 14 IMV.

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, there is one more reason why I bring up idea 1 now: I want
> to hear your thoughts on the index AM API questions now, which idea 1
> touches on. Ideally all of the details around the index AM VACUUM APIs
> (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> btbulkdelete()) won't need to change much in the future. I worry about
> getting this index AM API stuff right, at least a little.

Speaking of problems like this, I think I spotted an old one: we call
_bt_update_meta_cleanup_info() in either btbulkdelete() or
btvacuumcleanup(). I think that we should always call it in
btvacuumcleanup(), though -- even in cases where there is no call to
btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
happened earlier instead, during the btbulkdelete() call).

This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
we store in the metapage) much more accurate -- right now it's always
pg_class.reltuples from *before* the VACUUM started. And so the
btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often
kind of inaccurate.

This "estimate during ambulkdelete" issue is documented here (kind of):

/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 *
 * num_heap_tuples is accurate only when estimated_count is false;
 * otherwise it's just an estimate (currently, the estimate is the
 * prior value of the relation's pg_class.reltuples field, so it could
 * even be -1).  It will always just be an estimate during ambulkdelete.
 */
typedef struct IndexVacuumInfo
{
    ...
}

The name of the metapage field is already
btm_last_cleanup_num_heap_tuples, which already suggests the approach
that I propose now. So why don't we do it like that already?

(Thinks some more...)

I wonder: did this detail change at the last minute during the
development of the feature (just before commit 857f9c36) back in early
2018? That change would have made it easier to deal with
oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
the patch -- so maybe it's truly an accident that the code doesn't
work the way that I suggest it should already. (It's annoying to make
state from btbulkdelete() appear in btvacuumcleanup(), unless it's
from IndexVacuumInfo or something -- I can imagine this changing at
the last minute, just for that reason.)

Do you think that this needs to be treated as a bug in the
backbranches, Masahiko? I'm not sure...

In any case we should probably make this change as part of Postgres
14. Don't you think? It's certainly easy to do it this way now, since
there will be no need to keep around an oldestBtpoXact value until
btvacuumcleanup() (in the common case where btbulkdelete() is where we
call btvacuumscan()). The new btm_last_cleanup_num_delpages field
(which replaces btm_oldest_btpo_xact) has a value that just comes from
the bulk stats, which is easy anyway.

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Tue, Feb 16, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Actually, there is one more reason why I bring up idea 1 now: I want
> > to hear your thoughts on the index AM API questions now, which idea 1
> > touches on. Ideally all of the details around the index AM VACUUM APIs
> > (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> > btbulkdelete()) won't need to change much in the future. I worry about
> > getting this index AM API stuff right, at least a little.
>
> Speaking of problems like this, I think I spotted an old one: we call
> _bt_update_meta_cleanup_info() in either btbulkdelete() or
> btvacuumcleanup(). I think that we should always call it in
> btvacuumcleanup(), though -- even in cases where there is no call to
> btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
> happened earlier instead, during the btbulkdelete() call).
>
> This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
> we store in the metapage) much more accurate -- right now it's always
> pg_class.reltuples from *before* the VACUUM started. And so the
> btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often
> kind of inaccurate.
>
> This "estimate during ambulkdelete" issue is documented here (kind of):
>
> /*
>  * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
>  *
>  * num_heap_tuples is accurate only when estimated_count is false;
>  * otherwise it's just an estimate (currently, the estimate is the
>  * prior value of the relation's pg_class.reltuples field, so it could
>  * even be -1).  It will always just be an estimate during ambulkdelete.
>  */
> typedef struct IndexVacuumInfo
> {
>     ...
> }
>
> The name of the metapage field is already
> btm_last_cleanup_num_heap_tuples, which already suggests the approach
> that I propose now. So why don't we do it like that already?
>
> (Thinks some more...)
>
> I wonder: did this detail change at the last minute during the
> development of the feature (just before commit 857f9c36) back in early
> 2018? That change would have made it easier to deal with
> oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
> the patch -- so maybe it's truly an accident that the code doesn't
> work the way that I suggest it should already. (It's annoying to make
> state from btbulkdelete() appear in btvacuumcleanup(), unless it's
> from IndexVacuumInfo or something -- I can imagine this changing at
> the last minute, just for that reason.)
>
> Do you think that this needs to be treated as a bug in the
> backbranches, Masahiko? I'm not sure...

Ugh, yes, I think it's a bug.

When developing this feature, in an old version patch, we used to set
invalid values to both btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these
values. But we decided to set valid values to both even in
btbulkdelete(). I believe that decision was correct in terms of
btm_oldest_btpo_xact because with the old version patch we will do an
unnecessary index scan during btvacuumcleanup(). But it’s wrong in
terms of btm_last_cleanup_num_heap_tuples, as you pointed out.

This bug would make the check of vacuum_cleanup_index_scale_factor
untrustworthy. So I think it’s better to backpatch, but I think we need to
note that to fix this issue properly, in a case where a vacuum called
btbulkdelete() earlier, probably we should update only
btm_oldest_btpo_xact in btbulkdelete() and then update
btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). In this case,
we don’t know the oldest btpo.xact among the deleted pages in
btvacuumcleanup(). This means that we would need to update the meta
page twice, leading to WAL logging twice. Since we already could
update the meta page more than once when a vacuum calls btbulkdelete()
multiple times I think it would not be a problem, though.

>
> In any case we should probably make this change as part of Postgres
> 14. Don't you think? It's certainly easy to do it this way now, since
> there will be no need to keep around an oldestBtpoXact value until
> btvacuumcleanup() (in the common case where btbulkdelete() is where we
> call btvacuumscan()). The new btm_last_cleanup_num_delpages field
> (which replaces btm_oldest_btpo_xact) has a value that just comes from
> the bulk stats, which is easy anyway.

Agreed.

As I mentioned above, we might need to consider how btbulkdelete() can
tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a
vacuum called btbulkdelete earlier. During parallel vacuum, two
different processes could do btbulkdelete() and btvacuumcleanup()
respectively. Updating those values separately in those callbacks
would be straightforward.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Tue, Feb 16, 2021 at 4:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Ugh, yes, I think it's a bug.

I was actually thinking of a similar bug in nbtree deduplication when
I spotted this one -- see commit 48e12913. The index AM API stuff is
tricky.

> When developing this feature, in an old version patch, we used to set
> invalid values to both btm_oldest_btpo_xact and
> btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these
> values. But we decided to set valid values to both even in
> btbulkdelete(). I believe that decision was correct in terms of
> btm_oldest_btpo_xact because with the old version patch we will do an
> unnecessary index scan during btvacuumcleanup(). But it’s wrong in
> terms of btm_last_cleanup_num_heap_tuples, as you pointed out.

Right.

> This bug would make the check of vacuum_cleanup_index_scale_factor
> untrustworthy. So I think it’s better to backpatch, but I think we need to
> note that to fix this issue properly, in a case where a vacuum called
> btbulkdelete() earlier, probably we should update only
> btm_oldest_btpo_xact in btbulkdelete() and then update
> btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). In this case,
> we don’t know the oldest btpo.xact among the deleted pages in
> btvacuumcleanup(). This means that we would need to update the meta
> page twice, leading to WAL logging twice. Since we already could
> update the meta page more than once when a vacuum calls btbulkdelete()
> multiple times I think it would not be a problem, though.

I agree that that approach is fine. Realistically, we won't even have
to update the metapage twice in most cases. Because most indexes never
have even one page deletion anyway.

> As I mentioned above, we might need to consider how btbulkdelete() can
> tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a
> vacuum called btbulkdelete earlier. During parallel vacuum, two
> different processes could do btbulkdelete() and btvacuumcleanup()
> respectively. Updating those values separately in those callbacks
> would be straightforward.

I don't see why it should be a problem for my patch/Postgres 14,
because we don't have the same btpo.xact/oldestBtpoXact issue that the
original Postgres 11 commit dealt with. The patch determines a value
for btm_last_cleanup_num_delpages (which I call
pages_deleted_not_free) by subtracting fields from the bulk delete
stats: we just use "stats->pages_deleted - stats->pages_free".
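
In code form, the calculation is just the following (pages_deleted and
pages_free are the real IndexBulkDeleteResult field names; the wrapper
struct and function here are only for illustration):

#include <stdint.h>

typedef uint32_t BlockNumber;

/* the two IndexBulkDeleteResult fields that matter here */
typedef struct
{
    BlockNumber pages_deleted;         /* all pages marked deleted */
    BlockNumber pages_free;            /* deleted pages already in the FSM */
} BulkDeleteCounts;

/* the value stored in the metapage as btm_last_cleanup_num_delpages */
BlockNumber
pages_deleted_not_free(const BulkDeleteCounts *stats)
{
    return stats->pages_deleted - stats->pages_free;
}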

Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
entitled to rely on the bulk delete stats being set in the way I've
described? I assumed that that was okay in general, but I haven't
tested parallel VACUUM specifically. Will parallel VACUUM really fail
to ensure that values in bulk stats fields (like pages_deleted and
pages_free) get set correctly for amvacuumcleanup() callbacks?

--
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
> entitled to rely on the bulk delete stats being set in the way I've
> described? I assumed that that was okay in general, but I haven't
> tested parallel VACUUM specifically. Will parallel VACUUM really fail
> to ensure that values in bulk stats fields (like pages_deleted and
> pages_free) get set correctly for amvacuumcleanup() callbacks?

I tested the pages_deleted_not_free stuff with a version of my patch
that consistently calls _bt_update_meta_cleanup_info() during
btvacuumcleanup(), and never during btbulkdelete(). And it works just
fine -- including with parallel VACUUM.

Evidently my understanding of what btvacuumcleanup() (or any other
amvacuumcleanup() routine) can expect from bulk delete stats was
correct. It doesn't matter whether or not parallel VACUUM happens to
be involved -- it works just as well.

This is good news, since of course it means that it's okay to stick to
the simple approach of calculating pages_deleted_not_free. Passing
pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to
_bt_update_meta_cleanup_info() during btvacuumcleanup() works just as
well when combined with my fix for the
"IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()"
bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug
creates no new problems for my patch. There is still no need to think
about when or how the relevant bulk delete fields (pages_deleted and
pages_free) were set. And it doesn't matter whether or not parallel
VACUUM is involved.

(Of course it's also true that we can't do that on the backbranches.
Purely because we must worry about btpo.xact/oldestBtpoXact on the
backbranches. We'll probably have to teach the code in released
versions to set btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in separate calls -- since there is
no easy way to "send" the oldestBtpoXact value determined during a
btbulkdelete() to a later corresponding btvacuumcleanup(). That's a
bit of a kludge, but I'm not worried about it.)

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Wed, Feb 17, 2021 at 5:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
> > entitled to rely on the bulk delete stats being set in the way I've
> > described? I assumed that that was okay in general, but I haven't
> > tested parallel VACUUM specifically. Will parallel VACUUM really fail
> > to ensure that values in bulk stats fields (like pages_deleted and
> > pages_free) get set correctly for amvacuumcleanup() callbacks?
>
> I tested the pages_deleted_not_free stuff with a version of my patch
> that consistently calls _bt_update_meta_cleanup_info() during
> btvacuumcleanup(), and never during btbulkdelete(). And it works just
> fine -- including with parallel VACUUM.
>
> Evidently my understanding of what btvacuumcleanup() (or any other
> amvacuumcleanup() routine) can expect from bulk delete stats was
> correct. It doesn't matter whether or not parallel VACUUM happens to
> be involved -- it works just as well.

Yes, you're right. I missed that pages_deleted_not_free is calculated
by (stats->pages_deleted - stats->pages_free) where both are in
IndexBulkDeleteResult.

>
> This is good news, since of course it means that it's okay to stick to
> the simple approach of calculating pages_deleted_not_free. Passing
> pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to
> _bt_update_meta_cleanup_info() during btvacuumcleanup() works just as
> well when combined with my fix for the
> "IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()"
> bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug
> creates no new problems for my patch. There is still no need to think
> about when or how the relevant bulk delete fields (pages_deleted and
> pages_free) were set. And it doesn't matter whether or not parallel
> VACUUM is involved.

Agreed.

>
> (Of course it's also true that we can't do that on the backbranches.
> Purely because we must worry about btpo.xact/oldestBtpoXact on the
> backbranches. We'll probably have to teach the code in released
> versions to set btm_oldest_btpo_xact and
> btm_last_cleanup_num_heap_tuples in separate calls -- since there is
> no easy way to "send" the oldestBtpoXact value determined during a
> btbulkdelete() to a later corresponding btvacuumcleanup(). That's a
> bit of a kludge, but I'm not worried about it.)

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Tue, Feb 16, 2021 at 12:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Yes. I think this would simplify the problem by resolving almost all
> > of the problems related to indefinitely deferring page recycling.
> >
> > We will be able to recycle almost all just-deleted pages in practice,
> > especially when btvacuumscan() takes a long time. And there would not
> > be a noticeable downside, I think.
>
> Great!
>
> > BTW, if the btree index starts to use maintenance_work_mem for this
> > purpose, we also need to set amusemaintenanceworkmem to true, which is
> > considered during parallel vacuum.
>
> I was just going to use work_mem. This should be okay. Note that
> CREATE INDEX uses an additional work_mem allocation when building a
> unique index, for the second spool/tuplesort. That seems like a
> precedent that I can follow here.
>
> Right now the BTPendingRecycle structs the patch uses to store
> information about a page that the current VACUUM deleted (and may yet
> be able to place in the FSM) are each 16 bytes (including alignment
> overhead). I could probably make them smaller with a little work, but
> even now that's quite small. Even with the default 4MiB work_mem
> setting we can fit information about 262144 pages all at once. That's
> 2GiB worth of deleted index pages, which is generally much more than
> we'll need.

Cool.

>
> > Yeah, increasing the threshold would solve the problem in most cases.
> > Given that nbtree index page deletion is unlikely to happen in
> > practice, having the threshold 5% or 10% seems to avoid the problem in
> > nearly 100% of cases, I think.
>
> Of course it all depends on workload/index characteristics, in the
> end. It is very rare to delete a percentage of the index that exceeds
> autovacuum_vacuum_scale_factor -- that's the important thing here IMV.
>
> > Another idea I came up with (maybe on top of your idea above) is to
> > change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
> > btpo.xact XID among all deleted pages when the total amount of deleted
> > pages exceeds 2% of the index. That way, we surely can recycle more than
> > 2% of the index when the XID becomes older than the global xmin.
>
> You could make my basic approach to recycling deleted pages earlier
> (ideally at the end of the same btvacuumscan() that deleted the pages
> in the first place) more sophisticated in a variety of ways. These are
> all subject to diminishing returns, though.
>
> I've already managed to recycle close to 100% of all B-Tree pages
> during the same VACUUM with a very simple approach -- at least if we
> assume BenchmarkSQL is representative. It is hard to know how much
> more effort can be justified. To be clear, I'm not saying that an
> improved design cannot be justified now or in the future (BenchmarkSQL
> is not like many workloads that people use Postgres for). I'm just
> saying that I *don't know* where to draw the line. Any particular
> place that we draw the line feels a little arbitrary to me. This
> includes my own choice of the work_mem-limited BTPendingRecycle array.
> My patch currently works that way because it's simple -- no other
> reason.
>
> Any scheme to further improve the "work_mem-limited BTPendingRecycle
> array" design from my patch boils down to this: A new approach that
> makes recycling of any remaining deleted pages take place "before too
> long": After the end of the btvacuumscan() BTPendingRecycle array
> stuff (presumably that didn't work out in cases where an improved
> approach matters), but before the next VACUUM takes place (since that
> will do the required recycling anyway, unless it's unable to do any
> work at all, in which case it hardly matters).

I agreed with this direction.

> Here are two ideas of
> my own in this same class as your idea:
>
> 1. Remember to do some of the BTPendingRecycle array FSM processing
> stuff in btvacuumcleanup() -- defer some of the recycling of pages
> recorded in BTPendingRecycle entries (pages deleted during
> btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.
>
> Right now btvacuumcleanup() will always do nothing when btbulkdelete()
> was called earlier. But that's just a current nbtree convention, and
> is no reason to not do this (we don't have to scan the index again at
> all). The advantage of teaching btvacuumcleanup() to do this is that
> it delays the "BTPendingRecycle array FSM processing" stuff until the
> last moment that it is still easy to use the in-memory array (because
> we haven't freed it yet). In general, doing it later makes it more
> likely that we'll successfully recycle the pages. Though in general it
> might not make any difference -- so we're still hoping that the
> workload allows us to recycle everything we deleted, without making
> the design much more complicated than what I posted already.
>
> (BTW I see that you reviewed commit 4e514c61, so you must have thought
> about the trade-off between doing deferred recycling in
> amvacuumcleanup() vs ambulkdelete(), when to call
> IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
> implement this idea while calling IndexFreeSpaceMapVacuum() during
> both btvacuumcleanup() and btbulkdelete(), so that we get the best of
> both worlds -- fast recycling *and* more delayed processing that is
> more likely to ultimately succeed.)

I think this idea 1 also needs to serialize the BTPendingRecycle array
somewhere to pass it to a parallel vacuum worker in the parallel vacuum
case.

Delaying the "BTPendingRecycle array FSM processing" stuff until
btvacuumcleanup() is a good idea. But I think it's a relatively rare
case in practice where index vacuum runs more than once (i.e., after
using up maintenance_work_mem). So considering the development cost of
serializing the BTPendingRecycle array and the index AM API changes,
attempting to recycle the deleted pages at the end of btvacuumscan()
would be a balanced strategy.

>
> 2. Remember/serialize the BTPendingRecycle array when we realize that
> we cannot put all recyclable pages in the FSM at the end of the
> current btvacuumscan(), and then use an autovacuum work item to
> process them before too long -- a call to AutoVacuumRequestWork()
> could even serialize the data on disk.
>
> Idea 2 has the advantage of allowing retries -- eventually it will be
> safe to recycle the pages, if we just wait long enough.

This is a good idea too. Perhaps autovacuum needs to end up with an
error so that it retries later in the case where it could not recycle
all deleted pages.

I have also thought about the idea of storing pending-recycle pages
somewhere to avoid an index scan when we do the XID-is-recyclable check.
My idea was to store them in btree pages dedicated to this purpose,
linked from the meta page, but I prefer your idea.

>
> Anyway, I'm probably not going to pursue either of the 2 ideas for
> Postgres 14. I'm mentioning these ideas now because the trade-offs
> show that there is no perfect design for this deferring recycling
> stuff. Whatever we do, we should accept that there is no perfect
> design.
>
> Actually, there is one more reason why I bring up idea 1 now: I want
> to hear your thoughts on the index AM API questions now, which idea 1
> touches on. Ideally all of the details around the index AM VACUUM APIs
> (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> btbulkdelete()) won't need to change much in the future. I worry about
> getting this index AM API stuff right, at least a little.

After introducing parallel vacuum, index AMs are not able to pass
arbitrary information taken in ambulkdelete() to amvacuumcleanup(), like
the old gist index code did. If there is a good use case that needs to
pass arbitrary information to amvacuumcleanup(), I think it'd be a good
idea to add an index AM API so that parallel vacuum serializes it and
tells another parallel vacuum worker. But, as I mentioned above, given
that vacuum calls ambulkdelete() only once in most cases, and that I
think we’d like to improve how we store TIDs in maintenance_work_mem
space (discussed a little on thread [1]), delaying "the BTPendingRecycle
array FSM processing stuff in btvacuumcleanup()" would not be a good
use case.

>
> > Also, maybe we can record deleted pages in the FSM even without
> > deferring, and check them when re-using. That is, when we get a free
> > page from the FSM we check if the page is really recyclable (maybe
> > _bt_getbuf() already does this?). IOW, a deleted page can be recycled
> > only when it's requested to be reused. If btpo.xact is a 64-bit XID we
> > never need to worry about the case where a deleted page is never
> > requested to be reused.
>
> I've thought about that too (both now and in the past). You're right
> about _bt_getbuf() -- it checks the XID, at least on the master
> branch. I took that XID check out in v4 of the patch, but I am now
> starting to have my doubts about that particular choice. (I'm probably
> going to restore the XID check in _bt_getbuf in v5 of the patch.)
>
> I took the XID-is-recyclable check out in v4 of the patch because it
> might leak pages in rare cases -- which is not a new problem.
> _bt_getbuf() currently has a remarkably relaxed attitude about leaking
> pages from the FSM (it is more relaxed about it than I am, certainly)
> -- but why should we just accept leaking pages like that? My new
> doubts about it are non-specific, though. We know that the FSM isn't
> crash safe -- but I think that that reduces to "practically speaking,
> we can never 100% trust the FSM". Which makes me nervous. I worry that
> the FSM can do something completely evil and crazy in rare cases.
>
> It's not just crash safety. The FSM's fsm_search_avail() function
> currently changes the fp_next_slot field with only a shared buffer
> lock held. It's an int, which is supposed to "be atomic on most
> platforms". But we should be using real atomic ops. So the FSM is
> generally...kind of wonky.
>
> In an ideal world, nbtree page deletion + recycling would have crash
> safety built in. I don't think that it makes sense to have free
> space management without crash safety in the case of index AMs,
> because it's just not worth it with whole-page units of free space
> (heapam is another story). A 100% crash-safe design would naturally
> shift the problem of nbtree page recycle safety from the
> producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree
> would be a real improvement. But these long standing FSM issues are
> not going to change for Postgres 14. And so changing _bt_getbuf() to
> do clever things with XIDs won't be possible for Postgres 14 IMV.

Agreed. Thanks for your explanation.

Regards,

[1] https://www.postgresql.org/message-id/flat/CA%2Bfd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed. Thanks for your explanation.

Attached is v5, which has some of the changes I talked about. Changes
from v4 include:

* Now only updates metapage during btvacuumcleanup() in the first
patch, which is enough to fix the existing
IndexVacuumInfo.num_heap_tuples issue.

* Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia.

* The second patch in the series now respects work_mem when sizing the
BTPendingRecycle array.

* New enhancement to the XID GlobalVisCheckRemovableFullXid() test
used in the second patch, to allow it to recycle even more pages.
(Still unsure of some of the details here.)

I would like to commit the first patch in a few days -- I refer to the
big patch that makes deleted page XIDs 64-bit/full. Can you take a
look at that one, Masahiko? That would be helpful. I can produce a bug
fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I
think that that should be written after the first patch is finalized
and committed.

The second patch (the new recycling optimization) will require more
work and testing.

Thanks!
-- 
Peter Geoghegan

Attachment

Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Fri, Feb 19, 2021 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Agreed. Thanks for your explanation.
>
> Attached is v5, which has some of the changes I talked about. Changes
> from v4 include:
>
> * Now only updates metapage during btvacuumcleanup() in the first
> patch, which is enough to fix the existing
> IndexVacuumInfo.num_heap_tuples issue.
>
> * Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia.
>
> * The second patch in the series now respects work_mem when sizing the
> BTPendingRecycle array.
>
> * New enhancement to the XID GlobalVisCheckRemovableFullXid() test
> used in the second patch, to allow it to recycle even more pages.
> (Still unsure of some of the details here.)

Thank you for updating the patch!

>
> I would like to commit the first patch in a few days -- I refer to the
> big patch that makes deleted page XIDs 64-bit/full. Can you take a
> look at that one, Masahiko? That would be helpful. I can produce a bug
> fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I
> think that that should be written after the first patch is finalized
> and committed.

I'll look at the first patch first.

>
> The second patch (the new recycling optimization) will require more
> work and testing.

After that, I'll also look at the remaining patches.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Here is another bitrot-fix-only revision, v9. Just the recycling patch again.

I committed the final nbtree page deletion patch just now -- the one
that attempts to make recycling happen for newly deleted pages. Thanks
for all your work on patch review, Masahiko!

I'll close out the CF item for this patch series now.

-- 
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Mon, Mar 22, 2021 at 7:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Here is another bitrot-fix-only revision, v9. Just the recycling patch again.
>
> I committed the final nbtree page deletion patch just now -- the one
> that attempts to make recycling happen for newly deleted pages. Thanks
> for all your work on patch review, Masahiko!

You're welcome! Those are really good improvements.

With this patch series, btree indexes have become like hash indexes in
terms of amvacuumcleanup: we do an index scan in btvacuumcleanup() in
two cases, namely when the metapage needs upgrading and when more than
5% of the index's pages are deleted but not yet recycled. Both cases
seem rare. So do we want to disable parallel index cleanup for btree
indexes, as we do for hash indexes? That is, remove
VACUUM_OPTION_PARALLEL_COND_CLEANUP from amparallelvacuumoptions. IMO
we can live with the current configuration, just in case users run
into such rare situations (especially the latter case). In most cases,
parallel vacuum workers launched for index cleanup would exit as
no-ops, but the side effects (wasted resources, overhead, etc.) would
not be big. If we want to enable it only in particular cases, we would
need another way for the index AM to tell lazy vacuum whether or not
to allow a parallel worker to process the index at that time. What do
you think?
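
To make the two cases concrete, the decision boils down to something
like the sketch below. This is just my understanding of the patch
series -- the field name btm_last_cleanup_num_delpages and the exact 5%
threshold may differ in the final code:

#include "postgres.h"
#include "access/nbtree.h"    /* BTMetaPageData, BTREE_NOVAC_VERSION */
#include "storage/bufmgr.h"   /* RelationGetNumberOfBlocks() */

/*
 * Sketch only: does btvacuumcleanup() need a full index scan?  Only when
 * the metapage is from an older on-disk version, or when the deleted but
 * not-yet-recycled pages recorded by the previous VACUUM exceed ~5% of
 * the index.
 */
static bool
cleanup_needs_index_scan_sketch(Relation rel, BTMetaPageData *metad)
{
    /* Case 1: metapage from an older version must be upgraded */
    if (metad->btm_version < BTREE_NOVAC_VERSION)
        return true;

    /* Case 2: deleted pages awaiting recycling exceed 5% of the index */
    if (metad->btm_last_cleanup_num_delpages >
        RelationGetNumberOfBlocks(rel) / 20)
        return true;

    return false;
}

If that test almost never fires, a conditional parallel cleanup worker
for a btree index is almost always a no-op, which is what makes
dropping VACUUM_OPTION_PARALLEL_COND_CLEANUP tempting.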

I'm not sure we need any changes, but I think it's worth discussing here.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: 64-bit XIDs in deleted nbtree pages

From
Peter Geoghegan
Date:
On Tue, Mar 23, 2021 at 8:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With this patch series, btree indexes have become like hash indexes in
> terms of amvacuumcleanup: we do an index scan in btvacuumcleanup() in
> two cases, namely when the metapage needs upgrading and when more than
> 5% of the index's pages are deleted but not yet recycled. Both cases
> seem rare. So do we want to disable parallel index cleanup for btree
> indexes, as we do for hash indexes? That is, remove
> VACUUM_OPTION_PARALLEL_COND_CLEANUP from amparallelvacuumoptions.

My recent "Recycle nbtree pages deleted during same VACUUM" commit
improved the efficiency of recycling, but I still think that it was a
bit of a hack. Or at least it didn't go far enough in fixing the old
design, which is itself a bit of a hack.

As I said back on February 15, a truly good design for nbtree page
deletion + recycling would have crash safety built in. If page
deletion itself is crash safe, it really makes sense to make
everything crash safe (especially because we're managing large chunks
of equisized free space, unlike in heapam). And as I also said back
then, a 100% crash-safe design could naturally shift the problem of
nbtree page recycle safety from the producer/VACUUM side, to the
consumer/_bt_getbuf() side. It should be completely separated from
when VACUUM runs, and what VACUUM can discover about recycle safety in
passing, at the end.

That approach would completely eliminate the need to do any work in
btvacuumcleanup(), which would make it natural to remove
VACUUM_OPTION_PARALLEL_COND_CLEANUP from nbtree -- the implementation
of btvacuumcleanup() would just look like hashvacuumcleanup() does now
-- it could do practically nothing, making this 100% okay.
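
To be concrete about what I mean: under that hypothetical design,
btvacuumcleanup() could collapse to roughly the shape that
hashvacuumcleanup() has today. A sketch, not something I'm proposing to
commit:

#include "postgres.h"
#include "access/genam.h"     /* IndexVacuumInfo, IndexBulkDeleteResult */
#include "storage/bufmgr.h"   /* RelationGetNumberOfBlocks() */

/*
 * Sketch only: a btvacuumcleanup() with nothing left to do, because
 * recycle safety is handled entirely on the _bt_getbuf() side.
 */
static IndexBulkDeleteResult *
btvacuumcleanup_sketch(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /* No bulkdelete pass ran (e.g. an analyze_only call): nothing to report */
    if (stats == NULL)
        return NULL;

    /* Refresh the page count; everything else happened during bulkdelete */
    stats->num_pages = RelationGetNumberOfBlocks(info->index);

    return stats;
}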

For now I have my doubts that it is appropriate to make this change.
It seems as if the question of whether or not
VACUUM_OPTION_PARALLEL_COND_CLEANUP should be used is basically the
same question as "Does the vacuumcleanup() callback for this index AM
look exactly like hashvacuumcleanup()?".

> IMO we can live with the current configuration, just in case users run
> into such rare situations (especially the latter case). In most cases,
> parallel vacuum workers launched for index cleanup would exit as
> no-ops, but the side effects (wasted resources, overhead, etc.) would
> not be big. If we want to enable it only in particular cases, we would
> need another way for the index AM to tell lazy vacuum whether or not
> to allow a parallel worker to process the index at that time. What do
> you think?

I am concerned about unintended consequences, like never noticing that
we should really recycle known deleted pages not yet placed in the FSM
(it's hard to think through very rare cases like this with
confidence). Is it really so bad if we launch parallel workers that we
don't really need for a parallel VACUUM?

--
Peter Geoghegan



Re: 64-bit XIDs in deleted nbtree pages

From
Masahiko Sawada
Date:
On Wed, Mar 24, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Mar 23, 2021 at 8:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > With this patch series, btree indexes have become like hash indexes in
> > terms of amvacuumcleanup: we do an index scan in btvacuumcleanup() in
> > two cases, namely when the metapage needs upgrading and when more than
> > 5% of the index's pages are deleted but not yet recycled. Both cases
> > seem rare. So do we want to disable parallel index cleanup for btree
> > indexes, as we do for hash indexes? That is, remove
> > VACUUM_OPTION_PARALLEL_COND_CLEANUP from amparallelvacuumoptions.
>
> My recent "Recycle nbtree pages deleted during same VACUUM" commit
> improved the efficiency of recycling, but I still think that it was a
> bit of a hack. Or at least it didn't go far enough in fixing the old
> design, which is itself a bit of a hack.
>
> As I said back on February 15, a truly good design for nbtree page
> deletion + recycling would have crash safety built in. If page
> deletion itself is crash safe, it really makes sense to make
> everything crash safe (especially because we're managing large chunks
> of equisized free space, unlike in heapam). And as I also said back
> then, a 100% crash-safe design could naturally shift the problem of
> nbtree page recycle safety from the producer/VACUUM side, to the
> consumer/_bt_getbuf() side. It should be completely separated from
> when VACUUM runs, and what VACUUM can discover about recycle safety in
> passing, at the end.
>
> That approach would completely eliminate the need to do any work in
> btvacuumcleanup(), which would make it natural to remove
> VACUUM_OPTION_PARALLEL_COND_CLEANUP from nbtree -- the implementation
> of btvacuumcleanup() would just look like hashvacuumcleanup() does now
> -- it could do practically nothing, making this 100% okay.
>
> For now I have my doubts that it is appropriate to make this change.
> It seems as if the question of whether or not
> VACUUM_OPTION_PARALLEL_COND_CLEANUP should be used is basically the
> same question as "Does the vacuumcleanup() callback for this index AM
> look exactly like hashvacuumcleanup()?".
>
> > IMO we can live with the current configuration, just in case users run
> > into such rare situations (especially the latter case). In most cases,
> > parallel vacuum workers launched for index cleanup would exit as
> > no-ops, but the side effects (wasted resources, overhead, etc.) would
> > not be big. If we want to enable it only in particular cases, we would
> > need another way for the index AM to tell lazy vacuum whether or not
> > to allow a parallel worker to process the index at that time. What do
> > you think?
>
> I am concerned about unintended consequences, like never noticing that
> we should really recycle known deleted pages not yet placed in the FSM
> (it's hard to think through very rare cases like this with
> confidence). Is it really so bad if we launch parallel workers that we
> don't really need for a parallel VACUUM?

I don't think it's too bad even if we launch parallel workers for
indexes that don't really need to be processed in parallel. Parallel
workers exit immediately after all indexes are vacuumed, so they would
not affect other parallel operations. And there is no change in terms
of DSM usage, since btree indexes support parallel bulkdelete.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/