Re: 64-bit XIDs in deleted nbtree pages - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date:
Msg-id: CAH2-WznpMJ6zJB9Oqx2itZEv060Wu3wtXqcSNDeANW_M9FK8ZQ@mail.gmail.com
In response to: Re: 64-bit XIDs in deleted nbtree pages (Masahiko Sawada <sawada.mshk@gmail.com>)
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for working on this!

I'm glad that I finally found time for it! It seems like it'll make
things easier elsewhere.

Attached is v3 of the patch. I'll describe the changes I made in more
detail in my response to your points below.

> I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> of recycling deleted pages.

Cool.

> Given that we can guarantee that deleted pages are never leaked by
> using 64-bit XIDs, I also think we don't need this value. We can do
> amvacuumcleanup only if the table receives enough insertions to update
> the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
> think this is a more desirable behavior. Not skipping amvacuumcleanup
> when there is even one deleted page that we can recycle is very
> conservative.
>
> Considering your idea of keeping the number of newly deleted pages in
> the meta page, I can see a little value in keeping btm_oldest_btpo_xact
> and making it a 64-bit XID. I describe this below.

> Interesting.
>
> I like this idea of triggering btvacuumscan() if there are many newly
> deleted pages. I think this would be helpful especially for the case
> of bulk-deletion on the table. But why do we use the number of *newly*
> deleted pages rather than the total number of deleted pages in the index?

I was unclear here -- I should not have said "newly deleted" pages at
all. What I actually do when calling _bt_update_meta_cleanup_info() is
this (from v3, at the end of btvacuumscan()):

-   _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+   Assert(stats->pages_deleted >= stats->pages_free);
+   pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+   _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
                                 info->num_heap_tuples);

We're actually passing something I have called
"pages_deleted_not_free" here, which is derived from the bulk delete
stats in the obvious way that you see here (subtraction). I'm not
using pages_newly_deleted at all now. Note also that the behavior
inside _bt_update_meta_cleanup_info() no longer varies based on
whether it is called during btvacuumcleanup() or during btbulkdelete()
-- the same rules apply either way. We want to store
pages_deleted_not_free in the metapage at the end of btvacuumscan(),
no matter what.

This same pages_deleted_not_free information is now used by
_bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
high (over 2.5%), then that will trigger a call to btvacuumscan() (we
won't skip scanning the index). Though in practice it probably won't
come up that often -- there just aren't ever that many deleted pages
in most indexes.
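
To make that concrete, here is a rough sketch of the shape of the new
test (names here are illustrative only -- e.g. I'm calling the metapage
field btm_pages_deleted_not_free, which isn't necessarily what v3
actually calls it):

    /* Inside _bt_vacuum_needs_cleanup(), sketch only: */
    BlockNumber total_pages = RelationGetNumberOfBlocks(info->index);
    BlockNumber pages_deleted_not_free = metad->btm_pages_deleted_not_free;

    /* Don't skip btvacuumscan() once deleted-but-not-yet-recycled
     * pages exceed 2.5% of the index: */
    if (pages_deleted_not_free > (BlockNumber) (total_pages * 0.025))
        return true;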

> IIUC if several btbulkdelete executions each deleted fewer than 2% of
> the index's pages, and those deleted pages could not be recycled yet,
> then the number of recyclable pages could exceed 2% of the index in
> total, but amvacuumcleanup() would not trigger btvacuumscan(), because
> the pages newly deleted by the last execution are below the 2%
> threshold. I might be missing something though.

I think you're right -- my idea of varying the behavior of
_bt_update_meta_cleanup_info() based on whether it's being called
during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
half the problem was that I explained the idea badly to begin with).
But, as I said, it's fixed in v3: we simply pass
"pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
now.

Does that make sense? Does it address this concern?

> Also, we need to note that having newly deleted pages doesn't
> necessarily mean they are recyclable at that time. If the global
> xmin is still older than a deleted page's btpo.xact value, we still
> cannot recycle that page. I think btm_oldest_btpo_xact probably
> will help in this case. That is, we store the oldest btpo.xact among
> those newly deleted pages in btm_oldest_btpo_xact, and we trigger
> btvacuumscan() if there are many newly deleted pages (more than 2% of
> the index) and btm_oldest_btpo_xact is older than the global xmin (I
> suppose each newly deleted page could have a different btpo.xact).

I agree that having no XID in the metapage creates a new small
problem. Specifically, there are certain narrow cases that can cause
confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
exist before my patch (kind of).

The simplest example is easy to run into when debugging the patch on
your laptop. Because you're using your personal laptop, and not a real
production server, there will be no concurrent sessions that might
consume XIDs. You can run VACUUM VERBOSE manually several times, but
that alone will never be enough to enable VACUUM to recycle any of the
pages that the first VACUUM manages to mark deleted (the pages
reported as "newly deleted" via the new instrumentation from the
second patch). Note that the master branch is *also* unable to recycle
these deleted pages, simply because the "safe xid" never gets old:
no new XIDs are ever allocated to make it look old, since nothing
else happens on the system. That in itself is not the new problem.
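
(For anyone following along, recyclability comes down to a test like
the one in _bt_page_recyclable(), simplified here -- note that on
recent master the horizon test is spelled with
GlobalVisCheckRemovableXid() rather than a RecentGlobalXmin
comparison, so treat this as a sketch:)

    /* Simplified from _bt_page_recyclable(): a deleted page can only
     * be recycled once no running transaction could still have a link
     * to it, i.e. once its btpo.xact falls behind the xmin horizon: */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;        /* safe to recycle */

With no XIDs being allocated on the system, the horizon simply never
advances past btpo.xact, so this test can never pass.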

The new problem is that _bt_vacuum_needs_cleanup() will no longer be
able to notice that the oldest XID among deleted-but-not-yet-recycled
pages is still too recent to allow recycling -- and if even the oldest
XID is too recent, then no deleted page can be recycled yet (in this
specific case that applies to all deleted pages equally anyway). We
might as well not bother trying
yet, which the old code "gets right" -- but it doesn't get it right
for any good reason. That is, the old code won't have VACUUM scan the
index at all, so it "wins" in this specific scenario.
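
(Concretely, the old code's check is roughly the following, simplified
from the pre-patch _bt_vacuum_needs_cleanup():)

    /* Old-style metapage check, sketch: have VACUUM scan the index
     * only once the oldest btpo.xact on record has fallen behind the
     * xmin horizon -- i.e. once recycling can actually succeed: */
    if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
        TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                              RecentGlobalXmin))
        result = true;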

I think that's okay, though -- it's not a real problem, and actually
makes sense and has other advantages. This is why I believe it's okay:

* We really should never VACUUM the same table again before even one
or two XIDs have been allocated -- yet that's what happens in the
simple laptop test scenario that I described. Surely we should not be
too concerned about
"doing the right thing" under this totally artificial set of
conditions.

(BTW, I've been using txid_current() for my own "laptop testing", as a
way to work around this issue.)

* More generally, if this VACUUM really can't recycle the pages that
the last VACUUM deleted (perhaps because of
the presence of a long-running xact that holds open a snapshot), then
you have lots of *huge* problems already, and this is the least of
your concerns. Besides, at that point an affected VACUUM will be doing
work for an affected index through a btbulkdelete() call, so the
behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.

* As you pointed out already, the oldest XID/deleted page from the
index may be significantly older than the newest. Why should we bucket
them together?

We could easily have a case where most of the deleted pages can be
recycled -- even when all the pages were originally marked deleted by
the same VACUUM operation. If there are lots of pages that actually
can be recycled, it is probably a bad thing to assume that the oldest
XID is representative of all of them. After all, with the patch we
only go out of our way to recycle deleted pages when we are almost
sure that the total number of recyclable pages (pages marked deleted
during a previous VACUUM) exceeds 2.5% of the total size of the index
(for a 10,000 page index, that's over 250 deleted pages).
That broad constraint is important here -- if we do nothing unless
there are lots of deleted pages anyway, we are highly unlikely to ever
err on the side of being too eager (not eager enough seems more likely
to me).

I think that we're justified in making a general assumption inside
_bt_vacuum_needs_cleanup() (documented at the point that we call it,
inside btvacuumscan()): that however many index pages the metapage
says we'll be able to recycle (whatever the field says) will in fact
turn out to be recyclable, should we decide that we need to try. There
are specific cases where that will be kind of wrong, as I've gone
into, but the assumption/design has many more advantages than
disadvantages.

I have tried to capture this in v3 of the patch. Can you take a look?
See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
comments when we call it inside btvacuumscan().

Do you think that those new comments are helpful? Does this address
your concern?

Thanks
-- 
Peter Geoghegan

