Re: xid wraparound danger due to INDEX_CLEANUP false - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: xid wraparound danger due to INDEX_CLEANUP false
Date
Msg-id CAH2-Wzn8heFbG=r+vjWeiuHcxMQ3i2kcLFqgK76ZMKZqR9tqig@mail.gmail.com
Whole thread Raw
In response to Re: xid wraparound danger due to INDEX_CLEANUP false  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu, Apr 16, 2020 at 3:49 PM Andres Freund <andres@anarazel.de> wrote:
> I think we really just stop being miserly and update the xid to be
> FrozenTransactionId or InvalidTransactionId when vacuum encounters one
> that's from before the the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid).  That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisited those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not the
problems in Postgres 11 and 12 features in this area -- you're talking
about what happens to empty pages, not what happens to deleted pages.)

I'll say some more things about the less ambitious goal of more eager
recycling of pages, as they are deleted:

An individual VACUUM operation cannot recycle a page right after
_bt_pagedel() is called to delete the page. VACUUM will both set a
target leaf page half dead and delete it all at once in _bt_pagedel()
(it does that much in the simple and common case). Again, this is
because recycling very soon after the call to _bt_pagedel() will
introduce races with concurrent index scans -- they could fail to
observe the deletion in the parent (i.e. see its downlink, since child
isn't even half dead), and land on a concurrently recycled page
(VACUUM concurrently marks the page half-dead, fully dead/deleted, and
then even goes as far as recycling it). So the current design makes a
certain amount of sense -- we can't be super aggressive like that.
(Actually, maybe it doesn't make sense to not just put the page in the
FSM there and then -- see "Thinks some more" below.)

Even still, nothing stops the same VACUUM operation from (for example)
remembering a list of pages it has deleted during the current scan,
and then coming back at the end of the bulk scan of the index to
reconsider if it can recycle the pages now (2 minutes later instead of
2 months later). With a new RecentGlobalXmin (or something that's
conceptually like a new RecentGlobalXmin).

Similarly, we could do limited VACUUMs that only visit previously
deleted pages, once VACUUM is taught to chain deleted pages together
to optimize recycling. We don't have to repeat another pass over the
entire index to recycle the pages because of this special deleted page
linking. This is something that we use when we have to recycle pages,
but it's a " INDEX_CLEANUP = off" index VACUUM -- we don't really want
to do most of the stuff that index vacuuming needs to do, but we must
still visit the metapage to check btm_oldest_btpo_xact, and then maybe
walk the deleted page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has
its own _bt_page_recyclable() check, probably because the FSM isn't
crash safe. Maybe we could improve matters by teaching _bt_pagedel()
to put a page it deleted in the FSM immediately -- no need to wait
until the next index VACUUM for the RecordFreeIndexPage() call. It
still isn't quite enough that _bt_getbuf() distrusts the FSM, so we'd
also have to teach _bt_getbuf() some heuristics that made it
understand that VACUUM is now designed to put stuff in the FSM
immediately, so we don't have to wait for the next VACUUM operation to
get to it. Maybe _bt_getbuf() should try the FSM a few times before
giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it
would make those pages reclaimable much sooner. Also, it wouldn't
solve the wraparound problem, but that is a bug, not a
performance/efficiency issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space
utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and have
> _bt_getbuf() do the unlinking.

That won't work, since you're now talking about pages that aren't
deleted (or even half-dead) that are just candidates to be deleted
because they're empty. So you'd have to do all the steps in
_bt_pagedel() within a new _bt_getbuf() path, which would have many
deadlock hazards. Unlinking the page from the tree itself (deleting)
is really complicated.

> I'm not sure if the current locking
> model assumes anywhere that there is only one process (vacuum) unlinking
> pages though?

I'm not sure, though _bt_unlink_halfdead_page() has comments supposing
that there could be concurrent page deletions like that.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Allow pg_read_all_stats to read pg_stat_progress_*
Next
From: Michael Paquier
Date:
Subject: Re: [PATHC] Fix minor memory leak in pg_basebackup