Re: New strategies for freezing, advancing relfrozenxid early - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: New strategies for freezing, advancing relfrozenxid early
Date
Msg-id CAH2-Wzm_=mrWO+kUAJbR_gM_6RzpwVA8n8e4nh3dJGHdw_urew@mail.gmail.com
Whole thread Raw
In response to Re: New strategies for freezing, advancing relfrozenxid early  (Andres Freund <andres@anarazel.de>)
Responses Re: New strategies for freezing, advancing relfrozenxid early
List pgsql-hackers
On Tue, Nov 15, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
> > Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.

Attached is v7, which incorporates much of your feedback. Thanks for the review!

> > +/*
> > + * State used by VACUUM to track what the oldest extant XID/MXID will become
> > + * when determing whether and how to freeze a page's heap tuples via calls to
> > + * heap_prepare_freeze_tuple.
>
> Perhaps this could say something like "what the oldest extant XID/MXID
> currently is and what it would be if we decide to freeze the page" or such?

Fixed.

> > + * The relfrozenxid_out and relminmxid_out fields are the current target
> > + * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
>
> "VACUUM caller's heap rel." could stand to be rephrased.

Fixed.

> > + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> > + * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
> > + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> > + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> > + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> > + * pg_class tuple.
> > + *
> > + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> > + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> > + */
>
> relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

Why not? I think that that's exactly what it is. We maintain these
alternative "oldest extant XID" values so that vacuumlazy.c's
lazy_scan_prune function can "opt out" of freezing. This is exactly
the same as what we do in lazy_scan_noprune, both conceptually and at
the implementation level.

> I think it might be better to just always maintain the nofreeze state.

Not sure. Even if there is very little to gain in cycles by not
maintaining the "nofreeze" cutoffs needlessly, it's still a pure waste
of cycles that can easily be avoided. So it just feels natural to not
waste those cycles -- it may even make the design clearer.

> > +typedef struct HeapPageFreeze
> > +{
> > +     /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> > +     bool            freeze;
>
> s/freeze/freeze_required/?

Fixed.

> Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
> in more of them in via HeapPageFreeze?

HeapPageFreeze is supposed to be mutable state used for one single
page, though. Seems like we should use a separate immutable struct for
this instead.

I've already prototyped a dedicated immutable "cutoffs" struct, which
is instantiated exactly once per VACUUM. Seems like a good approach to
me. The immutable state can be shared by heapam.c's
heap_prepare_freeze_tuple(), vacuumlazy.c, and even
vacuum_set_xid_limits() -- so everybody can work off of the same
struct directly. Will try to get that into shape for the next
revision.

> What does 'xtrack' stand for? Xid Tracking?

Yes.

> >   * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
> >   * returned true for when called.  A later heap_freeze_execute_prepared call
> > - * will execute freezing for caller's page as a whole.
> > + * will execute freezing for caller's page as a whole.  Caller should also
> > + * initialize xtrack fields for page as a whole before calling here with first
> > + * tuple for the page.  See page_frozenxid_tracker comments.
>
> s/should/need to/?

Changed it to "must".

> page_frozenxid_tracker appears to be a dangling pointer.

I think that you mean that the code comments reference an obsolete
type name -- fixed.

> > +      * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
> > +      * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
>
> Hm. Perhaps we should just rename them if it requires this kind of
> explanation? They're really not good names.

Agreed -- this can be taken care of as part of using a new VACUUM
operation level struct that is passed as immutable state, which I went
into a moment ago. That centralizes the definitions, which makes it
far easier to understand which cutoff is which. For now I've kept the
names as they were.

> Could use TransactionIdOlder().

I suppose, but the way I've done it feels a bit more natural to me,
and appears more often elsewhere. Not sure.

> > @@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >                        */
> >                       Assert(!freeze_xmax);
> >                       Assert(TransactionIdIsValid(newxmax));
> > -                     if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
> > -                             *relfrozenxid_out = newxmax;
> > +                     Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
> > +
&xtrack->relfrozenxid_nofreeze_out,
> > +
&xtrack->relminmxid_nofreeze_out));
> > +                     if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
> > +                             xtrack->relfrozenxid_out = newxmax;
>
> Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
> end of the routine, for all paths?

The problem with that is that we cannot Assert() when we're removing a
Multi via FRM_INVALIDATE_XMAX processing in certain cases (I tried it
this way myself, and the assertion fails there). This can happen when
the call to FreezeMultiXactId() for the xmax determined that we should
do FRM_INVALIDATE_XMAX processing for the xmax due to the Multi being
"isLockOnly" and preceding "OldestVisibleMXactId[MyBackendId])". Which
is relatively common.

I fixed this by moving the assert further down, while still only
checking the FRM_RETURN_IS_XID and FRM_RETURN_IS_MULTI cases.

> Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
> odd to have the other fields suffied with _out but not this one?

Fixed this by not having an "_out" suffix for any of these mutable
fields from HeapPageFreeze. Now everything is consistent. (The "_out"
convention is totally redundant, now that we have the HeapPageFreeze
struct, which makes it obvious that it is is all mutable state.)

> Won't using OldestXmin instead of FreezeLimit potentially cause additional
> conflicts? Is there any reason to not compute an accurate value?

This is a concern that I share. I was hoping that I'd be able to get
away with using OldestXmin just for this, because it's simpler that
way. But I had my doubts about it already.

I wonder why it's correct to use FreezeLimit for this on HEAD, though.
What about those FRM_INVALIDATE_XMAX cases that I just mentioned we
couldn't Assert() on? That case effectively removes XIDs that might be
well after FreezeLimit. Granted it might be safe in practice, but it's
far from obvious why it is safe.

Perhaps we can fix this in a not-too-invasive way by reusing
LVPagePruneState.visibility_cutoff_xid for FREEZE_PAGE conflicts (not
just VISIBLE conflicts) in cases where that was possible (while still
using OldestXmin as a fallback in much rarer cases). In practice we're
only triggering freezing eagerly because the page is already expected
to be set all-visible (the whole point is that we'd prefer if it was
set all-frozen instead of all-visible).

(I've not done this in v7, but it's on my TODO list.)

Note that the patch already maintains
LVPagePruneState.visibility_cutoff_xid when there are some LP_DEAD
items on the page, because we temporarily ignore those LP_DEAD items
when considering the eager freezing stuff......

> >               if (ItemIdIsDead(itemid))
> >               {

> >                       deadoffsets[lpdead_items++] = offnum;
> > -                     prunestate->all_visible = false;
> > -                     prunestate->has_lpdead_items = true;
> >                       continue;
> >               }
>
> What does this have to do with the rest of the commit? And why are we doing
> this?

....which is what you're asking about here.

The eager freezing strategy triggers page-level freezing for any page
that is about to become all-visible, so that it can be set all-frozen
instead. But that's not entirely straightforward when there happens to
be some LP_DEAD items on the heap page. There are really two ways that
a page can become all-visible during VACUUM, and we want to account
for that here. With eager freezing we want to make the pages become
all-frozen instead of just all-visible, regardless of which heap pass
(first pass or second pass) the page is set to become all-visible (and
maybe even all-frozen).

The comments that you mention were moved around a bit in passing.

Note that we still set prunestate->all_visible to false inside
lazy_scan_prune when we see remaining LP_DEAD stub items. We just do
it later on, after we've decided on freezing stuff. (Obviously it
wouldn't be okay to return to lazy_scan_heap without unsetting
prunestate->all_visible if there are LP_DEAD items.)

> Seems quite confusing to enter a block with described as "We're freezing the
> page." when we're not freezing anything (tuples_frozen == 0).

> I don't really get what freeze_all_eligible is trying to do.

freeze_all_eligible (and the "tuples_frozen == 0" behavior) are both
there because we can mark a page as all-frozen in the VM without
freezing any of its tuples first. When that happens, we must make sure
that "prunestate->all_frozen" is set to true, so that we'll actually
set the all-frozen bit. At the same time, we need to be careful about
the case where we *could* set the page all-frozen if we decided to
freeze all eligible tuples -- we need to handle the case where we
choose against freezing (and so can't set the all-frozen bit in the
VM, and so must actually set "prunestate->all_frozen" to false).

This is all kinda tricky because we're simultaneously dealing with the
actual state of the page, and the anticipated state of the page in the
near future. Closely related concepts, but distinct in important ways.

> >  #ifdef USE_ASSERT_CHECKING
> >       /* Note that all_frozen value does not matter when !all_visible */
> > -     if (prunestate->all_visible)
> > +     if (prunestate->all_visible && lpdead_items == 0)
> >       {
> >               TransactionId cutoff;
> >               bool            all_frozen;
> > @@ -1849,8 +1876,7 @@ retry:
> >               if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
> >                       Assert(false);
>
> Not related to this change, but why isn't this just
> Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?

It's just a matter of personal preference. I prefer to have a clear
block of related code that contains multiple related assertions. You
would probably have declared PG_USED_FOR_ASSERTS_ONLY variables at the
top of lazy_scan_prune instead. FWIW if you did it the other way the
assertion would actually have to include a "!prunestate->all_visible"
test that short circuits the heap_page_is_all_visible() call from the
Assert().

> > Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.

> This should include a description of the memory usage effects.

The visibilitymap.c side of this is the least worked out part of the
patch series, by far. I have deliberately put off work on the data
structure itself, preferring to focus on the vacuumlazy.c side of
things for the time being. But I still agree -- fixed by acknowledging
that that particular aspect of resource management is unresolved.

I did have an open TODO before in the commit message, which is now
improved based on your feedback: it now fully owns the fact that we
really ignore the impact on memory usage right now. Just because that
part is very WIP (much more so than every other part).

> > This has significant advantages over the previous approach of using the
> > authoritative VM fork to decide on which pages to skip.  The number of
> > heap pages processed will no longer increase when some other backend
> > concurrently modifies a skippable page, since VACUUM will continue to
> > see the page as skippable (which is correct because the page really is
> > still skippable "relative to VACUUM's OldestXmin cutoff").
>
> Why is it an advantage for the number of pages to not increase?

The commit message goes into that immediately after the last line that
you quoted.  :-)

Having an immutable structure will help us, both in the short term,
for this particular project, and the long term. for other VACUUM
enhancements.

We need to have something that drives the cost model in vacuumlazy.c
for the skipping strategy stuff -- we need to have advanced
information about costs that drive the decision making process. Thanks
to VM snapshots, the cost model is able to reason about the cost of
relfrozenxid advancement precisely, in terms of "extra" scanned_pages
implied by advancing relfrozenxid during this VACUUM. That level of
precision is pretty nice IMV. It's not strictly necessary, but it's
nice to be able to make a precise accurate comparison between each of
the two skipping strategies.

Did you happen to look at the 6th and final patch? It's trivial, but
can have a big impact. It sizes dead_items while capping its sized
based on scanned_pages, not based on rel_pages. That's obviously
guaranteed to be correct.  Note also that the 2nd patch teaches VACUUM
VERBOSE to report the final number of scanned_pages right at the
start, before scanning anything -- so it's a useful basis for much
better progress reporting in pg_stat_progress_vacuum. Stuff like that
also becomes very easy with VM snapshots.

Then there is the more ambitious stuff, that's not in scope for this
project. Example: Perhaps Sawada san will be able to take the concept
of visibility map snapshots, and combine it with his Radix tree design
-- which could presumably benefit from advanced knowledge of which
pages can be scanned. This is information that is reliable, by
definition. In fact I think that it would make a lot of sense for this
visibility map snapshot data structure to be exactly the same
structure used to store dead_items. They really are kind of the same
thing. The design can reason precisely about which heap pages can ever
end up having any LP_DEAD items. (It's already trivial to use the VM
snapshot infrastructure as a precheck cache for dead_items lookups.)

> > Non-aggressive VACUUMs now explicitly commit to (or decide against)
> > early relfrozenxid advancement up-front.
>
> Why?

We can advance relfrozenxid because it's cheap to, or because it's
urgent (according to autovacuum_freeze_max_age). This is kind of true
on HEAD already due to the autovacuum_freeze_max_age "escalate to
aggressive" thing -- but we can do much better than that. Why not
decide to advance relfrozenxid when (say) it's only *starting* to get
urgent when it happens to be relatively cheap (though not dirt cheap)?
We make relfrozenxid advancement a deliberate decision that weighs
*all* available information, and has a sense of the needs of the table
over time.

The user experience is important here. Going back to a model where
there is really just one kind of lazy VACUUM makes a lot of sense. We
should have much more approximate guarantees about relfrozenxid
advancement, since that's what gives us the flexibility to find a
cheaper (or more stable) way of keeping up over time. It matters that
we keep up over time, but it doesn't matter if we fall behind on
relfrozenxid advancement -- at least not if we don't also fall behind
on the work of freezing physical heap pages.

> > VACUUM will now either scan
> > every all-visible page, or none at all.  This replaces lazy_scan_skip's
> > SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
> > relfrozenxid advancement (see commit bf136cf6), but left many of the
> > details to chance.
>
> The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
> think it should be mentioned here.

Agreed. Fixed.

> To me this is something that ought to be changed separately from the rest of
> this commit.

Maybe, but I'd say it depends on the final approach taken -- the
visibilitymap.c aspects of the patch are the least settled. I am
seriously considering adding prefetching to the vm snapshot structure,
which would make it very much a direct replacement for
SKIP_PAGES_THRESHOLD.

Separately, I'm curious about what you think of VM snapshots from an
aio point of view. Seems like it would be ideal for prefetching for
aio?

> > TODO: We don't spill VM snapshots to disk just yet (resource management
> > aspects of VM snapshots still need work).  For now a VM snapshot is just
> > a copy of the VM pages stored in local buffers allocated by palloc().
>
> HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
> 2**32 - 1 blocks. So the max FSM size is 131458 pages, a bit more than 1GB. Is
> that correct?

I think that you meant "max VM size". That sounds correct to me.

> For large relations that are already nearly all-frozen this does add a
> noticable amount of overhead, whether spilled to disk or not. Of course
> they're also not going to be vacuumed super often, but ...

I wouldn't be surprised if the patch didn't work with relations that
approach 32 TiB in size. As I said, the visibilitymap.c data structure
is the least worked out piece of the project.

> Perhaps worth turning the VM into a range based description for the snapshot,
> given it's a readonly datastructure in local memory? And we don't necessarily
> need the all-frozen and all-visible in memory, one should suffice? We don't
> even need random access, so it could easily be allocated incrementally, rather
> than one large allocation.

Definitely think that we should do simple run-length encoding, stuff
like that. Just as long as it allows vacuumlazy.c to work off of a
true snapshot, with scanned_pages known right from the start. The
consumer side of things has been my focus so far.

> Hm. It's a bit sad to compute the snapshot after determining OldestXmin.
>
> We probably should refresh OldestXmin periodically. That won't allow us to get
> a more aggressive relfrozenxid, but it'd allow to remove more gunk.

That may well be a good idea, but I think that it's also a good idea
to just not scan heap pages that we know won't have XIDs < OldestXmin
(OldestXmin at the start of the VACUUM). That visibly makes the
problem of "recently dead" tuples that cannot be cleaned up a lot
better, without requiring that we do anything with OldestXmin.

I also think that there is something to be said for not updating the
FSM for pages that were all-visible at the beginning of the VACUUM
operation. VACUUM is currently quite happy to update the FSM with its
own confused idea about how much free space there really is on heap
pages with recently dead (dead but not yet removable) tuples. That's
really bad, but really subtle.

> What does it mean to "skip lazily"?

Skipping even all-visible pages, prioritizing avoiding work over
advancing relfrozenxid. This is a cost-based decision. As I mentioned
a moment ago, that's one immediate use of VM snapshots (it gives us
precise information to base our decision on, that simply *cannot*
become invalid later on).

> > +             /*
> > +              * Visibility map page copied to local buffer for caller's snapshot.
> > +              * Caller requires an exact count of all-visible and all-frozen blocks
> > +              * in the heap relation.  Handle that now.
>
> This part of the comment seems like it actually belongs further down?

No, it just looks a bit like that because of the "truncate in-memory
VM" code stanza. It's actually the right order.

> > +              * Must "truncate" our local copy of the VM to avoid incorrectly
> > +              * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
> > +              * this by clearing irrelevant bits on the last VM page copied.
> > +              */
>
> Hm - why would those bits already be set?

No real reason, we "truncate" like this defensively. This will
probably look quite different before too long.

> > Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.

> What's the logic behind a hard threshold?  Suddenly freezing everything on a
> huge relation seems problematic. I realize that never getting all that far
> behind is part of the theory, but I don't think that's always going to work.

It's a vast improvement on what we do currently, especially in
append-only tables.

There is simply no limit on how many physical heap pages will have to
be frozen when there is an aggressive mode VACUUM. It could be
terabytes, since table age predicts precisely nothing about costs.
With the patch we have a useful limit for the first time, that uses
physical units (the only kind of units that make any sense).

Admittedly we should really have special instrumentation that reports
when VACUUM must do "catch up freezing" when the
vacuum_freeze_strategy_threshold threshold is first crossed, to help
users to make better choices in this area. And maybe
vacuum_freeze_strategy_threshold should be lower by default, so it's
not as noticeable. (The GUC partly exists as a compatibility option, a
bridge to the old lazy behavior.)

Freezing just became approximately 5x cheaper with the freeze plan
deduplication work (commit 9e540599). To say nothing about how
vacuuming indexes became a lot cheaper in recent releases. So to some
extent we can afford to be more proactive here. There are some very
nonlinear cost profiles involved here due to write amplification
effects. So having a strong bias against write amplification seems
totally reasonable to me -- we can potentially "get it wrong" and
still come out ahead, because we at least had the right idea about
costs.

I don't deny that there are clear downsides, though. I am convinced
that it's worth it -- performance stability is what users actually
complain about in almost all cases. Why should performance stability
be 100% free?

> Wouldn't a better strategy be to freeze a percentage of the relation on every
> non-aggressive vacuum? That way the amount of work for an eventual aggressive
> vacuum will shrink, without causing individual vacuums to take extremely long.

I think that it's better to avoid aggressive mode altogether. By
committing to advancing relfrozenxid by *some* amount in ~all VACUUMs
against larger tables, we can notice when we don't actually need to do
very much freezing to keep relfrozenxid current, due to workload
characteristics. It depends on workload, of course. But if we don't
try to do this we'll never notice that it's possible to do it.

Why should we necessarily need to freeze very much, after a while? Why
shouldn't most newly frozen pages stay frozen ~forever after a little
while?

> The other thing that I think would be to good to use is a) whether the page is
> already in s_b, and b) whether the page already is dirty. The cost of freezing
> shrinks significantly if it doesn't cause an additional read + write. And that
> additional IO IMO one of the major concerns with freezing much more
> aggressively in OLTPish workloads where a lot of the rows won't ever get old
> enough to need freezing.

Maybe, but I think that systematic effects are more important. We
freeze eagerly during this VACUUM in part because it makes
relfrozenxid advancement possible in the next VACUUM.

Note that eager freezing doesn't freeze the page unless it's already
going to set it all-visible. That's another way in which we ameliorate
the problem of freezing when it makes little sense to -- even with
eager freezing strategy, we *don't* freeze heap pages where it
obviously makes little sense to. Which makes a huge difference on its
own.

There is good reason to believe that most individual heap pages are
very cold data, even in OLTP apps. To a large degree Postgres is
successful because it is good at inexpensively storing data that will
possibly never be accessed:


https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

Speaking of OLTP apps:

in many cases VACUUM will prune just to remove one or two heap-only
tuples, maybe even generating an FPI in the process. But the removed
tuple wasn't actually doing any harm -- an opportunistic prune could
have done the same thing later on, once we'd built up some more
garbage tuples. So the only reason to prune is to freeze the page. And
yet right now we don't recognize this and freeze the page to get
*some* benefit out of the arguably needlessly prune. This is quite
common, in fact.

> The most significant aspect of anti-wrap autvacuums right now is that they
> don't auto-cancel. Is that still used? If so, what's the threshold?

This patch set doesn't change anything about antiwraparound
autovacuums -- though it does completely eliminate aggressive mode (so
it's a little like Postgres 8.4).

There is a separate thread discussing the antiwraparound side of this, actually:

https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com

I think that I will need to invent a new type of autovacuum that's
similar to antiwraparound autovacuum, but doesn't have the
cancellation behavior -- that is more or less prerequisite to
committing this patch series. We can accept some risk of relfrozenxid
falling behind if that doesn't create any real risk of antiwraparound
autovacuums.

We can retain antiwraparound autovacuum, which should kick in only
when the new kind of autovacuum has failed to advance relfrozenxid,
having had the opportunity. Maybe antiwraparound autovacuum should be
triggered when age(relfrozenxid) is twice the value of
autovacuum_freeze_max_age. The new kind of autovacuum would trigger at
the same table age that triggers antiwraparound autovacuum with the
current design.

So antiwraparound autovacuum would work in the same way, but would be
much less common -- even for totally static tables. We'd at least be
sure that the auto cancellation behavior was *proportionate* to the
problem at hand, because we'll always have tried and ultimately failed
to advance relfrozenxid without activating the auto cancellation
behavior. We wouldn't trigger a very disruptive behavior routinely,
without any very good reason.

> > Now every VACUUM might need to wait for a cleanup lock, though few will.
> > It can only happen when required to advance relfrozenxid to no less than
> > half way between the existing relfrozenxid and nextXID.
>
> Where's that "halfway" bit coming from?

We don't use FreezeLimit within lazy_scan_noprune in the patch that
gets rid of aggressive mode VACUUM. We use something called minXid in
its place. So a different timeline to freezing (even for tables where
we always use the lazy freezing strategy).

The new minXid cutoff (used by lazy_scan_noprune) comes from this
point in vacuum_set_xid_limits():

+ *minXid = nextXID - (freeze_table_age / 2);
+ if (!TransactionIdIsNormal(*minXid))
+     *minXid = FirstNormalTransactionId;

So that's what I meant by "half way".

(Note that minXid is guaranteed to be <= FreezeLimit, which is itself
guaranteed to be <= OldestXmin, no matter what.)

> Isn't "half way between the relfrozenxid and nextXID" a problem for instances
> with longrunning transactions?

Should we do less relfrozenxid advancement because there is a long
running transaction, though? It's obviously seriously bad when things
are blocked by a long running transaction, but I don't see the
connection between that and how we wait for cleanup locks. Waiting for
cleanup locks is always really, really bad, and can be avoided in
almost all cases.

I suspect that I still haven't been aggressive enough in how minXid is
set, BTW -- we should be avoiding waiting for a cleanup lock like the
plague. So "half way" isn't enough. Maybe we should have a LOG message
in cases where it actually proves necessary to wait, because it's just
asking for trouble (at least when we're running in autovacuum).

> Wouldn't this mean that wait for every page if
> relfrozenxid can't be advanced much because of a longrunning query or such?

Old XIDs always start out as young XIDs. Which we're now quite willing
to freeze when conditions look good.

Page level freezing always freezes all eligible XIDs on the page when
triggered, no matter what the details may be. This means that the
oldest XID on a heap page is more or less always an XID that's after
whatever OldestXmin was for the last VACUUM that ran and froze the
page, whenever that happened, and regardless of the mix of XID ages
was on the page at that time.

As a consequence, lone XIDs that are far older than other XIDs on the
same page become much rarer than what you'd see with the current
design -- they have to "survive" multiple VACUUMs, not just one
VACUUM. The best predictor of XID age becomes the time that VACUUM
last froze the page as a whole -- so workload characteristics and
natural variations are much much less likely to lead to problems from
waiting for cleanup locks. (Of course it also helps that we'll try
really hard to do that, and almost always prefer lazy_scan_noprune
processing.)

There is some sense in which we're trying to create a virtuous cycle
here. If we are always in a position to advance relfrozenxid by *some*
amount each VACUUM, however small, then we will have many individual
opportunities (spaced out over multiple VACUUM operations) to freeze
tuples on any heap tuples that (for whatever reason) are harder to get
a cleanup lock on, and then catch up on relfrozenxid by a huge amount
whenever we "get lucky". We have to "keep an open mind" to ever have
any chance of "getting lucky" in this sense, though.

> > VACUUM is the only
> > mechanism that can claw back MultixactId space, so allowing VACUUM to
> > consume MultiXactId space (for any reason) adds to the risk that the
> > system will trigger the multiStopLimit wraparound protection mechanism.
>
> Strictly speaking that's not quite true, you can also drop/truncate tables ;)

Fixed.

--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Next
From: Tom Lane
Date:
Subject: Re: test/modules/test_oat_hooks vs. debug_discard_caches=1