Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Date
Msg-id CAH2-Wzm2fpPQ_=pXpRvkNiuTYBGTAUfxRNW40kLitxj9T3Ny7w@mail.gmail.com
In response to Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
List pgsql-hackers
On Sat, Apr 29, 2023 at 7:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> How about
>
> -HINT:  To avoid a database shutdown, [...]
> +HINT:  To prevent XID exhaustion, [...]
>
> ...and "MXID", respectively? We could explain in the docs that vacuum and
> read-only queries still work "when XIDs have been exhausted", etc.

I think that that particular wording works in this example -- we *are*
avoiding XID exhaustion. But it still doesn't really address my
concern -- at least not on its own. I think that we need a term for
xidStopLimit mode (and perhaps multiStopLimit) itself. This is a
discrete state/mode that is associated with a specific mechanism. I'd
like to emphasize the purpose of xidStopLimit (over when xidStopLimit
happens) in choosing this user-facing name.

As you know, the point of xidStopLimit mode is to give autovacuum the
chance to catch up with managing the XID space through freezing: the
available supply of XIDs doesn't meet present demand, and hasn't for
some time, so it finally came to this. Even if we had true 64-bit XIDs
we'd probably still need something similar -- there would still have
to be *some* point that allowing the "freezing deficit" to continue to
grow just wasn't tenable. If a person consistently spends more than
they take in, their "initial bankroll" isn't necessarily relevant. If
our ~2.1 billion XID "bankroll" wasn't enough to avoid xidStopLimit,
why would we expect 8 billion or 20 billion XIDs to have been enough?
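
To make the "bankroll" point concrete, here is a back-of-the-envelope sketch in Python. The consumption and freezing rates are made-up numbers for illustration, not measurements, and days_until_exhaustion is a hypothetical helper rather than anything that exists in PostgreSQL.

```python
def days_until_exhaustion(bankroll, consumed_per_day, frozen_per_day):
    """Return whole days until the XID supply runs out, or None if freezing keeps up.

    All three arguments are illustrative numbers, not values read from a server.
    """
    deficit = consumed_per_day - frozen_per_day  # the daily "freezing deficit"
    if deficit <= 0:
        return None  # supply meets demand; the bankroll is never drawn down
    return bankroll // deficit


# Hypothetical workload: 50 million XIDs consumed per day, 40 million reclaimed by freezing.
for bankroll in (2_100_000_000, 8_000_000_000, 20_000_000_000):
    print(bankroll, days_until_exhaustion(bankroll, 50_000_000, 40_000_000))
```

With these assumed rates the three bankrolls last 210, 800 and 2000 days. A larger bankroll moves the date but does not change the outcome; only closing the deficit does.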

I'm thinking of a user-facing name for xidStopLimit along the lines of
"emergency XID allocation restoration mode" (admittedly that's quite a
mouthful). Something that carries the implication of "imbalance". The
system was configured in a way that turned out to be unsustainable.
The system was therefore forced to "restore sustainability" using the
only tool that remained. This is closely related to the failsafe.

As bad as xidStopLimit is, it won't always be the end of the world --
much depends on individual application requirements.

> (I should probably also add in the commit message that the "shutdown" in the
> message was carried over to MXIDs when they arrived also in 2005).
>
> > Separately, there is a need to update a couple of other places to use
> > this new terminology. The documentation for vacuum_failsafe_age and
> > vacuum_multixact_failsafe_age refer to "system-wide transaction ID
> > wraparound failure", which seems less than ideal, even without your
> > patch.
>
> Right, I'll have a look.

As you know, there is a more general problem with the use of the term
"wraparound" in the docs, and in the system itself (in places like
pg_stat_activity). Even the very basic terminology in this area is
needlessly scary. Terms like "VACUUM (to prevent wraparound)" are
uncomfortably close to "a race against time to avoid data corruption".
The system isn't ever supposed to corrupt data, even if misconfigured
(unless the misconfiguration is very low-level, such as "fsync=off").
Users should be able to take that much for granted.

I don't expect either of us to address that problem in the short term
-- the term "wraparound" is too baked-in for it to be okay to just
remove it overnight. But, it could still make sense for your patch (or
my own) to fully own the fact that "wraparound" is actually a
misnomer. At least when used in contexts like "to prevent wraparound"
(xidStopLimit actually "prevents wraparound", though we shouldn't say
anything about it in a place of prominence). Let's reassure users that
they should continue to take "we won't corrupt your data for no good
reason" for granted.

> I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion.

I had planned to address this with my ongoing work on the "Routine
Vacuuming" docs, but I think that you're right about the necessity of
addressing it as part of this patch.

These extra steps will be required whenever the problem is a leaked
prepared transaction, or something along those lines. That is
increasingly likely to turn out to be the underlying cause of entering
xidStopLimit, given the work we've done on VACUUM over the years. I
still think that "imbalance" is the right way to frame discussion of
xidStopLimit. After all, autovacuum/VACUUM will still spin its wheels
in a futile effort to "restore balance". So it's kinda still about
restoring balance IMV.

--
Peter Geoghegan


