Re: What is "wraparound failure", really? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: What is "wraparound failure", really?
Date
Msg-id CA+TgmoZsvdo42wsviLCiK2bUuA3ge0V7=SNNh8sx1WgZxNpEkw@mail.gmail.com
In response to Re: What is "wraparound failure", really?  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: What is "wraparound failure", really?
List pgsql-hackers
On Mon, Jun 28, 2021 at 8:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> But if you're really worried about people setting
> autovacuum_freeze_max_age too high, then maybe we should be talking
> about capping it at a lower level rather than adjusting the docs that
> most users don't read.

The problem is that the setting is measuring something that is a
pretty poor proxy for the thing we actually care about. It's measuring
the XID age at which we're going to start forcing vacuums on tables
that don't otherwise need to be vacuumed, but the thing we care about
is the XID age at which those vacuums are going to *finish*. Now maybe
you think that's a minor difference, and if your tables are small, it
is, but if they're really big, it's not. If you have only tables that
are say 1GB in size and your system is otherwise well-configured, you
could probably crank autovacuum_freeze_max_age up all the way to the
max without a problem. But if you have 1TB tables, you are going to
need a lot more headroom. The exact amount of headroom you need
depends especially on the size of your largest tables, but also on how
well-distributed the relfrozenxid values are, on the total size of
all your tables, on your I/O subsystem, on your XID consumption
rate, on your vacuum delay settings, and on whether you want to make
any allowance for the rare but possible scenario where vacuum dies
with an ERROR. This means that in practice nobody knows whether a
particular setting of autovacuum_freeze_max_age on a particular system
is safe or not, except in the absolutely most obvious cases. Capping
it at a lower level would prevent some people from doing things that
are perfectly safe and still not prevent other people from doing
things that are horribly dangerous.
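
To make the start-versus-finish distinction concrete, here is a rough
back-of-the-envelope sketch. It is not anything PostgreSQL does, and all
of the names and numbers in it are assumptions for illustration: a single
forced vacuum that starts exactly at autovacuum_freeze_max_age, a steady
effective vacuum throughput, a steady XID consumption rate, and a horizon
of about 2^31 XIDs that ignores the safety margin the server keeps for
itself. Real systems also have queueing behind other tables, which this
ignores.

```python
# Rough estimate of the XID age at which a forced vacuum would *finish*,
# as opposed to the age at which it starts. All inputs are hypothetical.

XID_HORIZON = 2**31  # approximate; ignores the server's own safety margin


def xids_consumed_during_vacuum(table_bytes, vacuum_bytes_per_sec, xids_per_sec):
    """XIDs burned while one vacuum pass over the table is running."""
    vacuum_seconds = table_bytes / vacuum_bytes_per_sec
    return vacuum_seconds * xids_per_sec


def finish_age(freeze_max_age, table_bytes, vacuum_bytes_per_sec, xids_per_sec):
    """Projected XID age when a vacuum started at freeze_max_age completes."""
    return freeze_max_age + xids_consumed_during_vacuum(
        table_bytes, vacuum_bytes_per_sec, xids_per_sec
    )


def is_plausibly_safe(freeze_max_age, table_bytes, vacuum_bytes_per_sec, xids_per_sec):
    """True if the projected finish age stays below the horizon."""
    return finish_age(
        freeze_max_age, table_bytes, vacuum_bytes_per_sec, xids_per_sec
    ) < XID_HORIZON


GIB = 2**30
TIB = 2**40
RATE = 10 * 2**20   # assumed effective vacuum throughput: 10 MiB/s
XIDS = 5000         # assumed XID consumption: 5000 per second

# Same setting, very different outcomes depending on table size.
small_ok = is_plausibly_safe(1_900_000_000, GIB, RATE, XIDS)
large_ok = is_plausibly_safe(1_900_000_000, TIB, RATE, XIDS)
```

With these made-up inputs, the 1GB table finishes with plenty of room
while the 1TB table does not, even though the setting is identical.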

I think what we really need here is some kind of deadline-based
scheduler. As Peter says, the problem is that we might run out of
XIDs. The system should be constantly thinking about that and taking
appropriate emergency actions to make sure it doesn't happen. Right
now it's really pretty chill about the possibility of looming
disaster. Imagine that you hire a babysitter and tell them to get the
kids out of the house if there's a fire. While you're out, a volcano
erupts down the block. A giant cloud of ash forms and there's lava
everywhere, even touching the house, which begins to smolder, but the
babysitter just sits there and watches TV. As soon as the first flames
appear, the babysitter stops watching TV, gets the kids, and tries to
leave the premises. That's our autovacuum scheduler! It has no
inclination or ability to see the future; it makes decisions entirely
based on the present state of things. In a lot of cases that's OK, but
sometimes it leads to a completely ridiculous outcome.
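
As a very rough illustration of what "deadline-based" could mean, the
sketch below orders tables by how much XID slack would be left if their
vacuum started right now, rather than waiting for an age threshold to be
crossed. This is purely hypothetical: the Table class, the slack formula,
the throughput and XID rates, and the 2^31 horizon are all assumptions of
mine, not a description of autovacuum or of any proposed patch.

```python
# Hypothetical comparison of a threshold-based trigger with a
# least-slack-first (deadline-style) ordering. Not PostgreSQL code.
from dataclasses import dataclass

XID_HORIZON = 2**31  # approximate; ignores the server's own safety margin


@dataclass
class Table:
    name: str
    size_bytes: int
    xid_age: int  # current age of the table's relfrozenxid


def slack_xids(table, vacuum_bytes_per_sec, xids_per_sec):
    """XIDs left before the horizon if a vacuum of this table started now."""
    vacuum_seconds = table.size_bytes / vacuum_bytes_per_sec
    projected_finish_age = table.xid_age + vacuum_seconds * xids_per_sec
    return XID_HORIZON - projected_finish_age


def threshold_order(tables, freeze_max_age):
    """Present-state rule: only tables already past the threshold qualify."""
    return [t.name for t in tables if t.xid_age > freeze_max_age]


def deadline_order(tables, vacuum_bytes_per_sec, xids_per_sec):
    """Forward-looking rule: least projected slack goes first."""
    ranked = sorted(
        tables, key=lambda t: slack_xids(t, vacuum_bytes_per_sec, xids_per_sec)
    )
    return [t.name for t in ranked]


tables = [
    Table("small", 2**30, 1_500_000_000),  # 1 GiB, older
    Table("big", 2**40, 1_400_000_000),    # 1 TiB, younger
]
by_threshold = threshold_order(tables, freeze_max_age=1_600_000_000)
by_deadline = deadline_order(tables, vacuum_bytes_per_sec=10 * 2**20, xids_per_sec=5000)
```

In this toy case the threshold rule sees nothing to do yet, while the
deadline ordering puts the big table first even though its relfrozenxid
is younger, because its vacuum would take so much longer to finish.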

-- 
Robert Haas
EDB: http://www.enterprisedb.com


