Re: New IndexAM API controlling index vacuum strategies - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: New IndexAM API controlling index vacuum strategies
Date
Msg-id CAH2-Wz=+m+L3AYAL4DV=SMRr2SCjEBcXuebMd-n0Aevdemj0ZQ@mail.gmail.com
Whole thread Raw
In response to Re: New IndexAM API controlling index vacuum strategies  (Andres Freund <andres@anarazel.de>)
Responses Re: New IndexAM API controlling index vacuum strategies
List pgsql-hackers
On Wed, Apr 14, 2021 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
> The reason I didn't do further reviews for things in this thread was
> that I was trying really hard to get the shared memory stats patch into
> a committable shape - there were just not enough hours in the day. I
> think it's to be expected that, during the final CF, there aren't a lot
> of resources for reviewing patches that are substantially new.  Why
> should these new patches have gotten priority over a much older patch
> set that also address significant operational issues?

We're all doing our best.

> It's very common for larger / busier databases to *substantially*
> increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
> of headroom, but a few hundred million. The cost of doing unnecessary
> anti-wraparound vacuums is just too great. And databases on the busier &
> larger side of things are precisely the ones that are more likely to hit
> wraparound issues (otherwise you're just not that likely to burn through
> that many xids).

I think that this was once true, but is now much less common, mostly
due to the freeze map stuff in 9.6. And due a general recognition that
the *risk* of increasing them is just too great (a risk that we can
hope was diminished by the failsafe, incidentally). As an example of
this, Christophe Pettus had a Damascene conversion when it came to
increasing autovacuum_freeze_max_age aggressively, which we explains
here:

https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/

In short, he went from regularly advising clients to increase
autovacuum_freeze_max_age to telling them to specifically advising
them to never touch them.

Even if we assume that I'm 100% wrong about autovacuum_freeze_max_age,
it's still true that the vacuum_failsafe_age GUC is interpreted with
reference to autovacuum_freeze_max_age -- it will always be
interpreted as if it was set to 105% of whatever the current value of
autovacuum_freeze_max_age happens to be (so it's symmetric with the
freeze_table_age GUC and its 95% behavior). So it's never completely
unreasonable in the sense that it directly clashes with an existing
autovacuum_freeze_max_age setting from before the upgrade.

Of course this doesn't mean that there couldn't possibly be any
problems with the new mechanism clashing with
autovacuum_freeze_max_age in some unforeseen way. But, the worst that
can happen is that a user that is sophisticated enough to very
aggressively increase autovacuum_freeze_max_age upgrades to Postgres
14, and then finds that index vacuuming is sometimes skipped. Which
they'll see lots of annoying and scary messages about if they ever
look in the logs. I think that that's an acceptable price to pay to
protect the majority of less sophisticated users.

> And my concern isn't really that vacuum would have finished without a
> problem if cost limiting hadn't been disabled, but that having multiple
> autovacuum workers going all out will cause problems. Like the system
> slowing down so much that it's hard to fix the actual root cause of the
> wraparound - I've seen systems with a bunch unthrottled autovacuum
> overwhelme the IO subsystem so much that simply opening a connection to
> fix the issue took 10+ minutes. Especially on systems with provisioned
> IO (i.e. just about all cloud storage) that's not too hard to hit.

I don't think that it's reasonable to expect an intervention like this
to perfectly eliminate all risk, while at the same time never
introducing any new theoretical risks. (Especially while also being
simple and obviously correct.)

> > If it's intrinsically impossible to advance relfrozenxid, then surely
> > all bets are off. But even in this scenario it's very unlikely that we
> > wouldn't at least do index vacuuming for those index tuples that are
> > dead and safe to delete according to the OldestXmin cutoff. You still
> > have 1.6 billion XIDs before the failsafe first kicks in, regardless
> > of the issue of the OldestXmin/FreezeLimit being excessively far in
> > the past.
>
> As I said above, I don't think the "1.6 billion XIDs" argument has
> merit, because it's so reasonable (and common) to set
> autovacuum_freeze_max_age to something much larger.

No merit? Really? Not even a teeny, tiny, microscopic little bit of
merit? You're sure?

As I said, we handle the case where autovacuum_freeze_max_age is set
to something larger than vacuum_failsafe_age is a straightforward and
pretty sensible way. I am curious, though: what
autovacuum_freeze_max_age setting is "much higher" than 1.6 billion,
but somehow also not extremely ill-advised and dangerous? What number
is that, precisely? Apparently this is common, but I must confess that
it's the first I've heard about it.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: New IndexAM API controlling index vacuum strategies
Next
From: Michael Paquier
Date:
Subject: Re: File truncation within PostgresNode::issues_sql_like() wrong on Windows