Thread: Decoupling antiwraparound autovacuum from special rules around auto cancellation

Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
I think that we should decouple the PROC_VACUUM_FOR_WRAPAROUND
autocancellation behavior in ProcSleep() from antiwraparound
autovacuum itself. In other words I think that it should be possible
to cancel an autovacuum that happens to be an antiwraparound
autovacuum, just as if it were any other autovacuum -- because it usually
is no different in any real practical sense. Or at least it shouldn't
be seen as fundamentally different to other autovacuums at first,
before relfrozenxid attains an appreciably greater age (definition of
"appreciably greater" is TBD).

Why should the PROC_VACUUM_FOR_WRAPAROUND behavior happen on *exactly*
the same timeline as the one used to launch an antiwraparound
autovacuum, though? There is no inherent reason why we have to do both
things at exactly the same XID-age-wise time. But there is reason to
think that doing so could make matters worse rather than better [1].

More generally I think that it'll be useful to perform "aggressive
behaviors" on their own timeline, with no two distinct aggressive
behaviors applied at exactly the same time. In general we ought to
give a less aggressive approach some room to succeed before escalating
to a more aggressive approach -- we should see if a less aggressive
approach will work on its own. The failsafe is the most aggressive
intervention of all. The PROC_VACUUM_FOR_WRAPAROUND behavior is almost
as aggressive, and should happen sooner. Antiwraparound autovacuum
itself (which is really a separate thing to
PROC_VACUUM_FOR_WRAPAROUND) is less aggressive still. Then you have
things like the cutoffs in vacuumlazy.c that control things like
freezing.

In short, I'm arguing for an "escalatory" approach that applies each behavior
at different times. The exact timelines we'd want are of course
debatable, but the value of having multiple distinct timelines (one
per aggressive behavior) is far less debatable. We should give
problems a chance to "resolve themselves", at least up to a point.

The latest version of my in progress VACUUM patch series [2]
completely removes the concept of aggressive VACUUM as a discrete mode
of operation inside vacuumlazy.c. Every existing "aggressive-ish
behavior" will be retained in some form or other, but they'll be
applied on separate timelines, in proportion to the problem at hand.
For example, we'll have a separate XID cutoff for waiting for a
cleanup lock the hard way -- we will no longer use FreezeLimit for
that, since that doesn't give freezing a chance to happen in the next
VACUUM. The same VACUUM operation that is the first one that is
capable of freezing should ideally not *also* be the first one that
has to wait for a cleanup lock. We should be willing to put off
waiting for a cleanup lock for much longer than we're willing to put
off freezing. Reusing the same cutoff just makes life harder.
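
In code-sketch terms (hypothetical names, nothing like the patch's
actual structure), the decision becomes something like this:

#include <stdbool.h>

/*
 * Sketch only: decide whether this VACUUM should wait indefinitely for
 * a cleanup lock on a page it could not process opportunistically.  The
 * wait cutoff is deliberately much older than the freeze cutoff, so the
 * first VACUUM capable of freezing a page is never also the first one
 * forced to wait for a cleanup lock.
 */
static bool
should_wait_for_cleanup_lock(long long oldest_unfrozen_xid_age,
                             long long cleanup_lock_wait_age)
{
    return oldest_unfrozen_xid_age >= cleanup_lock_wait_age;
}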

Clearly the idea of decoupling the PROC_VACUUM_FOR_WRAPAROUND behavior
from antiwraparound autovacuum is conceptually related to my patch
series, but it can be treated as separate work. That's why I'm
starting another thread now.

There is another idea in that patch series that also seems worth
mentioning as relevant (but not essential) to this discussion on this
thread: it would be better if antiwraparound autovacuum was simply
another way to launch an autovacuum, which isn't fundamentally
different to any other. I believe that users will find this conceptual
model a lot easier, especially in a world where antiwraparound
autovacuums naturally became rare (which is the world that the big
patch series seeks to bring about). It'll make antiwraparound
autovacuum "the threshold of last resort", only needed when
conventional tuple-based thresholds don't trigger at all for an
extended period of time (e.g., for static tables).

Perhaps it won't be trivial to fix autovacuum.c in the way I have in
mind (which is to split PROC_VACUUM_FOR_WRAPAROUND into two flags that
serve two separate purposes). I haven't considered if we're
accidentally relying on the coupling to avoid confusion within
autovacuum.c. That doesn't seem important right now, though.

[1] https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015
[2] https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
-- 
Peter Geoghegan



On Wed, 2022-10-19 at 14:58 -0700, Peter Geoghegan wrote:
> Why should the PROC_VACUUM_FOR_WRAPAROUND behavior happen on
> *exactly*
> the same timeline as the one used to launch an antiwraparound
> autovacuum, though?

The terminology is getting slightly confusing here: by
"antiwraparound", you mean that it's not skipping unfrozen pages, and
therefore is able to advance relfrozenxid. Whereas the
PROC_VACUUM_FOR_WRAPAROUND is the same thing, except done with greater
urgency because wraparound is imminent. Right?

> There is no inherent reason why we have to do both
> things at exactly the same XID-age-wise time. But there is reason to
> think that doing so could make matters worse rather than better [1].

Can you explain?

Regards,
    Jeff Davis




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Thu, Oct 20, 2022 at 11:09 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The terminology is getting slightly confusing here: by
> "antiwraparound", you mean that it's not skipping unfrozen pages, and
> therefore is able to advance relfrozenxid. Whereas the
> PROC_VACUUM_FOR_WRAPAROUND is the same thing, except done with greater
> urgency because wraparound is imminent. Right?

Not really.

I started this thread to discuss a behavior in autovacuum.c and proc.c
(the autocancellation behavior), which is, strictly speaking, not
related to the current vacuumlazy.c behavior we call aggressive mode
VACUUM. Various hackers have in the past described antiwraparound
autovacuum as "implying aggressive", which makes sense; what's the
point in doing an antiwraparound autovacuum that can almost never
advance relfrozenxid?

It is nevertheless true that antiwraparound autovacuum is an
independent behavior to aggressive VACUUM. The former is an autovacuum
thing, and the latter is a VACUUM thing. That's just how it works,
mechanically.

If this division seems artificial or pedantic to you, then consider
the fact that you can quite easily get a non-aggressive antiwraparound
autovacuum by using the storage option called
autovacuum_freeze_max_age (instead of the GUC):

https://postgr.es/m/CAH2-Wz=DJAokY_GhKJchgpa8k9t_H_OVOvfPEn97jGNr9W=deg@mail.gmail.com

This is even a case where we'll output a distinct description in the
server log when autovacuum logging is enabled and gets triggered. So
while there may be no point in an antiwraparound autovacuum that is
non-aggressive, that doesn't stop them from happening. Regardless of
whether or not that's an intended behavior, that's just how the
mechanism has been constructed.

> > There is no inherent reason why we have to do both
> > things at exactly the same XID-age-wise time. But there is reason to
> > think that doing so could make matters worse rather than better [1].
>
> Can you explain?

Why should the special autocancellation behavior for antiwraparound
autovacuums kick in at exactly the same point that we first launch an
antiwraparound autovacuum? Maybe that aggressive intervention will be
needed, in the end, but why start there?

With my patch series, antiwraparound autovacuums still occur, but
they're confined to things like static tables -- things that are
pretty much edge cases. They still need to behave sensibly (i.e.
reliably advance relfrozenxid based on some principled approach), but
now they're more like "an autovacuum that happens because no other
condition triggered an autovacuum". To some degree this is already the
case, but I'd like to be more deliberate about it.

Leaving my patch series aside, I still don't think that it makes sense
to make it impossible to auto-cancel antiwraparound autovacuums,
across the board, regardless of the underlying table age. We still
need something like that, but why not give a still-cancellable
autovacuum worker a chance to resolve the problem? Why take a risk of
causing much bigger problems (e.g., blocking automated DDL that blocks
simple SELECT queries) before the point where that starts to look like
the lesser risk (compared to hitting xidStopLimit)?

--
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Thu, Oct 20, 2022 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Leaving my patch series aside, I still don't think that it makes sense
> to make it impossible to auto-cancel antiwraparound autovacuums,
> across the board, regardless of the underlying table age.

One small thought on the presentation/docs side of this: maybe it
would be better to invent a new kind of autovacuum that has the same
purpose as antiwraparound autovacuum, but goes by a different name,
and doesn't have the special behavior around cancellations. We
wouldn't have to change anything about the behavior of antiwraparound
autovacuum once we reached the point of needing one.

Maybe we wouldn't even need to invent a new user-visible name for this
other kind of autovacuum. While even this so-called "new kind of
autovacuum" will be rare once my main patch series gets in, it'll
still be a totally normal occurrence. Whereas antiwraparound
autovacuums are sometimes described as an emergency mechanism.

That way we wouldn't be fighting against the widely held perception
that antiwraparound autovacuums are scary. In fact that reputation
would be fully deserved, for the first time. There are lots of
problems with the idea that antiwraparound autovacuum is kind of an
emergency thing right now, but this would make things fit the
perception, "fixing" the perception. Antiwraparound autovacuums would
become far, far rarer under this scheme, but when they did happen
they'd be clear cause for concern -- a useful signal for users, who
should ideally aim to never see *any* antiwraparound autovacuums.

-- 
Peter Geoghegan



On Fri, 2022-10-21 at 17:39 -0700, Peter Geoghegan wrote:
> One small thought on the presentation/docs side of this: maybe it
> would be better to invent a new kind of autovacuum

It's possible this would be easier for users to understand: one process
that does cleanup work over time in a way that minimizes interference;
and another process that activates in more urgent situations (perhaps
due to misconfiguration of the first process).

But we should be careful that we don't end up with more confusion. For
something like that to work, we'd probably want the second process to
not be configurable at all, and we'd want it to be issuing WARNINGs
pointing to what might be misconfigured, and otherwise just be
invisible.

>  That way we wouldn't be fighting against the widely held perception
> that antiwraparound autovacuums are scary.

There's certainly a terminology problem there. Just to brainstorm on
some new names, we might want to call it something like "xid
reclamation" or "xid horizon advancement".

When the XID space starts to run out, we can use words like "wraparound" or
"exhaustion".


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Sun, Oct 23, 2022 at 9:32 PM Jeff Davis <pgsql@j-davis.com> wrote:
> It's possible this would be easier for users to understand: one process
> that does cleanup work over time in a way that minimizes interference;
> and another process that activates in more urgent situations (perhaps
> due to misconfiguration of the first process).

I think that the new "early" version of antiwraparound autovacuum
(that can still be autocancelled) would simply be called autovacuum.
It wouldn't appear as "autovacuum to prevent wraparound" in places
like pg_stat_activity. For the most part users wouldn't have to care
about the difference between these autovacuums and traditional
non-antiwraparound autovacuums. They really would be exactly the same
thing, so it would make sense if users typically noticed no difference
whatsoever (at least in contexts like pg_stat_activity).

> But we should be careful that we don't end up with more confusion. For
> something like that to work, we'd probably want the second process to
> not be configurable at all, and we'd want it to be issuing WARNINGs
> pointing to what might be misconfigured, and otherwise just be
> invisible.

There should be some simple scheme for determining when an
antiwraparound autovacuum (non-cancellable autovacuum to advance
relfrozenxid/relminmxid) should run (applied by the autovacuum.c
scheduling logic). Something like "table has attained an age that's
now 2x autovacuum_freeze_max_age, or 1/2 of vacuum_failsafe_age,
whichever is less".
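
To make that rule concrete, here's a rough sketch in code -- purely
illustrative, with made-up names (this isn't from any patch):

/*
 * Sketch: the no-auto-cancel behavior only kicks in once table age
 * reaches 2x autovacuum_freeze_max_age or half of vacuum_failsafe_age,
 * whichever is less.  With the defaults (200 million and 1.6 billion)
 * that works out to an age of 400 million XIDs -- double the point at
 * which a still-cancellable table age autovacuum is first launched.
 */
static int
prevent_auto_cancel_age(int freeze_max_age, int failsafe_age)
{
    long long   doubled = (long long) freeze_max_age * 2;
    long long   half_failsafe = failsafe_age / 2;

    return (int) (doubled < half_failsafe ? doubled : half_failsafe);
}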

The really important thing is giving a regular/early autocancellable
autovacuum triggered by age(relfrozenxid) *some* opportunity to run. I
strongly suspect that the exact details won't matter too much,
provided we manage to launch at least one such autovacuum before
escalating to traditional antiwraparound autovacuum (which cannot be
autocancelled). Even if regular/early autovacuum had just one
opportunity to run to completion, we'd already be much better off. The
hazards from blocking automated DDL in a way that leads to a very
disruptive traffic jam (like in the Joyent Manta postmortem) would go
way down.

> >  That way we wouldn't be fighting against the widely held perception
> > that antiwraparound autovacuums are scary.
>
> There's certainly a terminology problem there. Just to brainstorm on
> some new names, we might want to call it something like "xid
> reclamation" or "xid horizon advancement".

I think that we should simply call it autovacuum. Under this scheme,
antiwraparound autovacuum would be a qualitatively different kind of
operation to users (though not to vacuumlazy.c), because it would not
be autocancellable in the standard way. And because users should take
it as a signal that things aren't really working well (otherwise we
wouldn't have reached the point of requiring a scary antiwraparound
autovacuum in the first place). Right now antiwraparound autovacuums
are both an emergency thing (or at least described as such in one or
two areas of the source code), and a completely routine occurrence.
This is deeply confusing.

Separately, I plan on breaking out insert-triggered autovacuums from
traditional dead tuple triggered autovacuums [1], which creates a need
to invent some kind of name to differentiate the new table age
triggering criteria from both insert-driven and dead tuple driven
autovacuums. These are all fundamentally the same operations with the
same urgency to users, though. We'd only need to describe the
*criterion* that *triggered* the autovacuum in our autovacuum log
report (actually we'd still report autovacuums as antiwraparound
autovacuums in cases where that still happened, which won't be
presented as just another triggering criterion in the report).

[1] https://www.postgresql.org/message-id/flat/CAH2-WznEqmkmry8feuDK8XdpH37-4anyGF7a04bWXOc1GKd0Yg@mail.gmail.com
--
Peter Geoghegan



On Mon, 2022-10-24 at 07:25 -0700, Peter Geoghegan wrote:
> The really important thing is giving a regular/early autocancellable
> autovacuum triggered by age(relfrozenxid) *some* opportunity to run.

+1. That principle seems both reasonable from a system standpoint and
understandable to a user.

> Even if regular/early autovacuum had just one
> opportunity to run to completion, we'd already be much better off.

By "opportunity", you mean that, regardless of configuration, the
cancellable autovacuum would at least start; though it still might be
cancelled by DDL. Right?

> These are all fundamentally the same operations with the
> same urgency to users, though. We'd only need to describe the
> *criteria* that *triggered* the autovacuum in our autovacuum log
> report

Hmm... I'm worried that could be a bit confusing depending on how it's
done. Let's be clear that it was merely the triggering criterion and
doesn't necessarily represent the work that is being done.

There are enough cases that it would be good to start a document and
outline the end behavior that your patch series is designed to
accomplish. In other words, a before/after of the interesting cases.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Mon, Oct 24, 2022 at 8:43 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > Even if regular/early autovacuum had just one
> > opportunity to run to completion, we'd already be much better off.
>
> By "opportunity", you mean that, regardless of configuration, the
> cancellable autovacuum would at least start; though it still might be
> cancelled by DDL. Right?

Yes, exactly.

It might be difficult as a practical matter to make sure that we
*reliably* give autovacuum.c the opportunity to launch a "standard"
autovacuum tasked with advancing relfrozenxid (just after
autovacuum_freeze_max_age is first crossed) before the point that a
scary antiwraparound autovacuum is launched. So we might end up giving
it more XID slack than it's likely to ever need (say by only launching
a traditional antiwraparound autovacuum against tables that attain an
age that is twice the value of autovacuum_freeze_max_age). These are
all just details, though -- the important principle is that we try our
utmost to give the less disruptive strategy a chance to succeed before
concluding that it has failed, and then "escalating" to a traditional
antiwraparound autovacuum.

> > These are all fundamentally the same operations with the
> > same urgency to users, though. We'd only need to describe the
> > *criteria* that *triggered* the autovacuum in our autovacuum log
> > report
>
> Hmm... I'm worried that could be a bit confusing depending on how it's
> done. Let's be clear that it was merely the triggering criteria and
> doesn't necessarily represent the work that is being done.

Maybe it could be broken out into a separate "autovacuum triggered by:
" line, seen only in the autovacuum log instrumentation (and not in
the similar report output by a manual VACUUM VERBOSE). When we still
end up "escalating" to an antiwraparound autovacuum, the "triggered
by:" line would match whatever we'd show for the benign,
still-cancellable autovacuum that must advance relfrozenxid. The
difference would be that we'd now be reporting on a different
operation entirely (not just a regular autovacuum, but a scary
antiwraparound autovacuum).

(Again, even these distinctions wouldn't be meaningful to vacuumlazy.c
itself -- it would just need to handle the details around logging in a
way that gave users the right idea. There wouldn't be any special
discrete aggressive mode of operation anymore, assuming my big patch
set gets into Postgres 16 too.)

> There are enough cases that it would be good to start a document and
> outline the end behavior that your patch series is designed to
> accomplish. In other words, a before/after of the interesting cases.

That's on my TODO list. Mostly it's an independent thing to this
(antiwraparound) autovacuum stuff, despite the fact that both projects
share the same underlying philosophy.

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Mon, Oct 24, 2022 at 9:00 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe it could be broken out into a separate "autovacuum triggered by:
> " line, seen only in the autovacuum log instrumentation (and not in
> the similar report output by a manual VACUUM VERBOSE). When we still
> end up "escalating" to an antiwraparound autovacuum, the "triggered
> by:" line would match whatever we'd show for the benign,
> still-cancellable autovacuum that must advance relfrozenxid. The
> difference would be that we'd now be reporting on a different
> operation entirely (not just a regular autovacuum, but a scary
> antiwraparound autovacuum).

Attached WIP patch invents the idea of a regular autovacuum that is
tasked with advancing relfrozenxid -- which is really just another
trigger criteria, reported on in the server log in its autovacuum
reports. Of course we retain the idea of antiwraparound autovacuums.
The only difference is that they are triggered when table age has
advanced by twice the usual amount, which is presumably only possible
because a regular autovacuum couldn't start or couldn't complete in
time (most likely due to continually being auto-cancelled).

As I said before, I think that the most important thing is to give
regular autovacuuming a chance to succeed. The exact approach taken
has a relatively large amount of slack, but that probably isn't
needed. So the new antiwraparound cutoffs were chosen because they're
easy to understand and remember, which is fairly arbitrary.

Adding this to the upcoming CF.

-- 
Peter Geoghegan

On Fri, 2022-11-25 at 14:47 -0800, Peter Geoghegan wrote:
> Attached WIP patch invents the idea of a regular autovacuum that is
> tasked with advancing relfrozenxid -- which is really just another
> trigger criteria, reported on in the server log in its autovacuum
> reports. Of course we retain the idea of antiwraparound autovacuums.
> The only difference is that they are triggered when table age has
> advanced by twice the usual amount, which is presumably only possible
> because a regular autovacuum couldn't start or couldn't complete in
> time (most likely due to continually being auto-cancelled).
> 
> As I said before, I think that the most important thing is to give
> regular autovacuuming a chance to succeed. The exact approach taken
> has a relatively large amount of slack, but that probably isn't
> needed. So the new antiwraparound cutoffs were chosen because they're
> easy to understand and remember, which is fairly arbitrary.

The target is a table that receives no DML at all, right?
I think that is a good idea.
Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"?

Yours,
Laurenz Albe



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Sat, Nov 26, 2022 at 9:58 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The target is a table that receives no DML at all, right?

Sort of, but not really. The target is a table that doesn't get
vacuumed for any other reason -- I don't make any claims beyond that
one.

It seems a little too optimistic to suppose that such a table really
didn't need any vacuuming to deal with bloat just because autovacuum.c
didn't seem to think that it did.

> I think that is a good idea.
> Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"?

That would be equivalent to what I've done here, provided you also
double the autovacuum_freeze_max_age setting. I did it this way
because I believe that it has fewer problems. The approach I took
makes the general perception that antiwraparound autovacuums are a
scary thing (really just needed for emergencies) become true, for the
first time.

We should expect to see very few antiwraparound autovacuums with the
patch, but when we do see even one it'll be after a less aggressive
approach was given the opportunity to succeed, but (for whatever
reason) failed. Just seeing any antiwraparound autovacuums will become
a useful signal of something being amiss in a way that it just isn't
at the moment. The docs can be updated to talk about antiwraparound
autovacuum as a thing that you should encounter approximately never.
This is possible even though the patch isn't invasive at all.

-- 
Peter Geoghegan



On Sat, 2022-11-26 at 11:00 -0800, Peter Geoghegan wrote:

> > I think that is a good idea.
> > Wouldn't it make sense to trigger that at *half* "autovacuum_freeze_max_age"?
> 
> That would be equivalent to what I've done here, provided you also
> double the autovacuum_freeze_max_age setting.

That's exactly what I was trying to debate.  Wouldn't it make sense to
trigger VACUUM earlier so that it has a chance of being less heavy?
On the other hand, if there are not sufficiently many modifications
on the table to trigger autovacuum, perhaps it doesn't matter in many
cases.

> I did it this way
> because I believe that it has fewer problems. The approach I took
> makes the general perception that antiwraparound autovacuum are a
> scary thing (really just needed for emergencies) become true, for the
> first time.
> 
> We should expect to see very few antiwraparound autovacuums with the
> patch, but when we do see even one it'll be after a less aggressive
> approach was given the opportunity to succeed, but (for whatever
> reason) failed.

Is that really so much less aggressive?  Will that autovacuum run want
to process all pages that are not all-frozen?  If not, it probably won't
do much good.  If yes, it will be just as heavy as an anti-wraparound
autovacuum (except that it won't block other sessions).

> Just seeing any antiwraparound autovacuums will become
> a useful signal of something being amiss in a way that it just isn't
> at the moment. The docs can be updated to talk about antiwraparound
> autovacuum as a thing that you should encounter approximately never.
> This is possible even though the patch isn't invasive at all.

True.  On the other hand, it might happen that after this, people start
worrying about normal autovacuum runs because they occasionally experience
a table age autovacuum that is much heavier than the other ones.  And
they can no longer tell the reason, because it doesn't show up anywhere.

Yours,
Laurenz Albe



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Sun, Nov 27, 2022 at 8:54 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> That's exactly what I was trying to debate.  Wouldn't it make sense to
> trigger VACUUM earlier so that it has a chance of being less heavy?
> On the other hand, if there are not sufficiently many modifications
> on the table to trigger autovacuum, perhaps it doesn't matter in many
> cases.

Maybe. There is a deeper problem here, though: table age is a really
terrible proxy for whether or not it's appropriate for VACUUM to
freeze preexisting all-visible pages. It's not obvious that half
autovacuum_freeze_max_age is much better than
autovacuum_freeze_max_age if your concern is avoiding getting too far
into debt on freezing. After all, this is debt that must be paid back
by freezing some number of physical heap pages, which in general has
approximately zero relationship with table age (we need physical units
for this, not logical units).

This is a long standing problem that I hope and expect will be fixed
in 16, by my ongoing work to completely remove the concept of
aggressive mode VACUUM:

https://commitfest.postgresql.org/40/3843/

This makes VACUUM care about both table age and the number of unfrozen
heap pages (mostly the latter). It weighs everything at the start of
each VACUUM, and decides on how it must advance relfrozenxid based on
the conditions in the table and the picture over time. Note that
performance stability is the main goal; we will not just keep
accumulating unfrozen pages for no good reason. All of the behaviors
previously associated with aggressive mode are retained, but are
individually applied on a timeline that is attuned to the needs of the
table (we can still wait for a cleanup lock, but that happens much
later than the point that the same page first becomes eligible for
freezing, not at exactly the same time).

In short, "aggressiveness" becomes a continuous thing, rather than a
discrete mode of operation, improving performance stability. We go
back to having only one kind of lazy vacuum, which is how things
worked prior to the introduction of the visibility map. (We did have
antiwraparound autovacuums in 8.3, but we did not have
aggressive/scan_all VACUUMs at the time.)

> Is that really so much less aggressive?  Will that autovacuum run want
> to process all pages that are not all-frozen?  If not, it probably won't
> do much good.  If yes, it will be just as heavy as an anti-wraparound
> autovacuum (except that it won't block other sessions).

Even if we assume that my much bigger patch set won't make it into 16,
it'll probably still be a good idea to do this in 16. I admit that I
haven't really given that question enough thought to be sure of that,
though. Naturally my goal is to get everything in. Hopefully I'll
never have to make that call.

It is definitely true that this patch is "the autovacuum side" of the
work from the other much larger patchset (which handles "the VACUUM
side" of things). This antiwraparound patch should probably be
considered in that context, even though it's theoretically independent
work. It just worked out that way.

> True.  On the other hand, it might happen that after this, people start
> worrying about normal autovacuum runs because they occasionally experience
> a table age autovacuum that is much heavier than the other ones.  And
> they can no longer tell the reason, because it doesn't show up anywhere.

But you can tell the reason, just by looking at the autovacuum log
reports. The only thing you can't do is see "(to prevent wraparound)"
in pg_stat_activity. That (and the autocancellation behavioral change)
are the only differences.

The big picture is that users really will have no good reason to care
very much about autovacuums that were triggered to advance
relfrozenxid (at least in the common case where we haven't needed to
make them antiwraparound autovacuums). They could almost (though not
quite) now be explained as "an autovacuum that takes place because
it's been a while since we did an autovacuum to deal with bloat and/or
tuple inserts". That will at least be reasonable if you assume all of
the patches get in.

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Fri, Nov 25, 2022 at 2:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached WIP patch invents the idea of a regular autovacuum that is
> tasked with advancing relfrozenxid -- which is really just another
> trigger criteria, reported on in the server log in its autovacuum
> reports.

Attached is v2, which is just to fix bitrot. Well, mostly. I did make
one functional change in v2: the autovacuum server log reports now
separately report on table XID age and table MultiXactId age, each as
its own distinct triggering condition.

I've heard informal reports that the difference between antiwraparound
autovacuums triggered by table XID age versus table MXID age can
matter a great deal. It isn't difficult to break out that detail
anyway, so even if the distinction isn't interesting all that often we
might as well surface it to users.

I still haven't made a start on the docs for this. I'm still not sure
how much work I should do on the docs in the scope of this project
versus my project that deals with related issues in VACUUM itself. The
existing material from the "Routine Vacuuming" docs has lots of
problems, and figuring out how to approach fixing those problems seems
kind of daunting right now.

--
Peter Geoghegan


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v2, which is just to fix bitrot.

Attached is v3. We no longer apply vacuum_failsafe_age when
determining the cutoff for antiwraparound autovacuuming -- the new
approach is a bit simpler.

This is a fairly small change overall. Now any "table age driven"
autovacuum will also be antiwraparound when its
relfrozenxid/relminmxid attains an age that's either double the
relevant setting (either autovacuum_freeze_max_age or
effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs --
whichever is less.

That makes it completely impossible to disable antiwraparound
protections (the special antiwrap autocancellation behavior) for
table-age-driven autovacuums once table age exceeds 1 billion
XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age
to well over a billion, of course. It just won't be possible to do
that while also avoiding the no-auto-cancellation behavior for those
autovacuums that are triggered due to table age crossing the
autovacuum_freeze_max_age/effective_multixact_freeze_max_age
threshold.
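
To sketch the arithmetic (illustrative names only, not the patch's
actual identifiers):

#define MAX_ANTIWRAP_AGE 1000000000     /* 1 billion XIDs/MXIDs */

/*
 * freeze_max_age here is the effective per-table limit: the
 * autovacuum_freeze_max_age or effective_multixact_freeze_max_age
 * setting, possibly lowered by the table's reloption.
 */
static int
antiwraparound_cutoff_age(int freeze_max_age)
{
    /* double the effective setting, but never beyond 1 billion */
    if (freeze_max_age < MAX_ANTIWRAP_AGE / 2)
        return freeze_max_age * 2;

    return MAX_ANTIWRAP_AGE;
}

With the default autovacuum_freeze_max_age of 200 million, a
still-cancellable table age autovacuum is launched at 200 million and
the antiwraparound variant only at 400 million; settings of 500 million
or more all hit the 1 billion cap.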

--
Peter Geoghegan

Hi,

On 2023-01-08 17:49:20 -0800, Peter Geoghegan wrote:
> Teach autovacuum.c to launch "table age" autovacuums at the same point
> that it previously triggered antiwraparound autovacuums.  Antiwraparound
> autovacuums are retained, but are only used as a true option of last
> resort, when regular autovacuum has presumably tried and failed to
> advance relfrozenxid (likely because the auto-cancel behavior kept
> cancelling regular autovacuums triggered based on table age).

I've also seen the inverse, with recent versions of postgres: Autovacuum can
only ever make progress if it's an anti-wraparound vacuum, because it'll
always get cancelled otherwise.  I'm worried that substantially increasing the
time until an anti-wraparound autovacuum happens will lead to more users
running into out-of-xid shutdowns.

I don't think it's safe to just increase the time at which anti-wrap vacuums
happen to a hardcoded 1 billion.

I'm also doubtful that it's ok to just make all autovacuums on relations with
an age > 1 billion anti-wraparound ones. For people that use a large
autovacuum_freeze_max_age that will be a rude awakening.


I am all in favor of adding logic to trigger autovacuum based on the table
age, without needing to reach autovacuum_freeze_max_age. It never made sense
to me that we get to the "emergency mode" in entirely normal operation. But
I'm not in favor of just entirely reinterpreting existing GUCs and adding
important thresholds as hardcoded numbers.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Mon, Jan 9, 2023 at 5:22 PM Andres Freund <andres@anarazel.de> wrote:
> I've also seen the inverse, with recent versions of postgres: Autovacuum can
> only ever make progress if it's an anti-wraparound vacuum, because it'll
> always get cancelled otherwise.  I'm worried that substantially increasing the
> time until an anti-wraparound autovacuum happens will lead to more users
> running into out-of-xid shutdowns.
>
> I don't think it's safe to just increase the time at which anti-wrap vacuums
> happen to a hardcoded 1 billion.

That's not what the patch does. It doubles the time that the anti-wrap
no-autocancellation behaviors kick in, up to a maximum of 1 billion
XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to
autovacuum_freeze_max_age x 2, without changing the basic fact that we
initially launch autovacuums that advance relfrozenxid/relminmxid when
the autovacuum_freeze_max_age threshold is first crossed.

These heuristics are totally negotiable -- and likely should be
thought out in more detail. It's likely that most of the benefit of
the patch comes from simply trying to advance relfrozenxid without the
special auto-cancellation behavior one single time. The main problem
right now is that the threshold that launches most antiwraparound
autovacuums is exactly the same as the threshold that activates the
auto-cancellation protections. Even doing the latter very slightly
later than the former could easily make things much better, while
adding essentially no risk of the kind you're concerned about.

> I'm also doubtful that it's ok to just make all autovacuums on relations with
> an age > 1 billion anti-wraparound ones. For people that use a large
> autovacuum_freeze_max_age that will be a rude awakening.

Actually, users that have autovacuum_freeze_max_age set to over 1
billion will get exactly the same behavior as before (except that the
instrumentation of autovacuum will be better). It'll be identical.

If you set autovacuum_freeze_max_age to 2 billion, and a "standard"
autovacuum is launched on a table whose relfrozenxid age is 1.5
billion, it'll just be a regular dead tuples/inserted tuples
autovacuum, with the same old familiar locking characteristics as
today.

-- 
Peter Geoghegan



On Mon, Jan 9, 2023 at 8:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> That's not what the patch does. It doubles the time that the anti-wrap
> no-autocancellation behaviors kick in, up to a maximum of 1 billion
> XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to
> autovacuum_freeze_max_age x 2, without changing the basic fact that we
> initially launch autovacuums that advance relfrozenxid/relminmxid when
> the autovacuum_freeze_max_age threshold is first crossed.

I'm skeptical about this kind of approach.

I do agree that it's good to slowly increase the aggressiveness of
VACUUM as we get further behind, rather than having big behavior
changes all at once, but I think that should happen by smoothly
varying various parameters rather than by making discrete behavior
changes at a whole bunch of different times. For instance, when VACUUM
goes into emergency mode, it stops respecting the vacuum delay. I
think that's great, but it happens all at once, and maybe it would be
better if it didn't. We could consider gradually ramping the vacuum
delay from 100% down to 0% instead of having it happen all at once.
Maybe that's not the right idea, I don't know, and a naive
implementation might be worse than nothing, but I think it has some
chance of being worth consideration.
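
Just to illustrate the shape of what I mean, something like this
(entirely hypothetical -- nothing like it exists today, and the names
are invented):

/*
 * Scale the configured cost delay down linearly as table age moves
 * from autovacuum_freeze_max_age toward the failsafe threshold,
 * instead of dropping it to zero all at once when the failsafe kicks
 * in.
 */
static double
ramped_cost_delay(double base_delay_ms, long long table_age,
                  long long freeze_max_age, long long failsafe_age)
{
    double      behind;

    if (table_age <= freeze_max_age)
        return base_delay_ms;           /* not behind yet: full delay */
    if (table_age >= failsafe_age)
        return 0.0;                     /* emergency: no throttling */

    /* linear interpolation between the two thresholds */
    behind = (double) (table_age - freeze_max_age) /
        (double) (failsafe_age - freeze_max_age);

    return base_delay_ms * (1.0 - behind);
}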

But what the kind of change you're proposing here does is create
another threshold where the behavior changes suddenly, and I think
that's challenging from the point of view of understanding the
behavior of the system. The behavior already changes when you hit
vacuum_freeze_min_age and then again when you hit
vacuum_freeze_table_age and then there's also
autovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few
others, and these settings all interact in pretty complex ways. The
more conditional logic we add to that, the harder it becomes to
understand what's actually happening. You see a system where
age(relfrozenxid) = 673m and you need a calculator and a spreadsheet
to figure out what the vacuum behavior is at that point. Honestly, I
think we already have a problem with the behaviors here being too
complex for normal human beings to understand them, and I think that
the kinds of changes you are proposing here could make that quite a
bit worse.

Now, you might reply to the above by saying, well, some behaviors
can't vary continuously. vacuum_cost_limit can perhaps be phased out
gradually, but autocancellation seems like something that you must
either do, or not do. I would agree with that. But what I'm saying is
that we ought to favor having those kinds of behaviors all engage at
the same point rather than at different times. I'm not saying that
there can't ever be good reasons to separate out different behaviors
and have them engage at different times, but I think we will end up
better off if we minimize that sort of thing as much as we reasonably
can. In your opening email you write "Why should the
PROC_VACUUM_FOR_WRAPAROUND behavior happen on *exactly* the same
timeline as the one used to launch an antiwraparound autovacuum,
though?" and my answer is "because that's easier to understand and I
don't see that it has much of a downside."

I did take a look at the post-mortem to which you linked, but I am not
quite sure how that bears on the behavior change under discussion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I do agree that it's good to slowly increase the aggressiveness of
> VACUUM as we get further behind, rather than having big behavior
> changes all at once, but I think that should happen by smoothly
> varying various parameters rather than by making discrete behavior
> changes at a whole bunch of different times.

In general I tend to agree, but, as you go on to acknowledge yourself,
this particular behavior is inherently discrete. Either the
PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't.

In many important cases the only kind of autovacuum that ever runs
against a certain big table is antiwraparound autovacuum. And
therefore every autovacuum that runs against the table must
necessarily not be auto cancellable. These are the cases where we see
disastrous interactions with automated DDL, such as a TRUNCATE run by
a cron job (to stop those annoying antiwraparound autovacuums) -- a
heavyweight lock traffic jam that causes the application to lock up.

All that I really want to do here is give an autovacuum that *can* be
auto cancelled *some* non-zero chance to succeed with these kinds of
tables. TRUNCATE completes immediately, so the AEL is no big deal.
Except when it's blocked behind an antiwraparound autovacuum. That
kind of interaction is occasionally just disastrous. Even just the
tiniest bit of wiggle room could avoid it in most cases, possibly even
almost all cases.

> Maybe that's not the right idea, I don't know, and a naive
> implementation might be worse than nothing, but I think it has some
> chance of being worth consideration.

It's a question of priorities. The failsafe isn't supposed to be used
(when it is, that's a kind of failure), and so presumably only kicks
in on very rare occasions, where nobody was paying attention anyway.
So far I've heard no complaints about this, but I've heard lots of
complaints about the antiwrap autocancellation behavior.

> The behavior already changes when you hit
> vacuum_freeze_min_age and then again when you hit
> vacuum_freeze_table_age and then there's also
> autoovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few
> others, and these setting all interact in pretty complex ways. The
> more conditional logic we add to that, the harder it becomes to
> understand what's actually happening.

In general I strongly agree. In fact that's a big part of what
motivates my ongoing work on VACUUM. The user experience is central.

As Andres pointed out, presenting antiwraparound autovacuums as kind
of an emergency thing but also somehow a routine thing is just
horribly confusing. I want to make them into an emergency thing in
every sense -- something that you as a user can reasonably expect to
never see (like the failsafe). But if you do see one, then that's a
useful signal of an underlying problem with contention, say from
automated DDL that pathologically cancels autovacuums again and again.

> Now, you might reply to the above by saying, well, some behaviors
> can't vary continuously. vacuum_cost_limit can perhaps be phased out
> gradually, but autocancellation seems like something that you must
> either do, or not do. I would agree with that. But what I'm saying is
> that we ought to favor having those kinds of behaviors all engage at
> the same point rather than at different times.

Right now aggressive VACUUMs do just about all freezing at the same
time, to the extent that many users seem to think that it's a totally
different thing with totally different responsibilities to regular
VACUUM. Doing everything at the same time like that causes huge
practical problems, and is very confusing.

I think that users will really appreciate having only one kind of
VACUUM/autovacuum (since the other patch gets rid of discrete
aggressive mode VACUUMs). I want "table age autovacuuming" (as I
propose to call it) to come to be seen as not any different to any other
autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
autovacuum. The difference is only in how autovacuum.c triggers the
VACUUM, not in any runtime behavior. That's an important goal here.

> I did take a look at the post-mortem to which you linked, but I am not
> quite sure how that bears on the behavior change under discussion.

The post-mortem involved a single "DROP TRIGGER" that caused chaos
when it interacted with the auto cancellation behavior. It would
usually complete instantly, so the AEL wasn't actually disruptive,
but one day antiwraparound autovacuum made the cron job effectively
block all reads and writes for hours.

The similar outages I was called in to help with personally had either
an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum
only been willing to yield once or twice, then it probably would have
been fine -- the situation probably would have worked itself out
naturally. That's the best outcome you can hope for.

--
Peter Geoghegan



On Thu, Jan 12, 2023 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> All that I really want to do here is give an autovacuum that *can* be
> auto cancelled *some* non-zero chance to succeed with these kinds of
> tables. TRUNCATE completes immediately, so the AEL is no big deal.
> Except when it's blocked behind an antiwraparound autovacuum. That
> kind of interaction is occasionally just disastrous. Even just the
> tiniest bit of wiggle room could avoid it in most cases, possibly even
> almost all cases.

I doubt it. Wiggle room that's based on the XID threshold being
different for one behavior vs. another can easily fail to produce any
benefit, because there's no guarantee that the autovacuum launcher
will ever try to launch a worker against that table while the XID is
in the range where you'd get one behavior and not the other.  I've
long thought that the fact that vacuum_freeze_table_age is documented
as capped at 0.95 * autovacuum_freeze_max_age is silly for just this
reason. The interval that you're proposing is much wider so the
chances of getting a benefit are greater, but supposing that it's
going to solve it in most cases seems like an exercise in unwarranted
optimism.

In fact, I would guess that in fact it will very rarely solve the
problem. Normally, the XID age of a table never reaches
autovacuum_freeze_max_age in the first place. If it does, there's some
reason. Maybe there's a really old open transaction or an abandoned
replication slot or an unresolved 2PC transaction. Maybe the
autovacuum system is overloaded and no table is getting visited
regularly because the system just can't keep up. Or maybe there are
regular AELs being taken on the table at issue. If there's only an AEL
taken against a table once in a blue moon, some autovacuum attempt ought
to succeed before we reach autovacuum_freeze_max_age. Flipping that
around, if we reach autovacuum_freeze_max_age without advancing
relfrozenxid, and an AEL shows up behind us in the lock queue, it's
really likely that the reason *why* we've reached
autovacuum_freeze_max_age is that this same thing has happened to
every previous autovacuum attempt and they all cancelled themselves.
If we cancel ourselves too, we're just postponing resolution of the
problem to some future point when we decide to stop cancelling
ourselves. That's not a win.

> I think that users will really appreciate having only one kind of
> VACUUM/autovacuum (since the other patch gets rid of discrete
> aggressive mode VACUUMs). I want "table age autovacuuming" (as I
> propose to call it) to come to be seen as not any different to any other
> autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
> autovacuum. The difference is only in how autovacuum.c triggers the
> VACUUM, not in any runtime behavior. That's an important goal here.

I don't agree with that goal. I think that having different kinds of
autovacuums with different, identifiable names and corresponding,
easily-identifiable behaviors is really important for troubleshooting.
Trying to remove those distinctions and make everything look the same
will not keep autovacuum from getting itself into trouble. It will
just make it harder to understand what's happening when it does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From: Peter Geoghegan
On Thu, Jan 12, 2023 at 1:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I doubt it. Wiggle room that's based on the XID threshold being
> different for one behavior vs. another can easily fail to produce any
> benefit, because there's no guarantee that the autovacuum launcher
> will ever try to launch a worker against that table while the XID is
> in the range where you'd get one behavior and not the other.

Of course it's true that in general it might not succeed in
forestalling the auto cancellation behavior. You can say something
similar about approximately anything like this. For example, there is
no absolute guarantee that any autovacuum will ever complete. But we
still try!

> I've long thought that the fact that vacuum_freeze_table_age is documented
> as capped at 0.95 * autovacuum_freeze_max_age is silly for just this
> reason. The interval that you're proposing is much wider so the
> chances of getting a benefit are greater, but supposing that it's
> going to solve it in most cases seems like an exercise in unwarranted
> optimism.

I don't claim to be dealing in certainties, especially about the final
outcome. Whether or not you accept my precise claim is perhaps not
important, in the end. What is important is that we give things a
chance to succeed, based on the information that we have available,
with a constant eye towards avoiding disaster scenarios.

Some of the problems with VACUUM seem to be cases where VACUUM takes
on a potentially ruinous obligation, that it cannot possibly meet in
some rare cases that do come up sometimes -- like the cleanup lock
behavior. Is a check for $1000 written by me really worth less than a
check written by me for a billion dollars? They're both nominally
equivalent guarantees about an outcome, after all, though one has a
far greater monetary value. Which would you value more, subjectively?

Nothing is guaranteed -- even (and perhaps especially) strong guarantees.

> In fact, I would guess that in fact it will very rarely solve the
> problem. Normally, the XID age of a table never reaches
> autovacuum_freeze_max_age in the first place. If it does, there's some
> reason.

Probably, but none of this matters at all if the table age never
reaches autovacuum_freeze_max_age in the first place. We're only
talking about tables where that isn't the case, by definition.
Everything else is out of scope here.

> Maybe there's a really old open transaction or an abandoned
> replication slot or an unresolved 2PC transaction. Maybe the
> autovacuum system is overloaded and no table is getting visited
> regularly because the system just can't keep up. Or maybe there are
> regular AELs being taken on the table at issue.

Maybe an asteroid hits the datacenter, making all of these
considerations irrelevant. But perhaps it won't!

> If there's only an AEL
> taken against a table once in blue moon, some autovacuum attempt ought
> to succeed before we reach autovacuum_freeze_max_age. Flipping that
> around, if we reach autovacuum_freeze_max_age without advancing
> relfrozenxid, and an AEL shows up behind us in the lock queue, it's
> really likely that the reason *why* we've reached
> autovacuum_freeze_max_age is that this same thing has happened to
> every previous autovacuum attempt and they all cancelled themselves.

Why do you assume that a previous autovacuum ever got launched in the
first place? There is always going to be a certain kind of table that
can only get an autovacuum when its table age crosses
autovacuum_freeze_max_age. And it's not just static tables -- there is
very good reason to have doubts about the statistics that drive
autovacuum. Plus vacuum_freeze_table_age works very unreliably (which
is why my big VACUUM patch more or less relegates it to a
compatibility option, while retaining a more sophisticated notion of
table age creating pressure to advance relfrozenxid).

Under the scheme from this autovacuum patch, it really does become
reasonable to make a working assumption that there was a previous
autovacuum, that failed (likely due to the autocancellation behavior,
as you said). We must have tried and failed in an earlier autovacuum,
once we reach the point of needing an antiwraparound autovacuum
(meaning a table age autovacuum which cannot be autocancelled) --
which is not the case today at all. If nothing else, table age
autovacuums will have been scheduled much earlier on -- they will have
at least started up, barring pathological cases.

That's a huge difference in the strength of the signal, compared to today.

The super aggressive autocancellation behavior is actually
proportionate to the problem at hand. Kind of like how if you go to
the doctor and tell them you have a headache, they don't schedule you
for emergency brain surgery. What they do is tell you to take an
aspirin, and make sure that you stay well hydrated -- if the problem
doesn't go away after a few days, then call back, reassess. Perhaps it
really will be a brain tumor, but there is nothing to gain and
everything to lose by taking such drastic action at the first sign of
trouble.

> If we cancel ourselves too, we're just postponing resolution of the
> problem to some future point when we decide to stop cancelling
> ourselves. That's not a win.

It's also only a very minor loss, relative to what would have happened
without any of this. This is something that we can be relatively sure
of (unlike anything about final outcomes). It's clear that we have a
lot to gain. What do we have to lose, really?

> > I think that users will really appreciate having only one kind of
> > VACUUM/autovacuum (since the other patch gets rid of discrete
> > aggressive mode VACUUMs). I want "table age autovacuuming" (as I
> > propose to call it) to come to be seen as not any different to any other
> > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
> > autovacuum. The difference is only in how autovacuum.c triggers the
> > VACUUM, not in any runtime behavior. That's an important goal here.
>
> I don't agree with that goal. I think that having different kinds of
> autovacuums with different, identifiable names and corresponding,
> easily-identifiable behaviors is really important for troubleshooting.

You need to distinguish between different types of autovacuums and
different types of VACUUMs here. Sure, it's valuable to have
information about why autovacuum launched a VACUUM, and the patch
greatly improves that. But runtime behavior is another story.

It's not really generic behavior -- more like generic policies that
produce different behavior under different runtime conditions. VACUUM
has always had generic policies about how to do things, at least up
until the introduction of the visibility map, which added
scan_all/aggressive VACUUMs, and the vacuum_freeze_table_age GUC. The
policy should be the same in every VACUUM, which the behavior itself
emerges from.

> Trying to remove those distinctions and make everything look the same
> will not keep autovacuum from getting itself into trouble. It will
> just make it harder to understand what's happening when it does.

The point isn't to have every VACUUM behave in the same way. The point
is to make decisions dynamically, based on the observed conditions in
the table. And to delay committing to things until there really is no
alternative, to maximize our opportunities to avoid disaster. In
short: loose, springy behavior.

Imposing absolute obligations on VACUUM has the potential to create
lots of problems. It is sometimes necessary, but can easily be
overused, making a bad situation much worse.

-- 
Peter Geoghegan



Hi,

On 2023-01-12 16:08:06 -0500, Robert Haas wrote:
> Normally, the XID age of a table never reaches autovacuum_freeze_max_age in
> the first place.

That's not at all my experience. I often see it being the primary reason for
autovacuum vacuuming large tables on busy OLTP systems. Even without any
longrunning transactions or such, with available autovac workers and without
earlier autovacuums getting interrupted by locks.  Once a table is large,
reasonable scale factors require a lot of changes to accumulate to trigger an
autovacuum, and during a vacuum a lot of transactions complete, leading to
large tables having a significant age by the time autovac finishes.

The most common "bad" reason for reaching autovacuum_freeze_max_age that I see
is cost limits not allowing vacuum to complete on time.


Perhaps we should track how often autovacuum was triggered by what reason in
a relation's pgstats? That'd make it a lot easier to collect data, both for
tuning the thresholds on a system, and for hacking on postgres.

Tracking the number of times autovacuum was interrupted due to a lock request
might be a good idea as well?


I think it'd be a good idea to split off the part of the patch that introduces
AutoVacType / adds logging for what triggered. That's independently useful,
likely uncontroversial and makes the remaining patch smaller.

I'd also add the trigger to the pg_stat_activity entry for the autovac
worker. Additionally I think we should add information about using failsafe
mode to the p_s_a entry.


I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
times, FWIW. And I've seen plenty of places write their own userspace version
of it, because without it they run into trouble.  However, I don't like that
the patch infers the various thresholds using magic constants / multipliers.


autovacuum_freeze_max_age is really a fairly random collection of things:
1) triggers autovacuum on tables based on age, in addition to the dead tuple /
   inserted tuples triggers
2) prevents auto-cancellation of autovacuum
3) starts autovacuum, even if autovacuum is disabled

IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2)
to make progress is a bit more concerning. 3) should rarely be needed, but is
a good safety mechanism.

I doubt that controlling all of them via one GUC is sensible.


If I understand the patch correctly, we now have the following age based
thresholds for av:

- force-enable autovacuum:
  oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
- autovacuum based on age:
  freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
    tableagevac = relfrozenxid < recentXid - freeze_max_age
- prevent auto-cancellation:
  freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
  prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
  prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age

Is that right?


One thing I just noticed: Isn't it completely bonkers that we compute
recentXid/recentMulti once at the start of a worker in
relation_needs_vacanalyze()?  That's fine for the calls in do_autovacuum()'s
initial loops over all tables. But seems completely wrong for the later calls
via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() ->
relation_needs_vacanalyze()?

These variables really shouldn't be globals. It makes sense to cache them
locally in do_autovacuum(), but reusing them in
recheck_relation_needs_vacanalyze() and sharing them between launcher and
worker is bad.
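
Just to illustrate (a sketch, not a patch): the recheck path could simply
refresh them where they are used, e.g.

    /*
     * Sketch only: refresh the horizons at recheck time, rather than
     * reusing the values captured when the worker started.  Passing
     * them down as arguments would be cleaner still than globals.
     */
    recentXid = ReadNextTransactionId();
    recentMulti = ReadNextMultiXactId();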

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-12 16:08:06 -0500, Robert Haas wrote:
> > Normally, the XID age of a table never reaches autovacuum_freeze_max_age in
> > the first place.
>
> That's not at all my experience. I often see it being the primary reason for
> autovacuum vacuuming large tables on busy OLTP systems. Even without any
> longrunning transactions or such, with available autovac workers and without
> earlier autovacuums getting interrupted by locks.  Once a table is large,
> reasonable scale factors require a lot of changes to accumulate to trigger an
> autovacuum, and during a vacuum a lot of transactions complete, leading to
> large tables having a significant age by the time autovac finishes.

I've definitely seen this. I've also noticed that TPC-C's stock and
customer tables (both of which are subject to many HOT updates) only
ever receive antiwraparound autovacuums, even with my very aggressive
autovacuum settings.

Overall, I think that it's quite common among the largest tables, even
when things are running normally.

> The most common "bad" reason for reaching autovacuum_freeze_max_age that I see
> is cost limits not allowing vacuum to complete on time.

There are many problems with the statistics driving this whole
process, which I won't rehash right now. I actually think that the
whole idea of relying on statistical sampling for dead tuples is
fundamentally just bogus (that's not how statistics work in general),
though I have quite a few less fundamental and more concrete
complaints about the statistics just being wrong on their own terms.

> Perhaps we should track how often autovacuum was triggered by what reason in
> a relation's pgstats? That'd make it a lot easier to collect data, both for
> tuning the thresholds on a system, and for hacking on postgres.

That would definitely be useful.

> Tracking the number of times autovacuum was interrupted due to a lock request
> might be a good idea as well?

Also useful information worth having.

> I think it'd be a good idea to split off the part of the patch that introduces
> AutoVacType / adds logging for what triggered. That's independently useful,
> likely uncontroversial and makes the remaining patch smaller.

I like that idea.

Attached revision v4 breaks things up mechanically, along those lines
(no real changes to the code itself, though). The controversial parts
of the patch are indeed a fairly small proportion of the total
changes.

> I'd also add the trigger to the pg_stat_activity entry for the autovac
> worker. Additionally I think we should add information about using failsafe
> mode to the p_s_a entry.

I agree that that's all useful, but it seems like it can be treated as
later work.

> I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
> times, FWIW. And I've seen plenty of places write their own userspace version
> of it, because without it they run into trouble.  However, I don't like that
> the patch infers the various thresholds using magic constants / multipliers.

As I said, these details are totally negotiable, and likely could be a
lot better without much effort.

What are your concerns about the thresholds? For example, is it that
you can't configure the behavior directly at all? Something else?

> autovacuum_freeze_max_age is really a fairly random collection of things:
> 1) triggers autovacuum on tables based on age, in addition to the dead tuple /
>    inserted tuples triggers
> 2) prevents auto-cancellation of autovacuum
> 3) starts autovacuum, even if autovacuum is disabled
>
> IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2)
> to make progress is a bit more concerning. 3) should rarely be needed, but is
> a good safety mechanism.
>
> I doubt that controlling all of them via one GUC is sensible.

I agree, of course, but just to be clear: I don't think it matters
that we couple together 1 and 3. In fact it's good that we do that,
because the point to the user is that they cannot disable table-age
(i.e. what we currently call antiwraparound) autovacuums -- that just
makes sense.

The only problem that I see is that item 2 is tied to the other items
from your list.

> If I understand the patch correctly, we now have the following age based
> thresholds for av:
>
> - force-enable autovacuum:
>   oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
> - autovacuum based on age:
>   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
>     tableagevac = relfrozenxid < recentXid - freeze_max_age
> - prevent auto-cancellation:
>   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
>   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
>   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
>
> Is that right?

That summary looks accurate, but I'm a bit confused about why you're
asking the question this way. I thought that it was obvious that the
patch doesn't change most of these things.

The only mechanism that the patch changes is related to "prevent
auto-cancellation" behaviors -- which is now what the term
"antiwraparound" refers to. It does change the name of "autovacuum
based on age", though -- the name is now "table age autovacuum" (the
old name was antiwraparound autovacuum, of course). As I pointed out
to you already, it's mechanically impossible for any autovacuum to be
antiwraparound unless it's an XID table age/MXID table age autovacuum.

The naming convention I propose here makes it a little confusing for
us to discuss, but it seems like the best thing for users. Users'
basic intuitions about antiwraparound autovacuums (that they're scary
things needed because wraparound is starting to become a real concern)
don't need to change. If anything they become more accurate, because
antiwraparound autovacuums become non-routine -- which is really how
it should have been when autovacuum was first added IMV. Users have
rather good reasons to find antiwraparound autovacuums scary, even
though that's kind of wrong (it's really our fault for making it so
confusing for them, not their fault for being wrong).

> One thing I just noticed: Isn't it completely bonkers that we compute
> recentXid/recentMulti once at the start of a worker in
> relation_needs_vacanalyze()?  That's fine for the calls in do_autovacuum()'s
> initial loops over all tables. But seems completely wrong for the later calls
> via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() ->
> relation_needs_vacanalyze()?
>
> These variables really shouldn't be globals. It makes sense to cache them
> locally in do_autovacuum(), but reusing them in
> recheck_relation_needs_vacanalyze() and sharing them between launcher and
> worker is bad.

I am not sure. I do hope that there isn't some subtle way in which the
design relies on that. It seems obviously weird, and so I have to
wonder if there is a reason behind it that isn't immediately apparent.

--
Peter Geoghegan

Hi,

On 2023-01-13 16:13:45 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 2:00 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd be a good idea to split off the part of the patch that introduces
> > AutoVacType / adds logging for what triggered. That's independently useful,
> > likely uncontroversial and makes the remaining patch smaller.
>
> I like that idea.

Cool.


> Attached revision v4 breaks things up mechanically, along those lines
> (no real changes to the code itself, though). The controversial parts
> of the patch are indeed a fairly small proportion of the total
> changes.

I don't think the split is right. There's too much in 0001 - it's basically
introducing the terminology of 0002 already. Could you make it a much more
minimal change?


> > I'd also add the trigger to the pg_stat_activity entry for the autovac
> > worker. Additionally I think we should add information about using failsafe
> > mode to the p_s_a entry.
>
> I agree that that's all useful, but it seems like it can be treated as
> later work.

IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't
let you distinguish most of them from a plain autovacuum.

Seems pretty easy to display the trigger from 0001 in
autovac_report_activity()? You'd have to move the AutoVacType -> translated
string mapping into a separate function. That seems like a good idea anyway,
the current coding makes translators translate several largely identical
strings that just differ in one part.
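
Something along these lines, I mean (the enum values below are made-up
placeholders for whatever 0001 ends up calling them):

    /*
     * Rough sketch -- the AutoVacType values are invented, since 0001
     * isn't settled.  The point is a single trigger -> string mapping,
     * so the surrounding messages only get translated once.
     */
    static const char *
    autovac_trigger_msg(AutoVacType trigger)
    {
        switch (trigger)
        {
            case AUTOVAC_TRIGGER_TABLE_AGE:
                return _("table age");
            case AUTOVAC_TRIGGER_DEAD_TUPLES:
                return _("dead tuples");
            case AUTOVAC_TRIGGER_INSERTED_TUPLES:
                return _("inserted tuples");
        }

        return "???";               /* keep compiler quiet */
    }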



> > I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
> > times, FWIW. And I've seen plenty of places write their own userspace version
> > of it, because without it they run into trouble.  However, I don't like that
> > the patch infers the various thresholds using magic constants / multipliers.
>
> As I said, these details are totally negotiable, and likely could be a
> lot better without much effort.
>
> What are your concerns about the thresholds? For example, is it that
> you can't configure the behavior directly at all? Something else?

The above, but mainly that

> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> >   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> >   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age

seems quite confusing / non-principled. What's the logic behind the auto
cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is
1 billion, the cutoff is set to 1 billion? That just makes no sense to me.


Maybe I'm partially confused by the Min(freeze_max_age * 2, 1 billion). As you
pointed out in [1], that doesn't actually lower the threshold for "table age"
vacuums, because we don't even get to that portion of the code if we didn't
already cross freeze_max_age.

    if (freeze_max_age < ANTIWRAPAROUND_MAX_AGE)
            freeze_max_age *= 2;
    freeze_max_age = Min(freeze_max_age, ANTIWRAPAROUND_MAX_AGE);

You're lowering a large freeze_max_age to ANTIWRAPAROUND_MAX_AGE - but it only
happens after all other checks of freeze_max_age, so it won't influence
those. That's confusing code.

I think this'd be a lot more readable if you introduced a separate variable
for the "no-cancel" threshold, rather than overloading freeze_max_age with
different meanings. And you should remove the confusing "lowering" of the
cutoff.  Maybe something like

    no_cancel_age = freeze_max_age;
    if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE)
    {
        /* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */
        no_cancel_age = Min((uint32)ANTIWRAPAROUND_MAX_AGE, (uint32)no_cancel_age * 2);
    }

The uint32 bit isn't needed with ANTIWRAPAROUND_MAX_AGE at 1 billion, but at
1.2 billion it would be needed, so it seems better to have it.


That still doesn't explain why the cancel_age = freeze_max_age * 2
behaviour should be clamped at 1 billion, though.



> > If I understand the patch correctly, we now have the following age based
> > thresholds for av:
> >
> > - force-enable autovacuum:
> >   oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
> > - autovacuum based on age:
> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> >     tableagevac = relfrozenxid < recentXid - freeze_max_age
> > - prevent auto-cancellation:
> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> >   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> >   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
> >
> > Is that right?
>
> That summary looks accurate, but I'm a bit confused about why you're
> asking the question this way. I thought that it was obvious that the
> patch doesn't change most of these things.

For me it was helpful to clearly list the triggers when thinking about the
issue. I found the diff hard to read and, as noted above, the logic for the
auto cancel threshold quite confusing, so ...


> The only mechanism that the patch changes is related to "prevent
> auto-cancellation" behaviors -- which is now what the term
> "antiwraparound" refers to.

Not sure that redefining what a long-standing name refers to is helpful. It
might be best to retire it and come up with new names.


> It does change the name of "autovacuum based on age", though -- the name is
> now "table age autovacuum" (the old name was antiwraparound autovacuum, of
> course). As I pointed out to you already, it's mechanically impossible for
> any autovacuum to be antiwraparound unless it's an XID table age/MXID table
> age autovacuum.

Thinking about it a bit more, my problems with the current anti-wraparound
logic boil down to a few different aspects:

1) It regularly scares the crap out of users, even though it's normal.  This
   is further confounded by failsafe autovacuums, where a scared reaction is
   appropriate, not being visible in pg_stat_activity.

   I suspect that learning that "vacuum to prevent wraparound" isn't a
   problem, contributes to people later ignoring "must be vacuumed within ..."
   WARNINGS, which I've seen plenty of times.

2) It makes it hard to DROP, TRUNCATE, VACUUM FULL or even manually VACUUM
   tables being anti-wraparound vacuumed, even though those manual actions will
   often resolve the issue much more quickly.

3) Autovacuums triggered by tuple thresholds persistently getting cancelled
   also regularly causes outages, and they make it more likely that an
   eventual age-based vacuum will take forever.


Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
into an age and anti-cancel triggers. And could be improved by reporting
failsafe autovacuums in pg_stat_activity.

Perhaps 2) could be improved a bit by emitting a WARNING message when we
didn't cancel AV because it was anti-wraparound?  But eventually I think we
somehow need to signal the "intent" of the lock drop down into ProcSleep() or
wherever it'd be.

I have two ideas around 3):

First, we could introduce thresholds for the tuple thresholds, after which
autovacuum isn't cancellable anymore.

Second, we could track the number of cancellations since the last [auto]vacuum
in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been
cancelled a number of times.
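
In pseudo-code it'd be roughly this (every name here is invented, nothing
like it exists yet):

    /*
     * Sketch only, with invented names: pgstat remembers how often
     * autovacuum on this table was cancelled since it last completed,
     * and we only disable auto-cancellation once that count is high
     * enough, instead of keying it off table age alone.
     */
    if (pgstat_fetch_autovac_cancel_count(relid) >= autovacuum_cancel_threshold)
        prevent_auto_cancel = true;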



Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 6:09 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think the split is right. There's too much in 0001 - it's basically
> introducing the terminology of 0002 already. Could you make it a much more
> minimal change?

Okay.

I thought that you might say that. I really just wanted to show you
how small the code footprint was for the more controversial part.

> IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't
> let you distinguish most of them from a plain autovacuum.
>
> Seems pretty easy to display the trigger from 0001 in
> autovac_report_activity()? You'd have to move the AutoVacType -> translated
> string mapping into a separate function. That seems like a good idea anyway,
> the current coding makes translators translate several largely identical
> strings that just differ in one part.

I'll look into it. It's on my TODO list now.

> seems quite confusing / non-principled. What's the logic behind the auto
> cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is
> 1 billion, the cutoff is set to 1 billion? That just makes no sense to me.

Why is the default for autovacuum_freeze_max_age 200 million? I think
that it's because 200 million is 10% of 2 billion. And so it is for
this cap. 1 billion is 50% of 2 billion. It's just numerology. It
doesn't make sense to me either.

Of course it's not *completely* arbitrary. Obviously the final auto
cancel threshold needs to be at least somewhat greater than
freeze_max_age for any of this to make any sense at all. And, we
should ideally have a comfortable amount of slack to work with, so
that things like moderate (not pathological) autovacuum worker
starvation isn't likely to defeat our attempts at avoiding the
no-auto-cancel behavior for no good reason. Finally, once we get to a
certain table age it starts to seem like a bad idea to not be
conservative about auto cancellations.

I believe that we agree on most things when it comes to VACUUM, but
I'm pretty sure that we still disagree about the value in setting
autovacuum_freeze_max_age very high. I continue to believe that
setting it over a billion or so is just a bad idea. I'm mentioning
this only because it might give you some idea of where I'm coming from
-- in general I believe that age-based settings often have only very
weak relationships with what actually matters.

It might make sense to always give a small fixed amount of headroom
when autovacuum_freeze_max_age is set to very high values. Maybe just
5 million XIDs/MXIDs. That would probably be just as effective as
(say) 500 million in almost all cases. But then you have to accept
another magic number.

> I think this'd be a lot more readable if you introduced a separate variable
> for the "no-cancel" threshold, rather than overloading freeze_max_age with
> different meanings. And you should remove the confusing "lowering" of the
> cutoff.  Maybe something like
>
>     no_cancel_age = freeze_max_age;
>     if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE)
>     {
>         /* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */
>         no_cancel_age = Min((uint32)ANTIWRAPAROUND_MAX_AGE, (uint32)no_cancel_age * 2);
>     }

I'm happy to do it that way, but let's decide what the algorithm
itself should be first. Or let's explore it a bit more, at least.

> > The only mechanism that the patch changes is related to "prevent
> > auto-cancellation" behaviors -- which is now what the term
> > "antiwraparound" refers to.
>
> Not sure that redefining what a long-standing name refers to is helpful. It
> might be best to retire it and come up with new names.

I generally try to avoid bike-shedding, and naming things is the
ultimate source of bike shedding. I dread having to update the docs
for this stuff, too. The docs in this area (particularing "Routine
Vacuuming") are such a mess already. But perhaps you're right.

> 1) It regularly scares the crap out of users, even though it's normal.  This
>    is further confounded by failsafe autovacuums, where a scared reaction is
>    appropriate, not being visible in pg_stat_activity.

The docs actually imply that when the system reaches the point of
entering xidStopLimit mode, you might get data corruption. Of course
that's not true (the entire point of xidStopLimit is to avoid that),
but apparently we like to keep users on their toes.

>    I suspect that learning that "vacuum to prevent wraparound" isn't a
>    problem, contributes to people later ignoring "must be vacuumed within ..."
>    WARNINGS, which I've seen plenty of times.

That point never occurred to me, but it makes perfect intuitive sense
that users would behave that way. This phenomenon is sometimes called
alarm fatigue. It can be quite dangerous to warn people about
non-issues "out of an abundance of caution".

> 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
>    also regularly causes outages, and they make it more likely that an
>    eventual age-based vacuum will take forever.

Really? Outages? I imagine that you'd have to be constantly hammering
the table with DDL before it could happen. That's possible, but it
seems relatively obvious that doing that is asking for trouble.
Whereas the Manta postmortem (and every similar case that I've
personally seen) involved very nasty interactions that happened due to
the way components interacted with a workload that wasn't like that.

Running DDL from a cron job or from the application may not be a great
idea, but it's also quite common.

> Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
> into an age and anti-cancel triggers. And could be improved by reporting
> failsafe autovacuums in pg_stat_activity.

What you call aspect 2 (the issue with disastrous HW lock traffic jams
involving TRUNCATE being run from a cron job, etc) is a big goal of
mine for this patch series. You seem unsure of how effective my
approach (or an equally simple approach based on table age heuristics)
will be. Is that the case?

> Perhaps 2) could be improved a bit by emitting a WARNING message when we
> didn't cancel AV because it was anti-wraparound?  But eventually I think we
> somehow need to signal the "intent" of the lock drop down into ProcSleep() or
> wherever it'd be.

That's doable, but definitely seems like separate work.

> I have two ideas around 3):
>
> First, we could introduce thresholds for the tuple thresholds, after which
> autovacuum isn't cancellable anymore.

Do you think that's a good idea? I just don't trust those statistics,
at all. As in, I think they're often complete garbage.

> Second, we could track the number of cancellations since the last [auto]vacuum
> in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been
> cancelled a number of times.

In theory that would be better than an approach along the lines I've
proposed, because it is directly based on a less aggressive approach
being tried a few times, and failing a few times. That part I like.

However, I also don't like several things about this approach. First
of all it relies on relatively complicated infrastructure, for
something that can be critical. Second, it will be hard to test.
Third, perhaps it would make sense to give the earlier/less aggressive
approach (a table age av that is still autocancellable) quite a large
number of attempts before giving up. If the table age isn't really
growing too fast, why not continue to be patient, possibly for quite a
long time?

Perhaps a hybrid strategy could be useful? Something like what I came
up with already, *plus* a mechanism that gives up after (say) 1000
cancellations, and escalates to no-auto-cancel, regardless of table
age. It seems sensible to assume that a less aggressive approach is
just hopeless relatively quickly (in wall clock time and logical XID
time) once we see sufficiently many cancellations against the same
table.

--
Peter Geoghegan



Hi,

On 2023-01-13 19:39:41 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 6:09 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think the split is right. There's too much in 0001 - it's basically
> > introducing the terminology of 0002 already. Could you make it a much more
> > minimal change?
>
> Okay.
>
> I thought that you might say that. I really just wanted to show you
> how small the code footprint was for the more controversial part.

I don't mind you splitting this into three parts ;)



> I believe that we agree on most things when it comes to VACUUM, but
> I'm pretty sure that we still disagree about the value in setting
> autovacuum_freeze_max_age very high. I continue to believe that
> setting it over a billion or so is just a bad idea. I'm mentioning
> this only because it might give you some idea of where I'm coming from
> -- in general I believe that age-based settings often have only very
> weak relationships with what actually matters.

I think part of our difference around a high autovacuum_freeze_max_age is due
to things you're trying to address here - if no-auto-cancel is a separate
threshold from autovacuum_freeze_max_age, it's less problematic to set
autovacuum_freeze_max_age to something lower.

But yes, there's a remaining difference of opinion / experience.



> It might make sense to always give a small fixed amount of headroom
> when autovacuum_freeze_max_age is set to very high values. Maybe just
> 5 million XIDs/MXIDs. That would probably be just as effective as
> (say) 500 million in almost all cases. But then you have to accept
> another magic number.

I suspect that most systems with a high autovacuum_freeze_max_age use it
because 200m leads to too frequent autovacuums - a few million won't do much
for those.


How about a float autovacuum_no_auto_cancel_age where positive values are
treated as absolute values, and negative values are a multiple of
autovacuum_freeze_max_age? And where the "computed" age is capped at
vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.

And maybe a default setting of -1.8 or so?
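
I.e., roughly this (just a sketch of how I'd read such a GUC, nothing
final; autovacuum_no_auto_cancel_age itself is of course only proposed):

    /*
     * Sketch: >= 0 is an absolute XID age, < 0 is a multiple of
     * autovacuum_freeze_max_age, and the result is capped by
     * vacuum_failsafe_age, so a failsafe-age autovacuum can't be
     * auto-cancelled.
     */
    double  setting = autovacuum_no_auto_cancel_age;    /* e.g. -1.8 */
    int64   no_cancel_age;

    if (setting >= 0)
        no_cancel_age = (int64) setting;
    else
        no_cancel_age = (int64) (-setting * autovacuum_freeze_max_age);

    no_cancel_age = Min(no_cancel_age, (int64) vacuum_failsafe_age);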

If a user chooses to set autovacuum_no_auto_cancel_age and vacuum_failsafe_age
to 2.1 billion, oh well, that's really not our problem.



> > 1) It regularly scares the crap out of users, even though it's normal.  This
> >    is further confounded by failsafe autovacuums, where a scared reaction is
> >    appropriate, not being visible in pg_stat_activity.
>
> The docs actually imply that when the system reaches the point of
> entering xidStopLimit mode, you might get data corruption. Of course
> that's not true (the entire point of xidStopLimit is to avoid that),
> but apparently we like to keep users on their toes.

Well, historically there wasn't all that much protection. And I suspect there
still might be some dragons. We really need to get rid of the remaining places
that cache 32bit xids across transactions.



> > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
> >    also regularly causes outages, and they make it more likely that an
> >    eventual age-based vacuum will take forever.
>
> Really? Outages? I imagine that you'd have to be constantly hammering
> the table with DDL before it could happen. That's possible, but it
> seems relatively obvious that doing that is asking for trouble.

Yes, due to queries slowing down due to the bloat, or due to running out of
space.

I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
even DDL, they just needed to make the table readonly, for a brief moment, a
few times a day. The problem started once the indexes grew to be large enough
that the time for (re-)finding dead tuples + an index scan phase got large
enough that they were unlucky to be killed a few times in a row. After that
autovac never got through the index scan phase. Not doing any "new" work, just
collecting the same dead tids over and over, scanning the indexes for those
tids, never quite finishing.


You really don't need lock conflicts to be all that frequent. They just need
to come more often than vacuuming gets through its index scan phase, which
can take a long time with large indexes.



> > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
> > into an age and anti-cancel triggers. And could be improved by reporting
> > failsafe autovacuums in pg_stat_activity.
>
> What you call aspect 2 (the issue with disastrous HW lock traffic jams
> involving TRUNCATE being run from a cron job, etc) is a big goal of
> mine for this patch series. You seem unsure of how effective my
> approach (or an equally simple approach based on table age heuristics)
> will be. Is that the case?

I was primarily concerned about situations where an admin interactively was
trying to get rid of the table holding back the xid horizon, after some
problem caused a lot of work to pile up.

If you know postgres internals, it's easy. You know that autovac stops
auto-cancelling after freeze_max_age and you know that the lock queue is
ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
afterwards cancel the autovac worker. But if you aren't, it'll take a while to
figure out that the DROP TABLE isn't progressing due to a lock conflict, at
which point you'll cancel the statement (possibly having wreaked havoc with
all other accesses). Then you figure out that you need to cancel
autovacuum. After that you try the DROP TABLE again - but the next worker has
gotten to work on the table.

Brr.


The bad cousin of this is when you can't even drop the table due to "not
accepting commands". I don't know if you've seen people try to start postgres
in single user mode, under pressure, for the first time. It ain't pretty.




> > I have two ideas around 3):
> >
> > First, we could introduce thresholds for the tuple thresholds, after which
> > autovacuum isn't cancellable anymore.
>
> Do you think that's a good idea? I just don't trust those statistics,
> at all. As in, I think they're often complete garbage.

I have seen them be reasonably accurate in plenty busy systems. The most
common problem I've seen is that the amount of assumed dead / inserted tuples
is way too high, because the server crash-restarted at some point.  We now
have most of the infrastructure for using a slightly older version of stats
after a crash, which'd make them less inaccurate.

I haven't recently seen crazy over-estimates of the number of dead tuples, at
least if you count dead tids as tuples.  It might be worth splitting the dead
tuple count into dead tuples and dead items.

Are you mostly seeing over- or under-estimates?


> > Second, we could track the number of cancellations since the last [auto]vacuum
> > in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been
> > cancelled a number of times.
>
> In theory that would be better than an approach along the lines I've
> proposed, because it is directly based on a less aggressive approach
> being tried a few times, and failing a few times. That part I like.
>
> However, I also don't like several things about this approach. First
> of all it relies on relatively complicated infrastructure, for
> something that can be critical.

I don't think it relies on much more machinery than we already rely on? The
dead/inserted tuple estimates come from pgstat already. The only really
new piece would be that the worker would need to do a
pgstat_report_autovac_failure() after getting cancelled, which doesn't seem
too hard?


> Second, it will be hard to test.

It doesn't seem too bad. A short naptime, cancelling autovac in a loop, and
the threshold should quickly be reached?  We also could add a helper function
to artificially increase the failure count.


> Third, perhaps it would make sense to give the earlier/less aggressive
> approach (a table age av that is still autocancellable) quite a large
> number of attempts before giving up. If the table age isn't really
> growing too fast, why not continue to be patient, possibly for quite a
> long time?

Yea, that's true. But it's also easy to get to the point that you collect so
much "debt" that it'll be hard to get out of. Particularly once the dead items
space doesn't fit all the dead tids, the subsequent parts of the table won't
get processed by vacuum if it frequently is cancelled before the index scans
end, leading to a huge amount of work later.



> Perhaps a hybrid strategy could be useful? Something like what I came
> up with already, *plus* a mechanism that gives up after (say) 1000
> cancellations, and escalates to no-auto-cancel, regardless of table
> age. It seems sensible to assume that a less aggressive approach is
> just hopeless relatively quickly (in wall clock time and logical XID
> time) once we see sufficiently many cancellations against the same
> table.

Maybe. Making it explainable is presumably the hard part. We've historically
failed to make this area understandable, so maybe we don't need to try :)


Somehow this made me think of a somewhat crazy, and largely unrelated, idea:
Why don't we use the currently unused VM bit combination to indicate pages
with dead tids? We could have an [auto]vacuum mode where it scans just pages
with the dead tids bit set. Particularly when on-access heap pruning is doing
most of the work, that could be quite useful to more cheaply get rid of the
dead tids. Obviously we'd also set them when vacuum decides / was told not to
do index cleanup. Yes, it'd obviously be less effective than lots of the
things we discussed in this area (needing to re-collect the dead tids on the
indicated pages), but it'd have the advantage of not needing a lot of new
infrastructure.
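
To make the idea a bit more concrete (not proposing this exact coding):

    /*
     * The "frozen but not all-visible" combination is currently invalid,
     * so it could be repurposed to mean "page has dead TIDs".  A targeted
     * vacuum mode could then visit only such pages.
     */
    uint8   status = visibilitymap_get_status(rel, blkno, &vmbuffer);

    if ((status & VISIBILITYMAP_ALL_FROZEN) != 0 &&
        (status & VISIBILITYMAP_ALL_VISIBLE) == 0)
    {
        /* page was marked as containing dead TIDs -- process it */
    }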

Greetings,

Andres Freund



On Fri, Jan 13, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:
> > > If I understand the patch correctly, we now have the following age based
> > > thresholds for av:
> > >
> > > - force-enable autovacuum:
> > >   oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
> > > - autovacuum based on age:
> > >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> > >     tableagevac = relfrozenxid < recentXid - freeze_max_age
> > > - prevent auto-cancellation:
> > >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> > >   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> > >   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
> > >
> > > Is that right?
> >
> > That summary looks accurate, but I'm a bit confused about why you're
> > asking the question this way. I thought that it was obvious that the
> > patch doesn't change most of these things.
>
> For me it was helpful to clearly list the triggers when thinking about the
> issue. I found the diff hard to read and, as noted above, the logic for the
> auto cancel threshold quite confusing, so ...

I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
That looks completely magical from a user perspective. Some users
aren't going to understand autovacuum behavior at all. Some will, and
will be able to compare age(relfrozenxid) against
autovacuum_freeze_max_age. Very few people are going to think to
compare age(relfrozenxid) against some formula based on
autovacuum_freeze_max_age. I guess if we document it, maybe they will.

But even then, what's the logic behind that formula? I am not entirely
convinced that we need to separate the force-a-vacuum threshold from
the don't-cancel threshold, but if we do separate them, what's the
purpose of having the clearance between them increase as you increase
autovacuum_freeze_max_age from 0 to 500 million, and thereafter
decrease until it reaches 0 at 1 billion? I can't explain the logic
behind that except by saying "well, somebody came up with an arbitrary
formula".

I do like the idea of driving the auto-cancel behavior off of the
results of previous attempts to vacuum the table. That could be done
independently of the XID age of the table. If we've failed to vacuum
the table, say, 10 times, because we kept auto-cancelling, it's
probably appropriate to force the issue. It doesn't really matter
whether the autovacuum triggered because of bloat or because of XID
age. Letting either of those things get out of control is bad. What I
think happens fairly commonly right now is that the vacuums just keep
getting cancelled until the table's XID age gets too old, and then we
finally force the issue. But at that point a lot of harm has already
been done. In a frequently updated table, waiting 300 million XIDs to
stop cancelling the vacuum is basically condemning the user to have to
run VACUUM FULL. The table can easily be ten or a hundred times bigger
than it should be by that point.

And that's a big reason why I am skeptical about the patch as
proposed. It raises the threshold for auto-cancellation in cases where
it's sometimes already far too high.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote:
> I think part of our difference around a high autovacuum_freeze_max_age is due
> to things you're trying to address here - if no-auto-cancel is a separate
> threshold from autovacuum_freeze_max_age, it's less problematic to set
> autovacuum_freeze_max_age to something lower.

That seems like a much smaller difference of opinion than I'd imagined
it was before now.

> But yes, there's a remaining difference of opinion / experience.

Perhaps it's also a communication issue. I don't disagree with
pragmatic decisions that made sense given very particular limitations
in Postgres, which this is starting to sound like now.

When I express skepticism of very high autovacuum_freeze_max_age
settings, it's mostly just that I don't think that age(relfrozenxid)
is at all informative in the way that something that triggers
autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
less informative it becomes. I'm sure that it can be safe to use very
high autovacuum_freeze_max_age values, but it seems like a case of
using a very complicated and unreliable thing to decide when to
VACUUM.

> > It might make sense to always give a small fixed amount of headroom
> > when autovacuum_freeze_max_age is set to very high values. Maybe just
> > 5 million XIDs/MXIDs. That would probably be just as effective as
> > (say) 500 million in almost all cases. But then you have to accept
> > another magic number.
>
> I suspect that most systems with a high autovacuum_freeze_max_age use it
> because 200m leads to too frequent autovacuums - a few million won't do much
> for those.

My point was mostly that it really doesn't cost us anything, and it
could easily help, so it might be worthwhile.

I wonder if these users specifically get too many aggressive
autovacuums with lower autovacuum_freeze_max_age settings? Maybe they
even fail to get enough non-aggressive autovacuums? If that's what it
is, then that makes perfect sense to me. However, I tend to think of
this as an argument against aggressive VACUUMs, not an argument in
favor of high autovacuum_freeze_max_age settings (as I said,
communication is hard).

> How about a float autovacuum_no_auto_cancel_age where positive values are
> treated as absolute values, and negative values are a multiple of
> autovacuum_freeze_max_age? And where the "computed" age is capped at
> vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.
>
> And maybe a default setting of -1.8 or so?

I think that would work fine, as a GUC (with no reloption).

> Well, historically there wasn't all that much protection. And I suspect there
> still might be some dragons. We really need to get rid of the remaining places
> that cache 32bit xids across transactions.

Maybe, but our position is that it's supported, it's safe -- there
will be no data loss (or else it's a bug, and one that we'd take very
seriously at that). Obviously it's a bad idea to allow this to happen,
but surely having the system enter xidStopLimit is sufficient
disincentive for users.

> I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
> server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
> even DDL, they just needed to make the table readonly, for a brief moment, a
> few times a day. The problem started once the indexes grew to be large enough
> that the time for (re-)finding dead tuples + an index scan phase got large
> enough that they were unlucky to be killed a few times in a row. After that
> autovac never got through the index scan phase. Not doing any "new" work, just
> collecting the same dead tids over and over, scanning the indexes for those
> tids, never quite finishing.

That makes sense to me, though I wonder if there would have been
another kind of outage (not the one you actually saw) had the
autocancellation behavior somehow been disabled after the first few
cancellations.

This sounds like a great argument in favor of suspend-and-resume as a
way of handling autocancellation -- no useful work needs to be thrown
away for AV to yield for a minute or two. One ambition of mine for the
visibility map snapshot infrastructure was to be able to support
suspend-and-resume. It wouldn't be that hard for autovacuum to
serialize everything, and use that to pick up right where an
autovacuum worker left off when it was autocancelled. Same
OldestXmin starting point as before, same dead_items array, same
number of scanned_pages (and pages left to scan).
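
The serialized state wouldn't need to be much more than something like
this (a hand-wavy sketch, all names invented):

    /* Hand-wavy sketch -- nothing like this exists today */
    typedef struct SuspendedVacuumState
    {
        Oid             relid;
        TransactionId   OldestXmin;     /* same cutoff as before suspension */
        MultiXactId     OldestMxact;
        BlockNumber     scanned_pages;  /* where to resume the heap scan */
        BlockNumber     rel_pages;      /* pages left to scan follow from this */
        int64           num_dead_items; /* serialized dead_items TIDs follow */
    } SuspendedVacuumState;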

> > What you call aspect 2 (the issue with disastrous HW lock traffic jams
> > involving TRUNCATE being run from a cron job, etc) is a big goal of
> > mine for this patch series. You seem unsure of how effective my
> > approach (or an equally simple approach based on table age heuristics)
> > will be. Is that the case?
>
> I was primarily concerned about situations where an admin interactively was
> trying to get rid of the table holding back the xid horizon, after some
> problem caused a lot of work to pile up.

Fair enough, but the outages that I'm mostly thinking about here
weren't really like that. There wasn't anything holding back the
horizon, at any point. It was just that the autocancellation behavior
was disruptive in some critical way.

> If you know postgres internals, it's easy. You know that autovac stops
> auto-cancelling after freeze_max_age and you know that the lock queue is
> ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> afterwards cancel the autovac worker. But if you aren't, it'll take a while to
> figure out that the DROP TABLE isn't progressing due to a lock conflict, at
> which point you'll cancel the statement (possibly having wreaked havoc with
> all other accesses). Then you figure out that you need to cancel
> autovacuum. After that you try the DROP TABLE again - but the next worker has
> gotten to work on the table.

Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
special cases? Maybe they should always be able to auto cancel an
autovacuum?

> > > I have two ideas around 3):
> > >
> > > First, we could introduce thresholds for the tuple thresholds, after which
> > > autovacuum isn't concealable anymore.
> >
> > Do you think that's a good idea? I just don't trust those statistics,
> > at all. As in, I think they're often complete garbage.
>
> I have seen them be reasonably accurate in plenty busy systems. The most
> common problem I've seen is that the amount of assumed dead / inserted tuples
> is way too high, because the server crash-restarted at some point.  We now
> have most of the infrastructure for using a slightly older version of stats
> after a crash, which'd make them less inaccurate.

Did you mean way too low?

> I haven't recently seen crazy over-estimates of the number of dead tuples, at
> least if you count dead tids as tuples.  It might be worth splitting the dead
> tuple count into dead tuples and dead items.
>
> Are you mostly seeing over- or under-estimates?

It's not so much what I've seen. It's that the actual approach has
lots of problems.

Referring to my notes, here are what seem to me to be serious problems:

* We are very naive about what a dead tuple even means, and we totally
fail to account for the fact that only the subset of heap pages that
are not PageIsAllVisible() are interesting to VACUUM -- focusing on the
whole table just seems wrong.

Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com

* Stub LP_DEAD items in heap pages are extremely different to dead
heap-only tuples in heap pages, which we ignore.

Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com

* Problem where insert-driven autovacuums
(autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor
triggers AVs) become further spaced apart as a consequence of one
VACUUM operation taking far longer than usual (generally because it's
an aggressive VACUUM that follows several non-aggressive VACUUMs).

Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com

It's quite possible to get approximately the desired outcome with an
algorithm that is completely wrong -- the way that we sometimes need
autovacuum_freeze_max_age to deal with bloat is a great example of
that. Even then, there may still be serious problems that are well
worth solving.

> > However, I also don't like several things about this approach. First
> > of all it relies on relatively complicated infrastructure, for
> > something that can be critical.
>
> I don't think it relies on much more machinery than we already rely on? The
> dead/inserted tuple estimates come from pgstat already. The only really
> new piece would be that the worker would need to do a
> pgstat_report_autovac_failure() after getting cancelled, which doesn't seem
> too hard?

Yeah, but it is a new dependency on stored state. Certainly doable,
but hard enough that it might be better to add that part later on.

> Somehow this made me think of a somewhat crazy, and largely unrelated, idea:
> Why don't we use the currently unused VM bit combination to indicate pages
> with dead tids? We could have an [auto]vacuum mode where it scans just pages
> with the dead tids bit set. Particularly when on-access heap pruning is doing
> most of the work, that could be quite useful to more cheaply get rid of the
> dead tids. Obviously we'd also set them when vacuum decides / was told not to
> do index cleanup. Yes, it'd obviously be less effective than lots of the
> things we discussed in this area (needing to re-collect the dead tids on the
> indicated pages), but it'd have the advantage of not needing a lot of new
> infrastructure.

I wonder if it would be possible to split up the work of VACUUM into
multiple phases that can be processed independently. The dead_items
array could be serialized and stored in a temp file. That's not the
same as some of the more complicated stuff we talked about in the last
couple of years, such as a dedicated fork for Dead TIDs. It's more
like an extremely flexible version of the same basic design for
VACUUM, with the ability to slow down and speed up based on a system
level view of things (e.g., checkpointing information). And with index
vacuuming happening on a highly deferred timeline in some cases.
Possibly we could make each slice of work processed by any available
autovacuum worker. Autovacuum workers could become "headless".

You would need some kind of state machine to make sure that critical
dependencies were respected (e.g., always do the heap vacuuming step
after all indexes are vacuumed), but that possibly isn't that hard,
and still gives you a lot.
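
To make the dependencies a bit more concrete (again, invented names,
nothing like this exists):

    /*
     * Sketch of the phase dependencies only.  The state machine's main
     * job is enforcing that heap vacuuming happens only once every
     * index has been vacuumed.
     */
    typedef enum VacuumPhase
    {
        VACPHASE_SCAN_HEAP,         /* prune/freeze, collect dead TIDs */
        VACPHASE_VACUUM_INDEXES,    /* one slice per index, any worker */
        VACPHASE_VACUUM_HEAP,       /* only after all indexes are done */
        VACPHASE_TRUNCATE           /* optional rel truncation at the end */
    } VacuumPhase;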

As for this patch of mine: do you think that it would be acceptable to
pursue a version based on your autovacuum_no_auto_cancel_age design
for 16? Perhaps this can include something like
pgstat_report_autovac_failure(). It's not even the work of
implementing pgstat_report_autovac_failure() that creates risk that
it'll miss the 16 feature freeze deadline. I'm more concerned that
introducing a more complicated design will lead to the patch being
bikeshedded to death.

--
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> That looks completely magical from a user perspective. Some users
> aren't going to understand autovacuum behavior at all. Some will, and
> will be able to compare age(relfrozenxid) against
> autovacuum_freeze_max_age. Very few people are going to think to
> compare age(relfrozenxid) against some formula based on
> autovacuum_freeze_max_age. I guess if we document it, maybe they will.

What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

As I've said several times already, I am by no means attached to the
current formula.

> I do like the idea of driving the auto-cancel behavior off of the
> results of previous attempts to vacuum the table. That could be done
> independently of the XID age of the table.

Even when the XID age of the table has already significantly surpassed
autovacuum_freeze_max_age, say due to autovacuum worker starvation?

> If we've failed to vacuum
> the table, say, 10 times, because we kept auto-cancelling, it's
> probably appropriate to force the issue.

I suggested 1000 times upthread. 10 times seems very low, at least if
"number of times cancelled" is the sole criterion, without any
attention paid to relfrozenxid age or some other tiebreaker.

> It doesn't really matter
> whether the autovacuum triggered because of bloat or because of XID
> age. Letting either of those things get out of control is bad.

While inventing a new no-auto-cancel behavior that prevents bloat from
getting completely out of control may well have merit, I don't see why
it needs to be attached to this other effort.

I think that the vast majority of individual tables have autovacuums
cancelled approximately never, and so my immediate concern is
ameliorating cases where not being able to auto-cancel once in a blue
moon causes an outage. Sure, the opposite problem also exists, and I
think that it would be really bad if it was made significantly worse
as an unintended consequence of a patch that addressed just the first
problem. But that doesn't mean we have to solve both problems together
at the same time.

> But at that point a lot of harm has already
> been done. In a frequently updated table, waiting 300 million XIDs to
> stop cancelling the vacuum is basically condemning the user to have to
> run VACUUM FULL. The table can easily be ten or a hundred times bigger
> than it should be by that point.

The rate at which relfrozenxid ages is just about useless as a proxy
for how much wall clock time has passed with a given workload --
workloads are usually very bursty. It's much worse still as a proxy
for what has changed in the table; completely static tables have their
relfrozenxid age at exactly the same rate as the most frequently
updated table in the same database (the table that "consumes the most
XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may
indeed save the user from problems with serious bloat, it will happen
pretty much by mistake. Not that it doesn't happen all the same -- of
course it does.

That factor (the mistake factor) doesn't mean I take the point any
less seriously. What I don't take seriously is the idea that the
precise XID age was ever crucially important.

More generally, I just don't accept that this leaves us with no room for
something along the lines of what I've proposed, such as Andres'
autovacuum_no_auto_cancel_age concept.
usually be a very asymmetric quality to the problem in cases like the
Joyent outage. Even a modest amount of additional XID-space-headroom
will very likely be all that will be needed at the critical juncture.
It may not be perfect, but it still has every potential to make things
safer for some users, without making things any less safe for other
users.

--
Peter Geoghegan



On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > That looks completely magical from a user perspective. Some users
> > aren't going to understand autovacuum behavior at all. Some will, and
> > will be able to compare age(relfrozenxid) against
> > autovacuum_freeze_max_age. Very few people are going to think to
> > compare age(relfrozenxid) against some formula based on
> > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
>
> What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

I like it better than your proposal. I don't think it's a fundamental
improvement and I would rather see a fundamental improvement, but I
can see it being better than nothing.

> > I do like the idea of driving the auto-cancel behavior off of the
> > results of previous attempts to vacuum the table. That could be done
> > independently of the XID age of the table.
>
> Even when the XID age of the table has already significantly surpassed
> autovacuum_freeze_max_age, say due to autovacuum worker starvation?
>
> > If we've failed to vacuum
> > the table, say, 10 times, because we kept auto-cancelling, it's
> > probably appropriate to force the issue.
>
> I suggested 1000 times upthread. 10 times seems very low, at least if
> "number of times cancelled" is the sole criterion, without any
> attention paid to relfrozenxid age or some other tiebreaker.

Hmm, I think that a threshold of 1000 is far too high to do much good.
By the time we've tried to vacuum a table 1000 times and failed every
time, I anticipate that the situation will be pretty dire, regardless
of why we thought the table needed to be vacuumed in the first place.
In the best case, with autovacuum_naptime=1 minute, failing 1000 times
means that we've delayed vacuuming the table for at least 16 hours.
That's assuming that there's a worker available to retry every minute
and that we fail quickly. If it's a 2 hour vacuum operation and we
typically fail about halfway through, it could take us over a month to
hit 1000 failures. There are many tables out there that get enough
inserts, updates, and deletes that a 16-hour delay will result in
irreversible bloat, never mind a 41-day delay. After even a few days,
wraparound could become critical even if bloat isn't.

I'm not sure why a threshold of 10 would be too low. It seems to me
that if we fail ten times in a row to vacuum a table and fail for the
same reason every time, we're probably going to keep failing for that
reason. If that is true, we will be better off if we force the issue
sooner rather than later. There's no value in letting the table bloat
out the wazoo and the cluster approach a wraparound shutdown before we
insist. Consider a more mundane example. If I try to start my car or
my dishwasher or my computer or my toaster oven ten times and it fails
ten times in a row, and the failure mode appears to be the same each
time, I am not going to sit there and try 990 more times hoping things
get better, because that seems very unlikely to help. Honestly,
depending on the situation, I might not even get to ten times before I
switch to doing some form of troubleshooting and/or calling someone
who could repair the device.

In fact I think there's a decent argument that a threshold of ten is
possibly too high here, too. If you wait until the tenth try before
you try not auto-cancelling, then a table with a workload that makes
auto-cancelling 100% probable will get vacuumed 10% as often as it
would otherwise. I think there are cases where that would be OK, but
probably on the whole it's not going to go very well. The only problem
I see with lowering the threshold below ~10 is that the signal starts
to get weak. If something fails for the same reason ten times in a row
you can be pretty sure it's a chronic problem. If you made the
threshold say three you'd probably start making bad decisions
sometimes -- you'd think that you had a chronic problem when really
you just got a bit unlucky.

To get back to the earlier question above, I think that if the
retries-before-not-auto-cancelling threshold were low enough to be
effective, you wouldn't necessarily need to consider XID age as a
second reason for not auto-cancelling. You would want to force the
behavior anyway when you hit emergency mode, because that should force
all the mitigations we have, but I don't know that you need to do
anything before that.

> > It doesn't really matter
> > whether the autovacuum triggered because of bloat or because of XID
> > age. Letting either of those things get out of control is bad.
>
> While inventing a new no-auto-cancel behavior that prevents bloat from
> getting completely out of control may well have merit, I don't see why
> it needs to be attached to this other effort.

It doesn't, but I think it would be a lot more beneficial than just
adding a new GUC. A lot of the fundamental stupidity of autovacuum
comes from its inability to consider the context. I've had various
ideas over the years about how to fix that, but this is far simpler
than some things I've thought about and I think it would help a lot of
people.

> That factor (the mistake factor) doesn't mean I take the point any
> less seriously. What I don't take seriously is the idea that the
> precise XID age was ever crucially important.

I agree. That's why I think driving this off of number of previous
failures would be better than driving it off of an XID age.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote:
> When I express skepticism of very high autovacuum_freeze_max_age
> settings, it's mostly just that I don't think that age(relfrozenxid)
> is at all informative in the way that something that triggers
> autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> less informative it becomes. I'm sure that it can be safe to use very
> high autovacuum_freeze_max_age values, but it seems like a case of
> using a very complicated and unreliable thing to decide when to
> VACUUM.

Both you and Robert said this, and I have seen it be true, but typically not
for large high-throughput OLTP databases, where I found increasing
relfrozenxid to be important. Sure, there's probably some up/down through the
day / week, but it's likely to be pretty predictable.

I think the problem is that an old relfrozenxid doesn't tell you how much
outstanding work there is. Perhaps that's what both of you meant...


I think that's not the fault of relfrozenxid as a trigger, but that we simply
don't keep enough other stats. We should imo at least keep track of:

In pg_class:

- The number of all frozen pages, like we do for all-visible

  That'd give us a decent upper bound for the amount of work we need to do to
  increase relfrozenxid. It's important that this is crash safe (thus no
  pg_stats), and it only needs to be written when we'd likely make other
  changes to the pg_class row anyway.


In pgstats:

- The number of dead items, incremented both by the heap scan and
  opportunistic pruning

  This would let us estimate how urgently we need to clean up indexes.

- The xid/mxid horizons during the last vacuum

- The number of pages with tuples that couldn't be removed due to the horizon
  during the last vacuum

  Together with the horizon, this would let us avoid repeated vacuums that
  won't help. Tracking the number of pages instead of tuples allows a lot
  better cost/benefit estimation of another vacuum.

- The number of pages with tuples that couldn't be frozen

  Similar to the dead tuple one, except that it'd help avoid repeated vacuums
  to increase relfrozenxid, when it won't be able to help.
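
To make the shape of that concrete, here is a rough sketch of the
additions (all of the field names are illustrative only -- nothing like
this exists today):

    /* pg_class: crash safe, written alongside other pg_class updates */
    int32           relallfrozen;        /* # of all-frozen pages, mirroring
                                          * relallvisible */

    /* pgstats, per table (hypothetical PgStat_StatTabEntry fields) */
    PgStat_Counter  dead_items;          /* LP_DEAD items, from heap scans
                                          * and opportunistic pruning */
    TransactionId   last_vacuum_xid_horizon;  /* OldestXmin of last vacuum */
    MultiXactId     last_vacuum_mxid_horizon; /* MXID horizon of last vacuum */
    PgStat_Counter  pages_not_removable; /* pages with tuples the horizon
                                          * kept us from removing */
    PgStat_Counter  pages_not_frozen;    /* pages with tuples that could
                                          * not be frozen */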


> > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
> > server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
> > even DDL, they just needed to make the table readonly, for a brief moment, a
> > few times a day. The problem started once the indexes grew to be large enough
> > that the time for (re-)finding dead tuples + an index scan phase got large
> > enough that they were unlucky to be killed a few times in a row. After that
> > autovac never got through the index scan phase. Not doing any "new" work, just
> > collecting the same dead tids over and over, scanning the indexes for those
> > tids, never quite finishing.
>
> That makes sense to me, though I wonder if there would have been
> another kind of outage (not the one you actually saw) had the
> autocancellation behavior somehow been disabled after the first few
> cancellations.

I suspect so, but it's hard to know.


> This sounds like a great argument in favor of suspend-and-resume as a
> way of handling autocancellation -- no useful work needs to be thrown
> away for AV to yield for a minute or two. One ambition of mine for the
> visibility map snapshot infrastructure was to be able support
> suspend-and-resume. It wouldn't be that hard for autovacuum to
> serialize everything, and use that to pick up right where an
> autovacuum worker left of at when it was autocancelled. Same
> OldestXmin starting point as before, same dead_items array, same
> number of scanned_pages (and pages left to scan).

Hm, that seems a lot of work. Without having held a lock you don't even know
whether your old dead items still apply. Of course it'd improve the situation
substantially, if we could get it.



> > If you know postgres internals, it's easy. You know that autovac stops
> > auto-cancelling after freeze_max_age and you know that the lock queue is
> > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> > afterwards cancel the autovac worker. But if you aren't, it'll take a while to
> > figure out that the DROP TABLE isn't progressing due to a lock conflict, at
> > which point you'll cancel the statement (possibly having wreaked havoc with
> > all other accesses). Then you figure out that you need to cancel
> > autovacuum. After you try the DROP TABLE again - but the next worker has
> > gotten to work on the table.
>
> Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> special cases? Maybe they should always be able to auto cancel an
> autovacuum?

Yea, I think so. It's not obvious how to best pass down that knowledge into
ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
way would be to change LockAcquireExtended() to get a flags argument instead
of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP &
LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for
RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR*
flags.
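
A rough sketch of that signature change (the flag names are hypothetical;
only the reportMemoryError parameter is actually being replaced):

    #define LOCK_ACQUIRE_REPORT_MEMORY_ERROR    0x01
    #define LOCK_ACQUIRE_INTENT_DROP            0x02
    #define LOCK_ACQUIRE_INTENT_TRUNCATE        0x04

    extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag,
                                                 LOCKMODE lockmode,
                                                 bool sessionLock,
                                                 bool dontWait,
                                                 int flags,  /* was: bool reportMemoryError */
                                                 LOCALLOCK **locallockp);

    /*
     * The intent flags would be remembered in the LOCALLOCK, so that
     * ProcSleep() can see that the waiter is a DROP TABLE / TRUNCATE and
     * auto-cancel a conflicting autovacuum even past the usual cutoff.
     */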


> > > > I have two ideas around 3):
> > > >
> > > > First, we could introduce thresholds for the tuple thresholds, after which
> > > > autovacuum isn't cancelable anymore.
> > >
> > > Do you think that's a good idea? I just don't trust those statistics,
> > > at all. As in, I think they're often complete garbage.
> >
> > I have seen them be reasonably accurate in plenty busy systems. The most
> > common problem I've seen is that the amount of assumed dead / inserted tuples
> > is way too high, because the server crash-restarted at some point.  We now
> > have most of the infrastructure for using a slightly older version of stats
> > after a crash, which'd make them less inaccurate.
>
> Did you mean way too low?

Err, yes.


> > I haven't recently seen crazy over-estimates of the number of dead tuples, at
> > least if you count dead tids as tuples.  It might be worth splitting the dead
> > tuple count into dead tuples and dead items.
> >
> > Are you mostly seeing over- or under-estimates?
>
> It's not so much what I've seen. It's that the actual approach has
> lots of problems.
>
> Referring to my notes, here are what seem to me to be serious problems:

ISTM that some of what you write below would be addressed, at least partially,
by the stats I proposed above. Particularly keeping some "page granularity"
instead of "tuple granularity" stats seems helpful.

It'd be great if we could update the page-granularity stats in
heap_{insert,multi_insert,update,delete,heapgetpage}. But without page level
flags like "has any dead tuples" that's probably too expensive.


> * We are very naive about what a dead tuple even means, and we totally
> fail to account for the fact that only the subset of heap pages that
> are PageIsAllVisible() are interesting to VACUUM -- focusing on the
> whole table just seems wrong.
>
> Per https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com


> * Stub LP_DEAD items in heap pages are extremely different to dead
> heap-only tuples in heap pages, which we ignore.
>
> Per https://postgr.es/m/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com

> * Problem where insert-driven autovacuums
> (autovacuum_vacuum_insert_threshold/autovacuum_vacuum_insert_scale_factor
> triggers AVs) become further spaced apart as a consequence of one
> VACUUM operation taking far longer than usual (generally because it's
> an aggressive VACUUM that follows several non-aggressive VACUUMs).

> Per https://postgr.es/m/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g@mail.gmail.com

Yea, that's not great. But seems fairly addressable.


> It's quite possible to get approximately the desired outcome with an
> algorithm that is completely wrong -- the way that we sometimes need
> autovacuum_freeze_max_age to deal with bloat is a great example of
> that.

Yea. I think this is part of why I like my idea about tracking more observations
made by the last vacuum - they're quite easy to get right, and they
self-correct, rather than potentially ending up causing ever-wronger stats.


> Even then, there may still be serious problems that are well
> worth solving.

Right. I think it's fundamental that we get a lot better estimates about the
amount of work needed. Without that we have no chance of finishing autovacuums
before problems become too big.



> > Somehow this made me think of a somewhat crazy, and largely unrelated, idea:
> > Why don't we use the currently unused VM bit combination to indicate pages
> > with dead tids? We could have an [auto]vacuum mode where it scans just pages
> > with the dead tids bit set. Particularly when on-access heap pruning is doing
> > most of the work, that could be quite useful to more cheaply get rid of the
> > dead tids. Obviously we'd also set them when vacuum decides / was told not to
> > do index cleanup. Yes, it'd obviously be less effective than lots of the
> > things we discussed in this area (needing to re-collect the dead tids on the
> > indicated pages), but it'd have the advantage of not needing a lot of new
> > infrastructure.
>
> I wonder if it would be possible to split up the work of VACUUM into
> multiple phases that can be processed independently. The dead_items
> array could be serialized and stored in a temp file. That's not the
> same as some of the more complicated stuff we talked about in the last
> couple of years, such as a dedicated fork for Dead TIDs. It's more
> like an extremely flexible version of the same basic design for
> VACUUM, with the ability to slow down and speed up based on a system
> level view of things (e.g., checkpointing information). And with index
> vacuuming happening on a highly deferred timeline in some cases.
> Possibly we could make each slice of work processed by any available
> autovacuum worker. Autovacuum workers could become "headless".

I don't know. I think the more basic idea I describe has significant
advantages - most importantly being able to target autovacuum on the work that
on-access pruning couldn't deal with. As you know it's very common that most
row versions are HOT and that the remaining dead tuples also get removed by
on-access pruning. Which often leads to running out of usable items, but
autovacuum won't be triggered because there's not all that much garbage
overall.  Without the ability for autovacuum to target such pages
aggressively, I don't think we're going to improve such common workloads a
whole lot. And serialized vacuum state won't help, because that still requires
vacuum to scan all the !all-visible pages to discover them. Most of which
won't contain dead tuples in a lot of workloads.
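
For reference, the "unused combination" works because today a heap page is
never marked all-frozen without also being marked all-visible, so
ALL_FROZEN-without-ALL_VISIBLE is free to be given a new meaning. A
targeted vacuum pass might then look roughly like this (purely a sketch;
the macro and the convention are made up):

    #define VM_PAGE_HAS_DEAD_ITEMS(status) \
        (((status) & VISIBILITYMAP_ALL_FROZEN) != 0 && \
         ((status) & VISIBILITYMAP_ALL_VISIBLE) == 0)

    Buffer      vmbuffer = InvalidBuffer;

    /* rel and nblocks assumed to be set up by the caller */
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        uint8   status = visibilitymap_get_status(rel, blkno, &vmbuffer);

        if (!VM_PAGE_HAS_DEAD_ITEMS(status))
            continue;           /* nothing for this pass to do here */

        /* prune/remove this page's dead items, then clear the bit */
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);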



> As for this patch of mine: do you think that it would be acceptable to
> pursue a version based on your autovacuum_no_auto_cancel_age design
> for 16? Perhaps this can include something like
> pgstat_report_autovac_failure(). It's not even the work of
> implementing pgstat_report_autovac_failure() that creates risk that
> it'll miss the 16 feature freeze deadline. I'm more concerned that
> introducing a more complicated design will lead to the patch being
> bikeshedded to death.

I don't feel like I have a good handle on what could work for 16 and what
couldn't. Personally I think something like autovacuum_no_auto_cancel_age
would be an improvement, but I also don't quite feel satisfied with it.

Tracking the number of autovac failures seems uncontroversial and quite
beneficial, even if the rest doesn't make it in. It'd at least let users
monitor for tables where autovac is likely to swoop in in anti-wraparound
mode.

Perhaps it's worth separately tracking the number of times a backend would have
liked to cancel autovac, but couldn't due to anti-wrap? If changes to the
no-auto-cancel behaviour don't make it in, it'd at least allow us to collect
more data about the prevalence of the problem and in what situations it
occurs?  Even just adding some logging for that case seems like it'd be an
improvement.


I think with a bit of polish "Add autovacuum trigger instrumentation." ought
to be quickly mergeable.

Greetings,

Andres Freund



Hi,

On 2023-01-17 10:26:52 -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Mon, Jan 16, 2023 at 8:25 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > > That looks completely magical from a user perspective. Some users
> > > aren't going to understand autovacuum behavior at all. Some will, and
> > > will be able to compare age(relfrozenxid) against
> > > autovacuum_freeze_max_age. Very few people are going to think to
> > > compare age(relfrozenxid) against some formula based on
> > > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
> >
> > What do you think of Andres' autovacuum_no_auto_cancel_age proposal?
> 
> I like it better than your proposal. I don't think it's a fundamental
> improvement and I would rather see a fundamental improvement, but I
> can see it being better than nothing.

That's similar to my feelings about it.

I do think it'll be operationally nice to have at least some window where an
autovacuum is triggered due to age and where it won't prevent cancels. In many
situations it'll likely suffice if that window is autovacuum_naptime *
xids_per_sec large, but of course that's easily enough exceeded.


> > > I do like the idea of driving the auto-cancel behavior off of the
> > > results of previous attempts to vacuum the table. That could be done
> > > independently of the XID age of the table.
> >
> > Even when the XID age of the table has already significantly surpassed
> > autovacuum_freeze_max_age, say due to autovacuum worker starvation?
> >
> > > If we've failed to vacuum
> > > the table, say, 10 times, because we kept auto-cancelling, it's
> > > probably appropriate to force the issue.
> >
> > I suggested 1000 times upthread. 10 times seems very low, at least if
> > "number of times cancelled" is the sole criterion, without any
> > attention paid to relfrozenxid age or some other tiebreaker.
> 
> Hmm, I think that a threshold of 1000 is far too high to do much good.

Agreed.


> By the time we've tried to vacuum a table 1000 times and failed every
> time, I anticipate that the situation will be pretty dire, regardless
> of why we thought the table needed to be vacuumed in the first place.

Agreed.


> In the best case, with autovacuum_naptime=1minute, failing 1000 times
> means that we've delayed vacuuming the table for at least 16 hours.

Perhaps it'd make sense for an auto-cancelled worker to signal the launcher to
do a cycle of vacuuming? Or even to just try to vacuum the table again
immediately? After all, we know that the table is going to be on the schedule
of the next worker immediately. Of course we shouldn't retry indefinitely, but
...


> In fact I think there's a decent argument that a threshold of ten is
> possibly too high here, too. If you wait until the tenth try before
> you try not auto-cancelling, then a table with a workload that makes
> auto-cancelling 100% probable will get vacuumed 10% as often as it
> would otherwise. I think there are cases where that would be OK, but
> probably on the whole it's not going to go very well.

That's already kind of the case - we'll only block auto-cancelling when
exceeding autovacuum_freeze_max_age, all the other autovacuums will be
cancelable.


> The only problem I see with lowering the threshold below ~10 is that the
> signal starts to get weak. If something fails for the same reason ten times
> in a row you can be pretty sure it's a chronic problem. If you made the
> threshold say three you'd probably start making bad decisions sometimes --
> you'd think that you had a chronic problem when really you just got a bit
> unlucky.

Yea. Schema migrations in prod databases typically have to run in
single-statement or very small transactions, for obvious reasons. Needing to
lock the same table exclusively a few times during a schema migration is
pretty normal, particularly when foreign keys are involved. Getting blocked by
autovacuum in the middle of a schema migration is NASTY.

This is why I'm a bit worried that 10 might be too low... It's not absurd for
a schema migration to create 10 new tables referencing an existing table in
need of vacuuming.

Perhaps we should track when the first failure was, and take that into
account? Clearly having all 10 autovacuums on the same table cancelled is
different when those 10 cancellations happened in the last 10 *
autovacuum_naptime minutes, than when the last successful autovacuum was hours
ago.
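
Something as simple as this hypothetical check might capture that (the
threshold, the bookkeeping fields, and the helper itself are all made up
for illustration; it leans on the TimestampTz helpers from utils/timestamp.h):

    static bool
    av_should_block_cancel(int ncancels, TimestampTz first_cancel)
    {
        /* not enough evidence of a chronic problem yet */
        if (ncancels < 10)
            return false;

        /*
         * If every cancellation fits inside one short recent burst (say a
         * schema migration), keep allowing auto-cancellation for now.
         */
        if (!TimestampDifferenceExceeds(first_cancel, GetCurrentTimestamp(),
                                        10 * autovacuum_naptime * 1000))
            return false;

        return true;
    }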


> To get back to the earlier question above, I think that if the
> retries-before-not-auto-cancelling threshold were low enough to be
> effective, you wouldn't necessarily need to consider XID age as a
> second reason for not auto-cancelling. You would want to force the
> behavior anyway when you hit emergency mode, because that should force
> all the mitigations we have, but I don't know that you need to do
> anything before that.

Hm, without further restrictions, that has me worried. It's not crazy to have
a LOCK TABLE on a small-ish table be part of your workload - I've certainly
seen it plenty of times. Suddenly blocking on that for a few minutes, just
because a bit of bloat has collected, seems likely to cause havoc.

Greetings,

Andres Freund



On Tue, Jan 17, 2023 at 1:02 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> > On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote:
> > When I express skepticism of very high autovacuum_freeze_max_age
> > settings, it's mostly just that I don't think that age(relfrozenxid)
> > is at all informative in the way that something that triggers
> > autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> > less informative it becomes. I'm sure that it can be safe to use very
> > high autovacuum_freeze_max_age values, but it seems like a case of
> > using a very complicated and unreliable thing to decide when to
> > VACUUM.
>
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

I think so, at least in my case. The reason why the default
autovacuum_freeze_max_age is 300m is not because going over 300m is a
problem, but because we don't know how long it's going to take to run
all of the vacuums, and we're still going to be consuming XIDs in the
meantime, and we have no idea how fast we're consuming XIDs either.
For all we know it might take days, and we might be burning through
XIDs really quickly, so we might need a ton of headroom.

That's not to say that raising the setting might not be a sensible
thing to do in a context where you know what the workload is. If you
know that the vacuums will complete quickly OR that the XID burn
rate is low, you can raise it drastically and be fine. We just can't
assume that will be true everywhere.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:
>
> In pg_class:
>
> - The number of all frozen pages, like we do for all-visible
>
>   That'd give us a decent upper bound for the amount of work we need to do to
>   increase relfrozenxid. It's important that this is crash safe (thus no
>   pg_stats), and it only needs to be written when we'd likely make other
>   changes to the pg_class row anyway.

I'm not sure how useful this is because a lot of the work is from
scanning the indexes.

> In pgstats:
>
> - The number of dead items, incremented both by the heap scan and
>   opportunistic pruning
>
>   This would let us estimate how urgently we need to clean up indexes.

I don't think this is true because btree indexes are self-cleaning in
some scenarios and not in others.

> - The xid/mxid horizons during the last vacuum
>
> - The number of pages with tuples that couldn't be removed due to the horizon
>   during the last vacuum
>
> - The number of pages with tuples that couldn't be frozen

Not bad to know, but if the horizon we could use advances by 1, we
can't tell whether that allows pruning nothing additional or another
billion tuples.

I'm not trying to take the position that XID age is a totally useless
metric. I don't think it is. If XID age is high, you know you have a
problem, and the higher it is, the more urgent that problem is.
Furthermore, if XID age is low, you know that you don't have that
particular problem. You might have some other one, but that's OK: this
one metric doesn't have to answer every question to be useful.
However, where XID age really falls down as a metric is that it
doesn't tell you what it's going to take to solve the problem. The
answer, at ten thousand feet, is always vacuum. But how long will that
vacuum run? We don't know. Do we need the XID horizon to advance
first, and if so, how far? We don't know. Do we need auto-cancellation
to be disabled? We don't know. That's where we get into a lot of
trouble here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

That's what I meant, yes.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:

If you assume that there is chronic undercounting of dead tuples
(which I think is very common), then of course anything that triggers
vacuuming is going to help with that problem -- it might be totally
inadequate, but still make the critical difference by not allowing the
system to become completely destabilized. I absolutely accept that
users that are relying on that exist, and that those users ought to
not have things get even worse -- I'm pragmatic. But overall, what we
should be doing is fixing the real problem, which is that the dead
tuples accounting is deeply flawed. Actually, it's not just that the
statistics are flat out wrong; the whole model is flat-out wrong.

The assumptions that work well for optimizer statistics quite simply
do not apply here. Random sampling for this is just wrong, because
we're not dealing with something that follows a distribution that can
be characterized with a sufficiently large sample. With optimizer
statistics, the entire contents of the table is itself a sample taken
from the wider world -- so even very stale statistics can work quite
well (assuming that the schema is well normalized). Whereas the
autovacuum dead tuples stuff is characterized by constant change. I
mean of course it is -- that's the whole point! The central limit
theorem obviously just doesn't work for something like this  -- we
cannot generalize from a sample, at all.

I strongly suspect that it still wouldn't be a good model even if the
information was magically always correct. It might actually be worse
in some ways! Most of my arguments against the model are not arguments
against the accuracy of the statistics as such. They're actually
arguments against the fundamental relevance of the information itself,
to the actual problem at hand. We are not interested in information
for its own sake; we're interested in making better decisions about
autovacuum scheduling. Those may only have a very loose relationship.

How many dead heap-only tuples are equivalent to one LP_DEAD item?
What about page-level concentrations, and the implication for
line-pointer bloat? I don't have a good answer to any of these
questions myself. And I have my doubts that there are *any* good
answers. Even these questions are the wrong questions (they're just
less wrong). Fundamentally, we're deciding when the next autovacuum
should run against each table. Presumably it's going to have to happen
some time, and when it does happen it happens to the table as a whole.
And with a larger table it probably doesn't matter if it happens +/- a
few hours from some theoretical optimal time. Doesn't it boil down to
that?

If we taught the system to do the autovacuum work early because it's a
relatively good time for it from a system level point of view (e.g.
it's going to be less disruptive right now), that would be useful and
easy to justify on its own terms. But it would also tend to make the
system much less vulnerable to undercounting dead tuples, since in
practice there'd be a decent chance of getting to them early enough
that it at least wasn't extremely painful any one time. It's much
easier to understand that the system is quiescent than it is to
understand bloat.

BTW, I think that the insert-driven autovacuum stuff added to 13 has
made the situation with bloat significantly better. Of course it
wasn't really designed to do that at all, but it still has, kind of by
accident, in roughly the same way that antiwraparound autovacuums help
with bloat by accident. So maybe we should embrace "happy accidents"
like that a bit more. It doesn't necessarily matter if we do the right
thing for a reason that turns out to have not been the best reason.
I'm certainly not opposed to it, despite my complaints about relying
on age(relfrozenxid).

> In pgstats:
> (Various stats)

Overall, what I like about your ideas here is the emphasis on bounding
the worst case, and the emphasis on the picture at the page level over
the tuple level.

I'd like to use the visibility map more for stuff here, too. It is
totally accurate about all-visible/all-frozen pages, so many of my
complaints about statistics don't really apply. Or need not apply, at
least. If 95% of a table's pages are all-frozen in the VM, then of
course it's pretty unlikely to be the right time to VACUUM the table
if it's to clean up bloat -- this is just about the most reliable
information we have access to.

I think that the only way that more stats can help is by allowing us
to avoid doing completely the wrong thing more often. Just avoiding
disaster is a great goal for us here.

> > This sounds like a great argument in favor of suspend-and-resume as a
> > way of handling autocancellation -- no useful work needs to be thrown
> > away for AV to yield for a minute or two.

> Hm, that seems a lot of work. Without having held a lock you don't even know
> whether your old dead items still apply. Of course it'd improve the situation
> substantially, if we could get it.

I don't think it's all that much work, once the visibility map
snapshot infrastructure is there.

Why wouldn't your old dead items still apply? The TIDs must always
reference LP_DEAD stubs. Those can only be set LP_UNUSED by VACUUM,
and presumably VACUUM can only run in a way that either resumes the
suspended VACUUM session, or discards it altogether. So they're not
going to be invalidated during the period that a VACUUM is suspended,
in any way. Even if CREATE INDEX runs against the table during a
suspending VACUUM, we know that the existing LP_DEAD dead_items won't
have been indexed, so they'll be safe to mark LP_UNUSED in any case.

What am I leaving out? I can't think of anything. The only minor
caveat is that we'd probably have to discard the progress from any
individual ambulkdelete() call that happened to be running at the time
that VACUUM was interrupted.
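
The serialized state needn't be much more than this (a hypothetical
sketch -- nothing like it exists today):

    typedef struct SuspendedVacuumState
    {
        Oid             relid;
        TransactionId   OldestXmin;     /* same cutoff as the suspended run */
        MultiXactId     OldestMxact;
        BlockNumber     rel_pages;      /* heap size when the scan began */
        BlockNumber     scanned_pages;  /* where to resume the heap scan */
        int64           num_dead_items; /* length of the dead_items array,
                                         * spilled to a temp file */
    } SuspendedVacuumState;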

> > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > special cases? Maybe they should always be able to auto cancel an
> > autovacuum?
>
> Yea, I think so. It's not obvious how to best pass down that knowledge into
> ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
> way would be to change LockAcquireExtended() to get a flags argument instead
> of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP &
> LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for
> RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR*
> flags.

It would be tricky, but still relatively straightforward compared to
other things. It is often a TRUNCATE or a DROP TABLE, and we have
nothing to lose and everything to gain by changing the rules for
those.

> ISTM that some of what you write below would be addressed, at least partially,
> by the stats I proposed above. Particularly keeping some "page granularity"
> instead of "tuple granularity" stats seems helpful.

That definitely could be true, but I think that my main concern is
that we completely rely on randomly sampled statistics (except with
antiwraparound autovacuums, which happen on a schedule that has
problems of its own).

> > It's quite possible to get approximately the desired outcome with an
> > algorithm that is completely wrong -- the way that we sometimes need
> > autovacuum_freeze_max_age to deal with bloat is a great example of
> > that.
>
> Yea. I think this is part of why I like my idea about tracking more observations
> made by the last vacuum - they're quite easy to get right, and they
> self-correct, rather than potentially ending up causing ever-wronger stats.

I definitely think that there is a place for that. It has the huge
advantage of lessening our reliance on random sampling.

> Right. I think it's fundamental that we get a lot better estimates about the
> amount of work needed. Without that we have no chance of finishing autovacuums
> before problems become too big.

I like the emphasis on bounding the work required, so that it can be
spread out, rather than trying to predict dead tuples. Again, we
should focus on avoiding disaster.

> And serialized vacuum state won't help, because that still requires
> vacuum to scan all the !all-visible pages to discover them. Most of which
> won't contain dead tuples in a lot of workloads.

The main advantage of that model is that it decides what to do and
when to do it based on the actual state of the table (or the state in
the recent past). If we see a concentration of LP_DEAD items, then we
can hurry up index vacuuming. If not, maybe we'll take our time.
Again, less reliance on random sampling is a very good thing. More
dynamic decisions that are made at runtime, and delayed for as long as
possible, just seems much more promising than having better stats that
are randomly sampled.

> I don't feel like I have a good handle on what could work for 16 and what
> couldn't. Personally I think something like autovacuum_no_auto_cancel_age
> would be an improvement, but I also don't quite feel satisfied with it.

I don't either; but it should be strictly less unsatisfactory.

> Tracking the number of autovac failures seems uncontroversial and quite
> beneficial, even if the rest doesn't make it in. It'd at least let users
> monitor for tables where autovac is likely to swoop in in anti-wraparound
> mode.

I'll see what I can come up with.

> Perhaps it's worth separately tracking the number of times a backend would have
> liked to cancel autovac, but couldn't due to anti-wrap? If changes to the
> no-auto-cancel behaviour don't make it in, it'd at least allow us to collect
> more data about the prevalence of the problem and in what situations it
> occurs?  Even just adding some logging for that case seems like it'd be an
> improvement.

Hmm, maybe.

> I think with a bit of polish "Add autovacuum trigger instrumentation." ought
> to be quickly mergeable.

Yeah, I'll try to get that part out of the way quickly.

--
Peter Geoghegan



Hi,

On 2023-01-17 14:57:27 -0500, Robert Haas wrote:
> > In pg_class:
> >
> > - The number of all frozen pages, like we do for all-visible
> >
> >   That'd give us a decent upper bound for the amount of work we need to do to
> >   increase relfrozenxid. It's important that this is crash safe (thus no
> >   pg_stats), and it only needs to be written when we'd likely make other
> >   changes to the pg_class row anyway.
> 
> I'm not sure how useful this is because a lot of the work is from
> scanning the indexes.

We can increase relfrozenxid before the index scans, unless we ran out of dead
tuple space. We already have code for that in failsafe mode, in some way. But
we really also should increase


> > In pgstats:
> >
> > - The number of dead items, incremented both by the heap scan and
> >   opportunistic pruning
> >
> >   This would let us estimate how urgently we need to clean up indexes.
> 
> I don't think this is true because btree indexes are self-cleaning in
> some scenarios and not in others.

I mainly meant it from the angle of whether we need to clean up dead items in
the heap to keep the table from bloating because we stop using those pages -
which requires index scans. But even for the index scan portion, it'd give us
a better bound than we have today.

We probably should track the number of killed tuples in indexes.


> > - The xid/mxid horizons during the last vacuum
> >
> > - The number of pages with tuples that couldn't be removed due to the horizon
> >   during the last vacuum
> >
> > - The number of pages with tuples that couldn't be frozen
> 
> Not bad to know, but if the horizon we could use advances by 1, we
> can't tell whether that allows pruning nothing additional or another
> billion tuples.

Sure. But it'd be a lot better than scanning it again and again when nothing
has changed because something still holds back the horizon. We could improve upon
it later by tracking the average or even bins of ages.


> However, where XID age really falls down as a metric is that it
> doesn't tell you what it's going to take to solve the problem. The
> answer, at ten thousand feet, is always vacuum. But how long will that
> vacuum run? We don't know. Do we need the XID horizon to advance
> first, and if so, how far? We don't know. Do we need auto-cancellation
> to be disabled? We don't know. That's where we get into a lot of
> trouble here.

Agreed. I think the metrics I proposed would help some, by at least providing
sensible upper boundaries (for work) and minimal requirements (horizon during
last vacuum).

Greetings,

Andres Freund



On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), ...

Why do you think that?

> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself.

Seems a bit pessimistic. If we had unlimited resources and all
operations were infinitely fast, the optimal strategy would be to
vacuum after every insert, update, or delete. But in reality, that
would be prohibitively expensive, so we're making a trade-off.
Batching together cleanup for many table modifications reduces the
amortized cost of cleaning up after one such operation very
considerably. That's critical. But if we batch too much together, then
the actual cleanup doesn't happen soon enough to keep us out of
trouble.

If we had an oracle that could provide us with perfect information,
we'd ask it, among other things, how much work will be required to
vacuum right now, and how much benefit would we get out of doing so.
The dead tuple count is related to the first question. It's not a
direct, linear relationship, but it's not completely unrelated,
either. Maybe we could refine the estimates by gathering more or
different statistics than we do now, but ultimately it's always going
to be a trade-off between getting the work done sooner (and thus maybe
preventing table growth or a wraparound shutdown) and being able to do
more work at once (and thus being more efficient). The current set of
counters predates HOT and the visibility map, so it's not surprising if
it needs updating, but if you're arguing that the whole concept is just
garbage, I think that's an overreaction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > If you assume that there is chronic undercounting of dead tuples
> > (which I think is very common), ...
>
> Why do you think that?

For the reasons I gave about statistics, random sampling, the central
limit theorem. All that stuff. This matches the experience of Andres.
And is obviously the only explanation behind the reliance on
antiwraparound autovacuums for cleaning up bloat in larger OLTP
databases. It just fits: the dead tuples approach can sometimes be so
completely wrong that even an alternative triggering condition based
on something that is virtually unrelated to the thing we actually care
about can do much better in practice. Consistently, reliably, for a
given table/workload.

> > How many dead heap-only tuples are equivalent to one LP_DEAD item?
> > What about page-level concentrations, and the implication for
> > line-pointer bloat? I don't have a good answer to any of these
> > questions myself.
>
> Seems a bit pessimistic. If we had unlimited resources and all
> operations were infinitely fast, the optimal strategy would be to
> vacuum after every insert, update, or delete. But in reality, that
> would be prohibitively expensive, so we're making a trade-off.

To a large degree, that's my point. I don't know how to apply this
information, so having detailed information doesn't seem like the main
problem.

> If we had an oracle that could provide us with perfect information,
> we'd ask it, among other things, how much work will be required to
> vacuum right now, and how much benefit would we get out of doing so.

And then what would we do? What about costs?

Even if we were omniscient, we still wouldn't be omnipotent. We're
still subject to the laws of physics. VACUUM would still be something
that more or less works at the level of the whole table, or not at
all. So being omniscient seems kinda overrated to me. Adding more
information does not in general lead to better outcomes.

> The dead tuple count is related to the first question. It's not a
> direct, linear relationship, but it's not completely unrelated,
> either. Maybe we could refine the estimates by gathering more or
> different statistics than we do now, but ultimately it's always going
> to be a trade-off between getting the work done sooner (and thus maybe
> preventing table growth or a wraparound shutdown) and being able to do
> more work at once (and thus being more efficient). The current set of
> counters predates HOT and the visibility map, so it's not surprising if
> it needs updating, but if you're arguing that the whole concept is just
> garbage, I think that's an overreaction.

What I'm arguing is that principally relying on any one thing is
garbage. If you have only one thing that creates pressure to VACUUM
then there can be a big impact whenever it turns out to be completely
wrong. Whereas if VACUUM can run because of (say) 3 moderate signals
taken together, then it's much less likely that we'll be completely
wrong. In general my emphasis is on avoiding disaster in all its
forms. Vacuuming somewhat early more often is perhaps suboptimal, but
far from a disaster. It's the kind of thing that we can manage.

By all means, let's make the dead tuples/dead items stuff less naive
(e.g. make it distinguish between LP_DEAD items and dead heap-only
tuples). But even then, we shouldn't continue to completely rely on it
in the way that we do right now. In other words, I'm fine with adding
more information that is more accurate, as long as we treat that
information as somewhat suspect, and certainly not as
something to completely rely on if at all possible. In particular, we
need to think about both costs and benefits at all times.

-- 
Peter Geoghegan



On Tue, Jan 17, 2023 at 5:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Why do you think that?
>
> For the reasons I gave about statistics, random sampling, the central
> limit theorem. All that stuff. This matches the experience of Andres.
> And is obviously the only explanation behind the reliance on
> antiwraparound autovacuums for cleaning up bloat in larger OLTP
> databases. It just fits: the dead tuples approach can sometimes be so
> completely wrong that even an alternative triggering condition based
> on something that is virtually unrelated to the thing we actually care
> about can do much better in practice. Consistently, reliably, for a
> given table/workload.

Hmm, I don't know. I have no intuition one way or the other for
whether we're undercounting dead tuples, and I don't understand what
would cause us to do that. I thought that we tracked that accurately,
as part of the statistics system, not by sampling
(pg_stat_all_tables.n_dead_tup).

But, I think there are a number of other explanations for why we tend
to rely on antiwraparound vacuums more than we should.
Auto-cancellation. Skipping tables that are locked, or pages that are
pinned. A cost limit that is too low relative to the size of the
database, so that eventually all tables are in wraparound danger all
the time. The fact that we can vacuum tables uselessly, without
accomplishing anything, because the XID horizon is too new, but we
don't know that so we just try to vacuum anyway. And then we repeat
that useless work in an infinite loop. The fact that the system's idea
of when a vacuum needs to happen grows with
autovacuum_vacuum_scale_factor, but that actually gets too big too
fast, so that eventually it never triggers vacuuming at all, or at
least not before XID age does.

I think we ought to fire autovacuum_vacuum_scale_factor out of an
airlock. It's not the right model, and I think many people have been
aware that it's not the right model for a decade, and we haven't been
creative enough to come up with anything better. We *know* that you
have to lower this value for large tables or they just don't get
vacuumed often enough. That means we have some idea how often they
ought to be vacuumed. I'm sure I'm not the person who has the best
intuition on that point, but I bet people who have been responsible
for large production systems have some decent ideas in that area. We
should find out what that intuition is and come up with a new formula
that matches the intuition of people with experience doing this sort
of thing.

e.g.

1. When computing autovacuum_vacuum_threshold + table_size *
autovacuum_vacuum_scale_factor, if the result exceeds the value of a
new parameter autovacuum_vacuum_maximum_threshold, then clamp the
result to that value (see the sketch after this list).

2. When computing autovacuum_vacuum_threshold + table_size *
autovacuum_vacuum_scale_factor, if the result exceeds 80% of the
number of dead TIDs we could store, clamp it to that number.

3. Change the formula altogether to use a square root or a cube root
or a logarithm somewhere.
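
Going back to option 1, the change could be as small as this, against the
existing threshold computation in autovacuum.c's relation_needs_vacanalyze()
(the GUC autovacuum_vacuum_maximum_threshold is hypothetical):

    vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;

    /* option 1: clamp to the hypothetical new GUC; -1 disables the clamp */
    if (autovacuum_vacuum_maximum_threshold >= 0 &&
        vacthresh > (float4) autovacuum_vacuum_maximum_threshold)
        vacthresh = (float4) autovacuum_vacuum_maximum_threshold;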

I think we also ought to invent some sort of better cost limit system
that doesn't shoot you in the foot automatically as the database
grows. Nobody actually wants to limit the rate at which the database
vacuums stuff to a constant. What they really want to do is limit it
to a rate that is somewhat faster than the minimum rate needed to
avoid disaster. We should try to develop metrics for whether vacuum is
keeping up. I think one good one would be duty cycle -- if we have N
vacuum workers, then over a period of K seconds we could have done as
much as N*K process-seconds of vacuum work, and as little as 0. So
figure out how many seconds of vacuum work we actually did, and divide
that by N*K to get a percentage. If it's over, say, 90%, then we are
not keeping up. We should dynamically raise the cost limit until we
do. And drop it back down later when things look better.
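
As a sketch of that feedback loop (all of the names here are made up; the
real bookkeeping would presumably live in the launcher):

    /* fraction of available worker time actually spent vacuuming */
    double  duty = busy_seconds / (autovacuum_max_workers * interval_seconds);

    if (duty > 0.90)
        effective_cost_limit *= 2;      /* not keeping up: throttle less */
    else if (duty < 0.50 && effective_cost_limit > configured_cost_limit)
        effective_cost_limit /= 2;      /* recovering: drop back down */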

I don't actually see any reason why dead tuples, even counted in a
relatively stupid way, isn't fundamentally good enough to get all
tables vacuumed before we hit the XID age cutoff. It doesn't actually
do that right now, but I feel like that must be because we're doing
other stupid things, not because there's anything that terrible about
the metric as such. Maybe that's wrong, but I find it hard to imagine.
If I imagine a world where vacuum always gets started when the number
of dead tuples hits some reasonable bound (rather than the
unreasonable bound that the scale factor stuff computes) and it always
cleans up those dead tuples (instead of doing a lot of work to clean
up nothing at all, or doing a lot of work to clean up only a small
fraction of those dead tuples, or cancelling itself, or skipping the
table that has the problem because it's locked, or running with an
unreasonably low cost limit, or otherwise being unable to GET THE JOB
DONE) then how do we ever reach autovacuum_freeze_max_age? I think it
would still be possible, but only if the XID consumption rate of the
server is so high that we chunk through 300 million XIDs in the time
it takes to perform an un-throttled vacuum of the table. I think
that's a real threat and will probably be a bigger one in ten years,
but it's only one of many things that are going wrong right now.

> Even if we were omniscient, we still wouldn't be omnipotent.

A sound theological point!

> We're
> still subject to the laws of physics. VACUUM would still be something
> that more or less works at the level of the whole table, or not at
> all. So being omniscient seems kinda overrated to me. Adding more
> information does not in general lead to better outcomes.

Yeah, I think that's true. In particular, it's not much use being
omniscient but stupid. It would be better to have limited information
and be smart about what you did with it.

> What I'm arguing is that principally relying on any one thing is
> garbage. If you have only one thing that creates pressure to VACUUM
> then there can be a big impact whenever it turns out to be completely
> wrong. Whereas if VACUUM can run because of (say) 3 moderate signals
> taken together, then it's much less likely that we'll be completely
> wrong. In general my emphasis is on avoiding disaster in all its
> forms. Vacuuming somewhat early more often is perhaps suboptimal, but
> far from a disaster. It's the kind of thing that we can manage.

True, although it can be overdone. An extra vacuum on a big table with
some large indexes that end up getting scanned can be very expensive
even if the table itself is almost entirely all-visible. We can't
afford to make too many mistakes in the direction of vacuuming early
in such cases.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:

On Wed, Jan 18, 2023 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > It just fits: the dead tuples approach can sometimes be so
> > completely wrong that even an alternative triggering condition based
> > on something that is virtually unrelated to the thing we actually care
> > about can do much better in practice. Consistently, reliably, for a
> > given table/workload.
>
> Hmm, I don't know. I have no intuition one way or the other for
> whether we're undercounting dead tuples, and I don't understand what
> would cause us to do that. I thought that we tracked that accurately,
> as part of the statistics system, not by sampling
> (pg_stat_all_tables.n_dead_tup).

It's both, kind of.

pgstat_report_analyze() will totally override the
tabentry->dead_tuples information that drives autovacuum.c, based on
an estimate derived from a random sample -- which seems to me to be an
approach that just doesn't have any sound theoretical basis. So while
there is a sense in which we track dead tuples incrementally and
accurately using the statistics system, we occasionally call
pgstat_report_analyze (and pgstat_report_vacuum) like this, so AFAICT
we might as well not even bother tracking things reliably the rest of
the time.

Random sampling works because the things that you don't sample are
very well represented by the things that you do sample. That's why
even very stale optimizer statistics can work quite well (and why the
EAV anti-pattern makes query optimization impossible) -- the
distribution is often fixed, more or less. The statistics generalize
very well because the data meets certain underlying assumptions that
all data stored in a relational database is theoretically supposed to
meet. Whereas with dead tuples, the whole point is to observe and
count dead tuples so that autovacuum can then go remove the dead
tuples -- which then utterly changes the situation! That's a huge
difference.

ISTM that you need a *totally* different approach for something that's
fundamentally dynamic, which is what this really is. Think about how
the random sampling will work in a very large table with concentrated
updates. The modified pages need to outweigh the large majority of
pages in the table that can be skipped by VACUUM anyway.

I wonder how workable it would be to just teach pgstat_report_analyze
and pgstat_report_vacuum to keep out of this, or to not update the
stats unless it's to increase the number of dead_tuples...
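
The most naive version of that would be a tiny change to the update in
pgstat_report_analyze(), something like this (a sketch only, ignoring the
existing adjustment for tuples added and deleted by the ANALYZE
transaction itself):

    tabentry->live_tuples = livetuples;
    tabentry->dead_tuples = Max(tabentry->dead_tuples, deadtuples);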

> I think we ought to fire autovacuum_vacuum_scale_factor out of an
> airlock.

Couldn't agree more. I think that this and the underlying statistics
are the really big problem as far as under-vacuuming is concerned.

> I think we also ought to invent some sort of better cost limit system
> that doesn't shoot you in the foot automatically as the database
> grows. Nobody actually wants to limit the rate at which the database
> vacuums stuff to a constant. What they really want to do is limit it
> to a rate that is somewhat faster than the minimum rate needed to
> avoid disaster. We should try to develop metrics for whether vacuum is
> keeping up.

Definitely agree that doing some kind of dynamic updating is
promising. What we thought at the start versus what actually happened.
Something cyclic, just like autovacuum itself.

> I don't actually see any reason why dead tuples, even counted in a
> relatively stupid way, isn't fundamentally good enough to get all
> tables vacuumed before we hit the XID age cutoff. It doesn't actually
> do that right now, but I feel like that must be because we're doing
> other stupid things, not because there's anything that terrible about
> the metric as such. Maybe that's wrong, but I find it hard to imagine.

On reflection, maybe you're right here. Maybe it's true that the
bigger problem is just that the implementation is bad, even on its own
terms -- since it's pretty bad! Hard to say at this point.

Depends on how you define it, too. Statistical sampling is just not
fit for purpose here. But is that a problem with
autovacuum_vacuum_scale_factor? I may have said words that could
reasonably be interpreted that way, but I'm not prepared to blame it
on the underlying autovacuum_vacuum_scale_factor model now. It's
fuzzy.

> > We're
> > still subject to the laws of physics. VACUUM would still be something
> > that more or less works at the level of the whole table, or not at
> > all. So being omniscient seems kinda overrated to me. Adding more
> > information does not in general lead to better outcomes.
>
> Yeah, I think that's true. In particular, it's not much use being
> omniscient but stupid. It would be better to have limited information
> and be smart about what you did with it.

I would put it like this: autovacuum shouldn't ever be a sucker. It
should pay attention to disconfirmatory signals. The information that
drives its decision making process should be treated as provisional.

Even if the information was correct at one point, the contents of the
table are constantly changing in a way that could matter enormously.
So we should be paying attention to where the table is going --  and
even where it might be going -- not just where it is, or was.

> True, although it can be overdone. An extra vacuum on a big table with
> some large indexes that end up getting scanned can be very expensive
> even if the table itself is almost entirely all-visible. We can't
> afford to make too many mistakes in the direction of vacuuming early
> in such cases.

No, but we can afford to make some -- and can detect when it happened
after the fact. I would rather err on the side of over-vacuuming,
especially if the system is smart enough to self-correct when that
turns out to be the wrong approach. One of the advantages of running
VACUUM sooner is that it provides us with relatively reliable
information about the needs of the table.

We can also cheat, sort of. If we find another justification for
autovacuuming (e.g., it's a quiet time for the system as a whole), and
it works out to help with this other problem, it may be just as good
for users.

--
Peter Geoghegan



On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> pgstat_report_analyze() will totally override the
> tabentry->dead_tuples information that drives autovacuum.c, based on
> an estimate derived from a random sample -- which seems to me to be an
> approach that just doesn't have any sound theoretical basis.

Yikes. I think we don't have a choice but to have a method to correct
the information somehow, because AFAIK the statistics system is not
crash-safe. But that approach does seem to carry significant risk of
overwriting correct information with wrong information.

> On reflection, maybe you're right here. Maybe it's true that the
> bigger problem is just that the implementation is bad, even on its own
> terms -- since it's pretty bad! Hard to say at this point.
>
> Depends on how you define it, too. Statistical sampling is just not
> fit for purpose here. But is that a problem with
> autovacuum_vacuum_scale_factor? I may have said words that could
> reasonably be interpreted that way, but I'm not prepared to blame it
> on the underlying autovacuum_vacuum_scale_factor model now. It's
> fuzzy.

Yep. I think what we should try to evaluate is which number is
furthest from the truth. My guess is that the threshold is so high
relative to what a reasonable value would be that you can't get any
benefit out of making the dead tuple count more accurate. Like, if the
threshold is 100x too high, or something, then who cares how accurate
the dead tuples number is? It's going to be insufficient to trigger
vacuuming whether it's right or wrong. We should try substituting a
less-bogus threshold calculation and see what happens then. An
alternative theory is that the threshold is fine and we're only
failing to reach it because the dead tuple calculation is so
inaccurate. Maybe that's even true in some scenarios, but I bet that
it's never the issue when people have really big tables. The fact that
I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
1TB of bloat in my 10TB table. Among other problems, I can't even
vacuum away that much bloat in one index pass, because autovacuum
can't use enough work memory for that. Also, the absolute space
wastage matters.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > pgstat_report_analyze() will totally override the
> > tabentry->dead_tuples information that drives autovacuum.c, based on
> > an estimate derived from a random sample -- which seems to me to be an
> > approach that just doesn't have any sound theoretical basis.
>
> Yikes. I think we don't have a choice but to have a method to correct
> the information somehow, because AFAIK the statistics system is not
> crash-safe. But that approach does seem to carry significant risk of
> overwriting correct information with wrong information.

This situation is really quite awful, so maybe we should do something
about it soon, in the scope of the Postgres 16 work on autovacuum that
is already underway. In fact I think that the problem here is so bad
that even something slightly less naive would be far more effective.

You're right to point out that pgstat_report_analyze needs to update
the stats in case there is a hard crash, of course. But there is
plenty of context with which to make better decisions close at hand.
For example, I bet that pgstat_report_analyze already does a pretty
good job of estimating live_tuples -- my spiel about statistics mostly
doesn't apply to live_tuples. Suppose that we notice that its new
estimate for live_tuples approximately matches what the stats
subsystem already thought about live_tuples, while dead_tuples is far
far lower. We shouldn't be so credulous as to believe the new
dead_tuples estimate at that point.

Perhaps we can change nothing about dead_tuples at all when this
happens. Or perhaps we can set dead_tuples to a value that is scaled
from the old estimate. The new dead_tuples value could be derived by
taking the ratio of the old live_tuples to the old dead_tuples, and
then using that to scale from the new live_tuples. This is just a
first pass, to see what you and others think. Even very simple
heuristics seem like they could make things much better.
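
To make that concrete, here is the kind of heuristic I mean, as a
minimal sketch (the function name and the 10% "approximately matches"
threshold are invented for illustration, not a concrete proposal):

    #include <math.h>    /* fabs() */

    /*
     * If ANALYZE's new live_tuples estimate is close to what we already
     * believed, don't take its sampled dead_tuples at face value; rescale
     * the old dead_tuples so that the old dead/live ratio is preserved.
     */
    static double
    analyze_dead_tuples_guess(double old_live, double old_dead,
                              double new_live, double new_dead_sampled)
    {
        if (old_live > 0 && fabs(new_live - old_live) < 0.10 * old_live)
            return (old_dead / old_live) * new_live;

        return new_dead_sampled;    /* otherwise trust the sample */
    }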

Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
acquire_sample_rows() could pay attention to -- especially in larger
tables, where the difference between all pages and just the
all-visible subset of pages is most likely to matter. The more sampled
pages that had PD_ALL_VISIBLE set, the less credible the new
dead_tuples estimate will be (relative to existing information), and
so pgstat_report_analyze() should prefer the new estimate over the old
one in proportion to that.
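
In sketch form (again with invented names), the weighting could be a
simple linear blend:

    /*
     * The more sampled pages had PD_ALL_VISIBLE set, the more weight the
     * existing dead_tuples value gets relative to the new sampled estimate.
     */
    static double
    blend_dead_tuples(double old_dead, double new_dead_sampled,
                      double sampled_pages, double sampled_allvisible)
    {
        double      w = 1.0;    /* trust the sample fully by default */

        if (sampled_pages > 0)
            w = 1.0 - (sampled_allvisible / sampled_pages);

        return w * new_dead_sampled + (1.0 - w) * old_dead;
    }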

We probably shouldn't change anything about pgstat_report_vacuum as
part of this effort to make pgstat_report_analyze less terrible in the
near term. It certainly has its problems (what is true for pages that
VACUUM scanned at the end of VACUUM is far from representative of new
pages!), but it's probably much less of a contributor to issues like
those that Andres reports seeing.

BTW, one of the nice things about the insert-driven autovacuum stats
is that pgstat_report_analyze doesn't have an opinion about how many
tuples were inserted since the last VACUUM ran. It does have other
problems, but they seem less serious to me.

> Yep. I think what we should try to evaluate is which number is
> furthest from the truth. My guess is that the threshold is so high
> relative to what a reasonable value would be that you can't get any
> benefit out of making the dead tuple count more accurate. Like, if the
> threshold is 100x too high, or something, then who cares how accurate
> the dead tuples number is?

Right. Or if we don't make any reasonable distinction between LP_DEAD
items and dead heap-only tuples, then the total number of both things
together may matter very little. Better to be approximately correct
than exactly wrong. Deliberately introducing a bias to lower the
variance is a perfectly valid approach.

> Maybe that's even true in some scenarios, but I bet that
> it's never the issue when people have really big tables. The fact that
> I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
> 1TB of bloat in my 10TB table. Among other problems, I can't even
> vacuum away that much bloat in one index pass, because autovacuum
> can't use enough work memory for that. Also, the absolute space
> wastage matters.

I certainly agree with all that.

FWIW, part of my mental model with VACUUM is that the rules kind of
change in the case of a big table. We're far more vulnerable to issues
such as (say) waiting for cleanup locks because the overall cadence
used by autovacuum is so infrequent relative to everything else.
There are more opportunities for things to go wrong, worse
consequences when they do go wrong, and greater potential for the
problems to compound.

-- 
Peter Geoghegan



Hi,

On 2023-01-17 12:08:01 -0800, Peter Geoghegan wrote:
> > I think that's not the fault of relfrozenxid as a trigger, but that we simply
> > don't keep enough other stats. We should imo at least keep track of:
> 
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), then of course anything that triggers
> vacuuming is going to help with that problem -- it might be totally
> inadequate, but still make the critical difference by not allowing the
> system to become completely destabilized. I absolutely accept that
> users that are relying on that exist, and that those users ought to
> not have things get even worse -- I'm pragmatic. But overall, what we
> should be doing is fixing the real problem, which is that the dead
> tuples accounting is deeply flawed. Actually, it's not just that the
> statistics are flat out wrong; the whole model is flat-out wrong.

I think that depends on what "whole model" encompasses...


> The assumptions that work well for optimizer statistics quite simply
> do not apply here. Random sampling for this is just wrong, because
> we're not dealing with something that follows a distribution that can
> be characterized with a sufficiently large sample. With optimizer
> statistics, the entire contents of the table is itself a sample taken
> from the wider world -- so even very stale statistics can work quite
> well (assuming that the schema is well normalized). Whereas the
> autovacuum dead tuples stuff is characterized by constant change. I
> mean of course it is -- that's the whole point! The central limit
> theorem obviously just doesn't work for something like this  -- we
> cannot generalize from a sample, at all.

If we were to stop dropping stats after crashes, I think we likely could
afford to stop messing with dead tuple stats during analyze. Right now it's
valuable to some degree because it's a way to reasonably quickly recover from
lost stats.

The main way to collect inserted / dead tuple info for autovacuum's benefit is
via the stats collected when making changes.

We probably ought to simply not update dead tuples after analyze if the stats
entry has information about a prior [auto]vacuum. Or at least split the
fields.
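
Sketch of what I mean, inside pgstat_report_analyze(); field names are
from memory, so treat it as pseudocode rather than a patch:

    /*
     * Only let ANALYZE overwrite dead_tuples when there's no record of a
     * prior [auto]vacuum for this stats entry, i.e. we're recovering from
     * a crash / stats reset.  Otherwise leave the incrementally maintained
     * value alone.
     */
    if (tabentry->vacuum_count == 0 && tabentry->autovac_vacuum_count == 0)
        tabentry->dead_tuples = deadtuples;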


> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself. And I have my doubts that there are *any* good
> answers.

Hence my suggestion to track several of these via page level stats. In the big
picture it doesn't really matter that much whether there's 10 or 100 (dead
tuples|items) on a page that needs to be removed. It matters that the page
needs to be processed.


> Even these questions are the wrong questions (they're just less wrong).

I don't agree. Nothing is going to be perfect, but you're not going to be able
to do sensible vacuum scheduling without some stats, and it's fine if those
are an approximation, as long as the approximation makes some sense.


> I'd like to use the visibility map more for stuff here, too. It is
> totally accurate about all-visible/all-frozen pages, so many of my
> complaints about statistics don't really apply. Or need not apply, at
> least. If 95% of a table's pages are all-frozen in the VM, then of
> course it's pretty unlikely to be the right time to VACUUM the table
> if it's to clean up bloat -- this is just about the most reliable
> information we have access to.

I think querying that from stats is too expensive for most things. I suggested
tracking all-frozen in pg_class. Perhaps we should also track when pages are
*removed* from the VM in pgstats, I don't think we do today. That should give
a decent picture?
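
E.g. something like this wherever we clear VM bits (the pgstat call is
made up, but visibilitymap_clear() already tells us whether it actually
cleared anything):

    if (visibilitymap_clear(rel, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS))
        pgstat_count_vm_bits_cleared(rel, 1);    /* hypothetical counter */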



> > > This sounds like a great argument in favor of suspend-and-resume as a
> > > way of handling autocancellation -- no useful work needs to be thrown
> > > away for AV to yield for a minute or two.
> 
> > Hm, that seems a lot of work. Without having held a lock you don't even know
> > whether your old dead items still apply. Of course it'd improve the situation
> > substantially, if we could get it.
> 
> I don't think it's all that much work, once the visibility map
> snapshot infrastructure is there.
> 
> Why wouldn't your old dead items still apply?

Well, for one the table could have been rewritten. Of course we can add the
code to deal with that, but it is definitely something to be aware of. There
might also be some oddities around indexes getting added / removed.



> > > Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> > > special cases? Maybe they should always be able to auto cancel an
> > > autovacuum?
> >
> > Yea, I think so. It's not obvious how to best pass down that knowledge into
> > ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
> > way would be to change LockAcquireExtended() to get a flags argument instead
> > of reportMemoryError, and then we could add LOCK_ACQUIRE_INTENT_DROP &
> > LOCK_ACQUIRE_INTENT_TRUNCATE or such. Then the same for
> > RangeVarGetRelidExtended(). It already "customizes" how to lock based on RVR*
> > flags.
> 
> It would be tricky, but still relatively straightforward compared to
> other things. It is often a TRUNCATE or a DROP TABLE, and we have
> nothing to lose and everything to gain by changing the rules for
> those.

Probably should also change the rules for VACUUM and VACUUM FULL / CLUSTER, if
we do it. Manual VACUUM will often be faster due to the cost limits, and
VACUUM FULL can be *considerably* faster than VACUUM once you hit bad bloat.


Greetings,

Andres Freund



On Wed, Jan 18, 2023 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Suppose that we notice that its new
> estimate for live_tuples approximately matches what the stats
> subsystem already thought about live_tuples, while dead_tuples is far
> far lower. We shouldn't be so credulous as to believe the new
> dead_tuples estimate at that point.
>
> Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
> acquire_sample_rows() could pay attention to -- especially in larger
> tables, where the difference between all pages and just the
> all-visible subset of pages is most likely to matter. The more sampled
> pages that had PD_ALL_VISIBLE set, the less credible the new
> dead_tuples estimate will be (relative to existing information), and
> so pgstat_report_analyze() should prefer the new estimate over the old
> one in proportion to that.

I don't know enough about the specifics of how this works to have an
intelligent opinion about how likely these particular ideas are to
work out. However, I think it's risky to look at estimates and try to
infer whether they are reliable. It's too easy to be wrong. What we
really want to do is anchor our estimates to some data source that we
know we can trust absolutely. If you trust possibly-bad data less, it
screws up your estimates more slowly, but it still screws them up.

If Andres is correct that what really matters is the number of pages
we're going to have to dirty, we could abandon counting dead tuples
altogether and just count not-all-visible pages in the VM map. That
would be cheap enough to recompute periodically. However, it would
also be a big behavior change from the way we do things now, so I'm
not sure it's a good idea. Still, a quantity that we can be certain
we're measuring accurately is better than one we can't measure
accurately even if it's a somewhat worse proxy for the thing we really
care about. There's a ton of value in not being completely and totally
wrong.

> FWIW, part of my mental model with VACUUM is that the rules kind of
> change in the case of a big table. We're far more vulnerable to issues
> such as (say) waiting for cleanup locks because the overall cadence
> used by autovacuum is so infrequent relative to everything else.
> There are more opportunities for things to go wrong, worse
> consequences when they do go wrong, and greater potential for the
> problems to compound.

Yes. A lot of parts of PostgreSQL, including this one, were developed
a long time ago when PostgreSQL databases were a lot smaller than they
often are today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know enough about the specifics of how this works to have an
> intelligent opinion about how likely these particular ideas are to
> work out. However, I think it's risky to look at estimates and try to
> infer whether they are reliable. It's too easy to be wrong. What we
> really want to do is anchor our estimates to some data source that we
> know we can trust absolutely. If you trust possibly-bad data less, it
> screws up your estimates more slowly, but it still screws them up.

Some of what I'm proposing arguably amounts to deliberately adding a
bias. But that's not an unreasonable thing in itself. I think of it as
related to the bias-variance tradeoff, which is a concept that comes
up a lot in machine learning and statistical inference.

We can afford to be quite imprecise at times, especially if we choose
a bias that we know has much less potential to do us harm -- some
mistakes hurt much more than others. We cannot afford to ever be
dramatically wrong, though -- especially in the direction of vacuuming
less often.

Besides, there is something that we *can* place a relatively high
degree of trust in that will still be in the loop here: VACUUM itself.
If VACUUM runs then it'll call pgstat_report_vacuum(), which will set
the record straight in the event of over estimating dead tuples. To
some degree the problem of over estimating dead tuples is
self-limiting.

> If Andres is correct that what really matters is the number of pages
> we're going to have to dirty, we could abandon counting dead tuples
> altogether and just count not-all-visible pages in the VM map.

That's what matters most from a cost point of view IMV. So it's a big
part of the overall picture, but not everything. It tells us
relatively little about the benefits, except perhaps when most pages
are all-visible.

-- 
Peter Geoghegan



Hi,

On 2023-01-18 12:15:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 11:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > pgstat_report_analyze() will totally override the
> > > tabentry->dead_tuples information that drives autovacuum.c, based on
> > > an estimate derived from a random sample -- which seems to me to be an
> > > approach that just doesn't have any sound theoretical basis.
> >
> > Yikes. I think we don't have a choice but to have a method to correct
> > the information somehow, because AFAIK the statistics system is not
> > crash-safe. But that approach does seem to carry significant risk of
> > overwriting correct information with wrong information.

I suggested nearby to only have ANALYZE update dead_tuples if there's been no
[auto]vacuum since the stats entry was created. That allows recovering from
stats resets, be it via crashes or explicitly. What do you think?

To add insult to injury, we overwrite accurate information gathered by VACUUM
with bad information gathered by ANALYZE if you do VACUUM ANALYZE.



One complicating factor is that VACUUM sometimes computes an incrementally
more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
computes something sane. I unintentionally encountered one when I was trying
something while writing this email, reproducer attached.


VACUUM (DISABLE_PAGE_SKIPPING) foo;
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    9000001 │     500000 │
└────────────┴────────────┘

after one VACUUM:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    8549905 │     500000 │
└────────────┴────────────┘

after 9 more VACUUMs:
┌────────────┬────────────┐
│ n_live_tup │ n_dead_tup │
├────────────┼────────────┤
│    5388421 │     500000 │
└────────────┴────────────┘
(1 row)


I briefly tried it out, and it does *not* reproduce in 11, but does in
12. Haven't dug into what the cause is, but we probably use the wrong
denominator somewhere...

Greetings,

Andres Freund

Attachment

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Some of what I'm proposing arguably amounts to deliberately adding a
> bias. But that's not an unreasonable thing in itself. I think of it as
> related to the bias-variance tradeoff, which is a concept that comes
> up a lot in machine learning and statistical inference.

To be clear, I was thinking of unreservedly trusting what
pgstat_report_analyze() says about dead_tuples in the event of its
estimate increasing our dead_tuples estimate, while being skeptical
(to a varying degree) when it's the other way around.

But now I need to go think about what Andres just brought up...

-- 
Peter Geoghegan



Hi,

On 2023-01-18 13:08:44 -0800, Andres Freund wrote:
> One complicating factor is that VACUUM sometimes computes an incrementally
> more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> computes something sane. I unintentionally encountered one when I was trying
> something while writing this email, reproducer attached.
> 
> 
> VACUUM (DISABLE_PAGE_SKIPPING) foo;
> SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relid = 'foo'::regclass;
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    9000001 │     500000 │
> └────────────┴────────────┘
> 
> after one VACUUM:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    8549905 │     500000 │
> └────────────┴────────────┘
> 
> after 9 more VACUUMs:
> ┌────────────┬────────────┐
> │ n_live_tup │ n_dead_tup │
> ├────────────┼────────────┤
> │    5388421 │     500000 │
> └────────────┴────────────┘
> (1 row)
> 
> 
> I briefly tried it out, and it does *not* reproduce in 11, but does in
> 12. Haven't dug into what the cause is, but we probably use the wrong
> denominator somewhere...

Oh, it does actually reproduce in 11 too - my script just didn't see it
because it was "too fast". For some reason it takes longer on versions before 12 for the new
pgstat snapshot to be available. If I add a few sleeps, it shows in 11.

The real point of change appears to be 10->11.

There's a relevant looking difference in the vac_estimate_reltuples call:
10:
    /* now we can compute the new value for pg_class.reltuples */
    vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
                                                         nblocks,
                                                         vacrelstats->tupcount_pages,
                                                         num_tuples);

11:
    /* now we can compute the new value for pg_class.reltuples */
    vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
                                                          nblocks,
                                                          vacrelstats->tupcount_pages,
                                                          live_tuples);
which points to:

commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2018-03-22 15:47:29 -0400

    Sync up our various ways of estimating pg_class.reltuples.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund <andres@anarazel.de> wrote:
> I suggested nearby to only have ANALYZE update dead_tuples if there's been no
> [auto]vacuum since the stats entry was created. That allows recovering from
> stats resets, be it via crashes or explicitly. What do you think?

I like that idea. It's far simpler than the kind of stuff I was
thinking about, and probably just as effective. Even if it introduces
some unforeseen problem (which seems unlikely), we can still rely on
pgstat_report_vacuum() to set things straight before too long.

Are you planning on writing a patch for this? I'd be very interested
in seeing this through. Could definitely review it.

-- 
Peter Geoghegan



Hi,

On 2023-01-18 13:42:40 -0800, Andres Freund wrote:
> The real point of change appears to be 10->11.
>
> There's a relevant looking difference in the vac_estimate_reltuples call:
> 10:
>     /* now we can compute the new value for pg_class.reltuples */
>     vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
>                                                          nblocks,
>                                                          vacrelstats->tupcount_pages,
>                                                          num_tuples);
>
> 11:
>     /* now we can compute the new value for pg_class.reltuples */
>     vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
>                                                           nblocks,
>                                                           vacrelstats->tupcount_pages,
>                                                           live_tuples);
> which points to:
>
> commit 7c91a0364fcf5d739a09cc87e7adb1d4a33ed112
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   2018-03-22 15:47:29 -0400
>
>     Sync up our various ways of estimating pg_class.reltuples.

The problem with the change is here:

    /*
     * Okay, we've covered the corner cases.  The normal calculation is to
     * convert the old measurement to a density (tuples per page), then
     * estimate the number of tuples in the unscanned pages using that figure,
     * and finally add on the number of tuples in the scanned pages.
     */
    old_density = old_rel_tuples / old_rel_pages;
    unscanned_pages = (double) total_pages - (double) scanned_pages;
    total_tuples = old_density * unscanned_pages + scanned_tuples;
    return floor(total_tuples + 0.5);


Because we'll re-scan the pages for not-yet-removable rows in subsequent
vacuums, the next vacuum will process the same pages again. By using
scanned_tuples = live_tuples, we basically remove not-yet-removable tuples
from reltuples, each time.

The commit *did* try to account for that to some degree:

+    /* also compute total number of surviving heap entries */
+    vacrelstats->new_rel_tuples =
+        vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples;


but new_rel_tuples isn't used for pg_class.reltuples or pgstat.


This is pretty nasty. We use reltuples for a lot of things. And while analyze
might fix it sometimes, that won't reliably be the case, particularly when
there are repeated autovacuums due to a long-running transaction - there's no
cause for auto-analyze to trigger again soon, while autovacuum will go at it
again and again.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund <andres@anarazel.de> wrote:
> The problem with the change is here:
>
>         /*
>          * Okay, we've covered the corner cases.  The normal calculation is to
>          * convert the old measurement to a density (tuples per page), then
>          * estimate the number of tuples in the unscanned pages using that figure,
>          * and finally add on the number of tuples in the scanned pages.
>          */
>         old_density = old_rel_tuples / old_rel_pages;
>         unscanned_pages = (double) total_pages - (double) scanned_pages;
>         total_tuples = old_density * unscanned_pages + scanned_tuples;
>         return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone
to issues like this because it just doesn't have access to very much
information each time it runs. It can only see the delta between what
VACUUM just saw, and what the last VACUUM (or possibly the last
ANALYZE) saw according to pg_class. You're always going to find
weaknesses in such a model if you go looking for them. You're always
going to find a way to salami slice your way from good information to
total nonsense, if you pick the right/wrong test case, which runs
VACUUM in a way that allows whatever bias there may be to accumulate.
It's sort of like the way floating point values can become very
inaccurate through a process that allows many small inaccuracies to
accumulate over time.

Maybe you're right to be concerned to the degree that you're concerned
-- I'm not sure. I'm just adding what I see as important context.

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Maybe you're right to be concerned to the degree that you're concerned
> -- I'm not sure. I'm just adding what I see as important context.

The problems in this area tend to be that vac_estimate_reltuples()
behaves as if it sees a random sample, when in fact it's far from
random -- it's the same scanned_pages as last time, and the ten other
times before that. That's a feature of your test case, and other
similar vac_estimate_reltuples test cases that I came up with in the
past. Both scenarios involved skipping using the visibility map in
multiple successive VACUUM operations.

Perhaps we should make vac_estimate_reltuples focus on the pages that
VACUUM newly set all-visible each time (not including all-visible
pages that got scanned despite being all-visible) -- only that subset
of scanned_pages seems to be truly relevant. That way you wouldn't be
able to do stuff like this. We'd have to start explicitly tracking the
number of pages that were newly set in the VM in vacuumlazy.c to be
able to do that, but that seems like a good idea anyway.
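
Expressed against the existing calculation, the change might look
something like this ("settled" and the related variables are invented
here purely for illustration):

    /*
     * Extrapolate only from the pages VACUUM newly set all-visible
     * ("settled" pages).  Scanned-but-unsettled pages are treated just
     * like unscanned pages: they keep the old density.
     */
    old_density = old_rel_tuples / old_rel_pages;
    unsettled_pages = (double) total_pages - (double) settled_pages;
    total_tuples = old_density * unsettled_pages + tuples_on_settled_pages;
    return floor(total_tuples + 0.5);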

This probably has consequences elsewhere, but maybe that's okay. We
know when the existing pg_class has no information, since that is
explicitly encoded by a reltuples of -1. Obviously we'd need to be
careful about stuff like that. Overall, the danger from assuming that
"unsettled" pages (pages that couldn't be newly set all-visible by
VACUUM) have a generic tuple density seems less than the danger of
assuming that they're representative. We know that we're bound to scan
these same pages in the next VACUUM anyway, so they'll have another
chance to impact our view of the table's tuple density at that time.

--
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The problems in this area tend to be that vac_estimate_reltuples()
> behaves as if it sees a random sample, when in fact it's far from
> random -- it's the same scanned_pages as last time, and the ten other
> times before that. That's a feature of your test case, and other
> similar vac_estimate_reltuples test cases that I came up with in the
> past. Both scenarios involved skipping using the visibility map in
> multiple successive VACUUM operations.

FWIW, the problem in your test case just goes away if you just change this line:

DELETE FROM foo WHERE i < (10000000 * 0.1)

To this:

DELETE FROM foo WHERE i < (10000000 * 0.065)

Which is not a huge difference, overall. This effect is a consequence
of the heuristics I added in commit 74388a1a, so it's only present on
Postgres 15+.

Whether or not this is sufficient protection is of course open to
interpretation. One problem with those heuristics (as far as your test
case is concerned) is that they either work, or they don't work. For
example they're conditioned on "old_rel_pages == total_pages".

-- 
Peter Geoghegan



Hi,

On 2023-01-18 14:37:20 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 2:22 PM Andres Freund <andres@anarazel.de> wrote:
> > The problem with the change is here:
> >
> >         /*
> >          * Okay, we've covered the corner cases.  The normal calculation is to
> >          * convert the old measurement to a density (tuples per page), then
> >          * estimate the number of tuples in the unscanned pages using that figure,
> >          * and finally add on the number of tuples in the scanned pages.
> >          */
> >         old_density = old_rel_tuples / old_rel_pages;
> >         unscanned_pages = (double) total_pages - (double) scanned_pages;
> >         total_tuples = old_density * unscanned_pages + scanned_tuples;
> >         return floor(total_tuples + 0.5);
> 
> My assumption has always been that vac_estimate_reltuples() is prone
> to issues like this because it just doesn't have access to very much
> information each time it runs. It can only see the delta between what
> VACUUM just saw, and what the last VACUUM (or possibly the last
> ANALYZE) saw according to pg_class. You're always going to find
> weaknesses in such a model if you go looking for them. You're always
> going to find a way to salami slice your way from good information to
> total nonsense, if you pick the right/wrong test case, which runs
> VACUUM in a way that allows whatever bias there may be to accumulate.
> It's sort of like the way floating point values can become very
> inaccurate through a process that allows many small inaccuracies to
> accumulate over time.

Sure. To start with, there's always going to be some inaccuracies when you
assume an even distribution across a table. But I think this goes beyond
that.

This problem occurs with a completely even distribution, exactly the same
inputs to the estimation function every time.  My example under-sold the
severity, because I had only 5% non-deletable tuples. Here it is with 50%
non-removable tuples (I've seen way worse than 50% in many real-world cases),
and a bunch of complexity removed (attached).

vacuum-no       reltuples/n_live_tup    n_dead_tup
1               4999976                 5000000
2               2500077                 5000000
3               1250184                 5000000
4                625266                 5000000
5                312821                 5000000
10                10165                 5000000

Each vacuum halves reltuples.  That's going to screw badly with all kinds of
things. Planner costs completely out of whack etc.



I wonder if this is part of the reason for the distortion you addressed with
74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
large relation 2% of blocks is a significant number of rows, and simply never
adjusting reltuples seems quite problematic. At the very least we ought to
account for dead tids we removed or such, instead of just freezing reltuples.

Greetings,

Andres Freund

Attachment

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
> vacuum-no       reltuples/n_live_tup    n_dead_tup
> 1               4999976                 5000000
> 2               2500077                 5000000
> 3               1250184                 5000000
> 4                625266                 5000000
> 5                312821                 5000000
> 10                10165                 5000000
>
> Each vacuum halves reltuples.  That's going to screw badly with all kinds of
> things. Planner costs completely out of whack etc.

I get that that could be a big problem, even relative to the more
immediate problem of VACUUM just spinning like it does in your test
case. What do you think we should do about it? What do you think about
my idea of focussing on the subset of pages newly set all-visible in
the VM?

> I wonder if this is part of the reason for the distortion you addressed with
> 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> large relation 2% of blocks is a significant number of rows, and simply never
> adjusting reltuples seems quite problematic. At the very least we ought to
> account for dead tids we removed or such, instead of just freezing reltuples.

As I mentioned, it only kicks in when relpages is *precisely* the same
as last time (not one block more or one block less), *and* we only
scanned less than 2% of rel_pages. It's quite possible that that's
insufficient, but I can't imagine it causing any new problems.

I think that we need to be realistic about what's possible while
storing a small, fixed amount of information. There is always going to
be some distortion of this kind. We can do something about the
obviously pathological cases, where errors can grow without bound. But
you have to draw the line somewhere, unless you're willing to replace
the whole approach with something that stores historic metadata.

What kind of tradeoff do you want to make here? I think that you have
to make one.

-- 
Peter Geoghegan



Hi,

On 2023-01-18 13:45:19 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 1:08 PM Andres Freund <andres@anarazel.de> wrote:
> > I suggested nearby to only have ANALYZE update dead_tuples if there's been no
> > [auto]vacuum since the stats entry was created. That allows recovering from
> > stats resets, be it via crashes or explicitly. What do you think?
>
> I like that idea. It's far simpler than the kind of stuff I was
> thinking about, and probably just as effective. Even if it introduces
> some unforeseen problem (which seems unlikely), we can still rely on
> pgstat_report_vacuum() to set things straight before too long.
>
> Are you planning on writing a patch for this? I'd be very interested
> in seeing this through. Could definitely review it.

I can, it should be just about trivial code-wise. A bit queasy about trying to
foresee the potential consequences.


A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
what VACUUM itself observed, ignoring any concurrently reported dead
tuples. As far as I can tell, when vacuum takes a long time, that can lead to
severely under-accounting dead tuples.

We probably lose track of a bit more than 50% of the dead tuples reported
since vacuum started. During the heap scan phase we don't notice all the
tuples reported before the current scan point, and then we don't notice them
at all during the index/heap vacuuming.

The only saving grace is that it'll be corrected at the next VACUUM. But the
next vacuum might very well be delayed noticeably due to this.


This is similar to the issue around ins_since_vacuum that Peter pointed out.


I wonder if we ought to remember the dead_tuples value at the start of the
heap scan and use that to adjust the final dead_tuples value. I'd lean towards
over-counting rather than under-counting and thus probably would go for
something like

  tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);

i.e. assuming we might have missed all concurrently reported dead tuples.




Of course we could instead move to something like ins_since_vacuum and reset
it at the *start* of the vacuum. But that'd make the error case harder,
without giving us more accuracy, I think?


I do think this is an argument for splitting up dead_tuples into separate
"components" that we track differently. I.e. tracking the number of dead
items, not-yet-removable rows, and the number of dead tuples reported from DML
statements via pgstats.
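
Sketch of the split (struct and field names made up):

    typedef struct PgStat_DeadTupleCounts
    {
        PgStat_Counter dead_items;      /* LP_DEAD line pointers */
        PgStat_Counter recently_dead;   /* dead, but not yet removable */
        PgStat_Counter dml_dead;        /* dead tuples reported by DML via pgstats */
    } PgStat_DeadTupleCounts;

Each component could be updated incrementally at the point where we
actually learn about it, and autovacuum scheduling could weigh them
differently.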

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:
> I can, it should be just about trivial code-wise. A bit queasy about trying to
> foresee the potential consequences.

That's always going to be true, though.

> A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
> what VACUUM itself observed, ignoring any concurrently reported dead
> tuples. As far as I can tell, when vacuum takes a long time, that can lead to
> severely under-accounting dead tuples.

Did I not mention that one? There are so many that it can be hard to
keep track! That's why I catalog them.

As you point out, it's the dead tuples equivalent of my
ins_since_vacuum complaint. The problem is exactly analogous to my
recent complaint about insert-driven autovacuums.

> I wonder if we ought to remember the dead_tuples value at the start of the
> heap scan and use that to adjust the final dead_tuples value. I'd lean towards
> over-counting rather than under-counting and thus probably would go for
> something like
>
>   tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);
>
> i.e. assuming we might have missed all concurrently reported dead tuples.

This is exactly what I was thinking of doing for both issues (the
ins_since_vacuum one and this similar dead tuples one). It's
completely logical.

This creates an awkward but logical question, though: what if
dead_tuples doesn't go down at all? What if VACUUM actually has to
increase it, because VACUUM runs so slowly relative to the workload?
Of course the whole point is to make it more likely that VACUUM will
keep up with the workload. I'm just not quite sure that the
consequences of doing it that way are strictly a good thing. Bearing
in mind that we don't differentiate between recently dead and dead
here.

Fun fact: autovacuum can spin with pgbench because of recently dead
tuples, even absent an old snapshot/long running xact, if you set
things aggressively enough:

https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com

I think that we probably need to do something like always make sure
that dead_items goes down by a small amount at the end of each VACUUM,
even when that's a lie. Maybe we also have a log message about
autovacuum not keeping up, so as to not feel too guilty about it. You
know, to give the user a chance to reconfigure autovacuum so that it
stops happening.
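
Something very roughly like this, at the point where VACUUM reports its
final numbers (every name here is invented, and the 5% is arbitrary):

    /*
     * Refuse to let a completed VACUUM report dead_tuples higher than it
     * was at the start -- force a small decrease instead, and complain,
     * so the user has a chance to notice that autovacuum can't keep up.
     */
    if (new_dead_tuples >= dead_tuples_at_start)
    {
        new_dead_tuples = dead_tuples_at_start * 0.95;
        ereport(LOG,
                (errmsg("autovacuum of \"%s\" could not keep up with concurrent changes",
                        RelationGetRelationName(rel))));
    }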

> Of course we could instead move to something like ins_since_vacuum and reset
> it at the *start* of the vacuum. But that'd make the error case harder,
> without giving us more accuracy, I think?

It would. It seems illogical to me.

> I do think this is an argument for splitting up dead_tuples into separate
> "components" that we track differently. I.e. tracking the number of dead
> items, not-yet-removable rows, and the number of dead tuples reported from DML
> statements via pgstats.

Is it? Why?

I'm all in favor of doing that, of course. I just don't particularly
think that it's related to this other problem. One problem is that we
count dead tuples incorrectly because we don't account for the fact
that things change while VACUUM runs. The other problem is that the
thing that is counted isn't broken down into distinct subcategories of
things -- things are bunched together that shouldn't be.

Oh wait, you were thinking of what I said before -- my "awkward but
logical question". Is that it?

-- 
Peter Geoghegan



Hi,

On 2023-01-18 16:19:02 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
> > vacuum-no       reltuples/n_live_tup    n_dead_tup
> > 1               4999976                 5000000
> > 2               2500077                 5000000
> > 3               1250184                 5000000
> > 4                625266                 5000000
> > 5                312821                 5000000
> > 10                10165                 5000000
> >
> > Each vacuum halves reltuples.  That's going to screw badly with all kinds of
> > things. Planner costs completely out of whack etc.
>
> I get that that could be a big problem, even relative to the more
> immediate problem of VACUUM just spinning like it does in your test
> case. What do you think we should do about it?

The change made in 7c91a0364fc imo isn't right. We need to fix it. I think
it's correct that now pgstat_report_vacuum() doesn't include recently dead
tuples in livetuples - they're still tracked via deadtuples.  But it's wrong
for vacuum's call to vac_update_relstats() to not include recently dead
tuples, at least when we only scanned part of the relation.

I think the right thing would be to not undo the semantic change of
7c91a0364fc as a whole, but instead take recently-dead tuples into account
only in the "Okay, we've covered the corner cases." part, to avoid the
spiraling seen above.

Not super clean, but it seems somewhat fundamental that we'll re-scan pages
full of recently-dead tuples in the near future. If we, in a way, subtract
the recently dead tuples from reltuples in this cycle, we shouldn't do so
again in the next - but not taking recently-dead tuples into account does exactly that.


It's a bit complicated because of the APIs involved. vac_estimate_reltuples()
computes vacrel->new_live_tuples and contains the logic for how to compute the
new reltuples. But we use ->new_live_tuples both in vac_update_relstats(),
where we, imo, should take recently-dead into account for partial scans, and
in pgstat_report_vacuum(), where we shouldn't.  I guess we would need to add
an output parameter for both "reltuples" and "new_live_tuples".
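
Shape-wise, maybe something like this (purely illustrative, not a
worked-out signature):

    /*
     * "*reltuples" would include recently-dead tuples for partially
     * scanned relations (for vac_update_relstats), "*new_live_tuples"
     * would not (for pgstat_report_vacuum).
     */
    extern void vac_estimate_reltuples(Relation relation,
                                       BlockNumber total_pages,
                                       BlockNumber scanned_pages,
                                       double scanned_live_tuples,
                                       double scanned_recently_dead_tuples,
                                       double *reltuples,
                                       double *new_live_tuples);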







> What do you think about my idea of focussing on the subset of pages newly
> set all-visible in the VM?

I don't understand it yet.

On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> Perhaps we should make vac_estimate_reltuples focus on the pages that
> VACUUM newly set all-visible each time (not including all-visible
> pages that got scanned despite being all-visible) -- only that subset
> of scanned_pages seems to be truly relevant. That way you wouldn't be
> able to do stuff like this. We'd have to start explicitly tracking the
> number of pages that were newly set in the VM in vacuumlazy.c to be
> able to do that, but that seems like a good idea anyway.

Can you explain a bit more what you mean with "focus on the pages" means?



> > I wonder if this is part of the reason for the distortion you addressed with
> > 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> > large relation 2% of blocks is a significant number of rows, and simply never
> > adjusting reltuples seems quite problematic. At the very least we ought to
> > account for dead tids we removed or such, instead of just freezing reltuples.
>
> As I mentioned, it only kicks in when relpages is *precisely* the same
> as last time (not one block more or one block less), *and* we only
> scanned less than 2% of rel_pages. It's quite possible that that's
> insufficient, but I can't imagine it causing any new problems.

In OLTP workloads relpages will often not change, even if there's lots of
write activity, because there's plenty of free space in the relation, and there's
something not-removable on the last page. relpages also won't change if data
is deleted anywhere but the end.

I don't think it's hard to see this causing problems. Set
autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
frequently vacuum manually. Incrementally delete old data.  Unless analyze
saves you - which might not be run or might have a different scale factor or
not be run manually - reltuples will stay exactly the same, despite data
changing substantially.


Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund <andres@anarazel.de> wrote:
> How about a float autovacuum_no_auto_cancel_age where positive values are
> treated as absolute values, and negative values are a multiple of
> autovacuum_freeze_max_age? And where the "computed" age is capped at
> vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.
>
> And maybe a default setting of -1.8 or so?

Attached is a new revision, v5. I'm not happy with this, but thought
it would be useful to show you where I am with it.

It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age)
that can sometimes work as a cutoff that works similarly to both
freeze_max_age and multixact_freeze_max_age, but usually works as a
multiplier. It's both an XID age value, an MXID age value, and a
multiplier on XID/MXID age values.

What if it was just a simple multiplier on
freeze_max_age/multixact_freeze_max_age, without changing any other
detail?
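
For reference, here is roughly the computation I understand your float
proposal to imply (the GUC and function names come from this thread,
nothing is settled). The "simple multiplier" variant I'm asking about
would amount to keeping only the else branch:

    static int64
    no_auto_cancel_xid_age(void)
    {
        double      age;

        if (autovacuum_no_auto_cancel_age >= 0)
            age = autovacuum_no_auto_cancel_age;    /* absolute XID age */
        else
            age = -autovacuum_no_auto_cancel_age * autovacuum_freeze_max_age;

        /* a failsafe autovacuum should never be auto-cancellable */
        if (age > vacuum_failsafe_age)
            age = vacuum_failsafe_age;

        return (int64) age;
    }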

-- 
Peter Geoghegan

Attachment
Hi,

On 2023-01-18 17:00:48 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:
> > I can, it should be just about trivial code-wise. A bit queasy about trying to
> > foresee the potential consequences.
> 
> That's always going to be true, though.
> 
> > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
> > what VACUUM itself observed, ignoring any concurrently reported dead
> > tuples. As far as I can tell, when vacuum takes a long time, that can lead to
> > severely under-accounting dead tuples.
> 
> Did I not mention that one? There are so many that it can be hard to
> keep track! That's why I catalog them.

I don't recall you doing so, but there are a lot of emails and holes in my head.


> This creates an awkward but logical question, though: what if
> dead_tuples doesn't go down at all? What if VACUUM actually has to
> increase it, because VACUUM runs so slowly relative to the workload?

Sure, that can happen - but it's not made better by having wrong stats :)


> > I do think this is an argument for splitting up dead_tuples into separate
> > "components" that we track differently. I.e. tracking the number of dead
> > items, not-yet-removable rows, and the number of dead tuples reported from DML
> > statements via pgstats.
> 
> Is it? Why?

We have reasonably sophisticated accounting in pgstats of what newly live/dead
rows a transaction "creates". So an obvious (and wrong) idea is to just decrement
reltuples by the number of tuples removed by autovacuum. But we can't do that,
because inserted/deleted tuples reported by backends can be removed by
on-access pruning and vacuumlazy doesn't know about all changes made by its
call to heap_page_prune().

But I think that if we add a
  pgstat_count_heap_prune(nredirected, ndead, nunused)
around heap_page_prune() and a
  pgstat_count_heap_vacuum(nunused)
in lazy_vacuum_heap_page(), we'd likely end up with a better approximation
than what vac_estimate_reltuples() does, in the "partially scanned" case.



> I'm all in favor of doing that, of course. I just don't particularly
> think that it's related to this other problem. One problem is that we
> count dead tuples incorrectly because we don't account for the fact
> that things change while VACUUM runs. The other problem is that the
> thing that is counted isn't broken down into distinct subcategories of
> things -- things are bunched together that shouldn't be.

If we only adjust the counters incrementally, as we go, we'd not update them
at the end of vacuum. I think it'd be a lot easier to only update the counters
incrementally if we split ->dead_tuples into sub-counters.

So I don't think it's entirely unrelated.

You probably could get close without splitting the counters, by just pushing
down the counting, and only counting redirected and unused during heap
pruning. But I think it's likely to be more accurate with the split counter.



> Oh wait, you were thinking of what I said before -- my "awkward but
> logical question". Is that it?

I'm not quite following? The "awkward but logical" bit is in the email I'm
just replying to, right?

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> > Perhaps we should make vac_estimate_reltuples focus on the pages that
> > VACUUM newly set all-visible each time (not including all-visible
> > pages that got scanned despite being all-visible) -- only that subset
> > of scanned_pages seems to be truly relevant. That way you wouldn't be
> > able to do stuff like this. We'd have to start explicitly tracking the
> > number of pages that were newly set in the VM in vacuumlazy.c to be
> > able to do that, but that seems like a good idea anyway.
>
> Can you explain a bit more what you mean with "focus on the pages" means?

We don't say anything about pages we didn't scan right now -- only
scanned_pages have new information, so we just extrapolate. Why not go
even further than that, by only saying something about the pages that
were both scanned and newly set all-visible?

Under this scheme, the pages that were scanned but couldn't be newly
set all-visible are treated just like the pages that weren't scanned
at all -- they get a generic estimate from the existing reltuples.

> I don't think it's hard to see this causing problems. Set
> autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
> frequently vacuum manually. Incrementally delete old data.  Unless analyze
> saves you - which might not be run or might have a different scale factor or
> not be run manually - reltuples will stay exactly the same, despite data
> changing substantially.

You seem to be saying that it's a problem if we don't update reltuples
-- an estimate -- when less than 2% of the table is scanned by VACUUM.
But why? Why can't we just do nothing sometimes? I mean in general,
leaving aside the heuristics I came up with for a moment?

It will cause problems if we remove the heuristics. Much less
theoretical problems. What about those?

How often does VACUUM scan so few pages, anyway? We've been talking
about how ineffective autovacuum_vacuum_scale_factor is, at great
length, but now you're saying that it *can* meaningfully trigger not
just one VACUUM, but many VACUUMs, where no more than 2% of  rel_pages
are not all-visible (pages, not tuples)? Not just once, mind you, but
many times? And in the presence of some kind of highly variable tuple
size, where it actually could matter to the planner at some point?

I would be willing to just avoid even these theoretical problems if
there was some way to do so, that didn't also create new problems. I
have my doubts that that is possible, within the constraints of
updating pg_class. Or the present constraints, at least. I am not a
miracle worker -- I can only do so much with the information that's
available to vac_update_relstats (and/or the information that can
easily be made available).

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund <andres@anarazel.de> wrote:
> > This creates an awkward but logical question, though: what if
> > dead_tuples doesn't go down at all? What if VACUUM actually has to
> > increase it, because VACUUM runs so slowly relative to the workload?
>
> Sure, that can happen - but it's not made better by having wrong stats :)

Maybe it's that simple. It's reasonable to wonder how far we want to
go with letting dead tuples grow and grow, even when VACUUM is running
constantly. It's not a question that has an immediate and obvious
answer IMV.

Maybe the real question is: is this an opportunity to signal to the
user (say via a LOG message) that VACUUM cannot keep up? That might be
very useful, in a general sort of way (not just to avoid new
problems).

> We have reasonably sophisticated accounting in pgstats of what newly live/dead
> rows a transaction "creates". So an obvious (and wrong) idea is to just decrement
> reltuples by the number of tuples removed by autovacuum.

Did you mean dead_tuples?

> But we can't do that,
> because inserted/deleted tuples reported by backends can be removed by
> on-access pruning and vacuumlazy doesn't know about all changes made by its
> call to heap_page_prune().

I'm not following here. Perhaps this is a good sign that I should stop
working for the day.  :-)

> But I think that if we add a
>   pgstat_count_heap_prune(nredirected, ndead, nunused)
> around heap_page_prune() and a
>   pgstat_count_heap_vacuum(nunused)
> in lazy_vacuum_heap_page(), we'd likely end up with a better approximation
> than what vac_estimate_reltuples() does, in the "partially scanned" case.

What does vac_estimate_reltuples() have to do with dead tuples?

--
Peter Geoghegan



Hi,

On 2023-01-18 18:21:33 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> > > Perhaps we should make vac_estimate_reltuples focus on the pages that
> > > VACUUM newly set all-visible each time (not including all-visible
> > > pages that got scanned despite being all-visible) -- only that subset
> > > of scanned_pages seems to be truly relevant. That way you wouldn't be
> > > able to do stuff like this. We'd have to start explicitly tracking the
> > > number of pages that were newly set in the VM in vacuumlazy.c to be
> > > able to do that, but that seems like a good idea anyway.
> >
> > Can you explain a bit more what you mean with "focus on the pages" means?
>
> We don't say anything about pages we didn't scan right now -- only
> scanned_pages have new information, so we just extrapolate. Why not go
> even further than that, by only saying something about the pages that
> were both scanned and newly set all-visible?
>
> Under this scheme, the pages that were scanned but couldn't be newly
> set all-visible are treated just like the pages that weren't scanned
> at all -- they get a generic estimate from the existing reltuples.

I don't think that'd work well either. If we actually removed a large chunk of
the tuples in the table it should be reflected in reltuples, otherwise costing
and autovac scheduling suffers. And you might not be able to set all those
pages to all-visible, because of more recent rows.


> > I don't think it's hard to see this causing problems. Set
> > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
> > frequently vacuum manually. Incrementally delete old data.  Unless analyze
> > saves you - which might not be run or might have a different scale factor or
> > not be run manually - reltuples will stay exactly the same, despite data
> > changing substantially.
>
> You seem to be saying that it's a problem if we don't update reltuples
> -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> But why? Why can't we just do nothing sometimes? I mean in general,
> leaving aside the heuristics I came up with for a moment?

The problem isn't that we might apply the heuristic once, that'd be fine. But
that there's nothing preventing it from applying until there basically are no
tuples left, as long as the vacuum is frequent enough.

As a demo: The attached sql script ends up with a table containing 10k rows,
but reltuples being set to 1 million.

VACUUM foo;
EXPLAIN (ANALYZE) SELECT * FROM foo;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                QUERY PLAN                                                 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=3.251..4.693 rows=10000 loops=1) │
│ Planning Time: 0.056 ms                                                                                   │
│ Execution Time: 5.491 ms                                                                                  │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)
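
The attached script itself isn't reproduced in the archive, but a sketch along
these lines shows the same effect (the sizes, and disabling autovacuum on the
table so that autoanalyze doesn't interfere with the demo, are just
illustrative choices here):

CREATE TABLE foo (i int) WITH (autovacuum_enabled = off);
INSERT INTO foo SELECT generate_series(1, 1000000);
VACUUM foo;    -- reltuples starts out accurate, roughly 1 million

-- repeat the next two statements ~100 times (psql \gexec, a shell loop, ...),
-- shifting the deleted range by 10000 each round; each round dirties only
-- ~45 of the table's ~4400 pages, i.e. well under 2% of rel_pages:
DELETE FROM foo WHERE i BETWEEN 10001 AND 20000;
VACUUM foo;

SELECT reltuples, relpages FROM pg_class WHERE oid = 'foo'::regclass;
-- reltuples stays around 1 million even though only 10000 live rows remain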


> It will cause problems if we remove the heuristics. Much less
> theoretical problems. What about those?

I don't think what I describe is a theoretical problem.


> How often does VACUUM scan so few pages, anyway? We've been talking
> about how ineffective autovacuum_vacuum_scale_factor is, at great
> length, but now you're saying that it *can* meaningfully trigger not
just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages
> are not all-visible (pages, not tuples)? Not just once, mind you, but
> many times?

I've seen autovacuum_vacuum_scale_factor set to 0.01 repeatedly. But that's
not even needed - you just need a long-running transaction preventing at least
one dead row from getting removed, plus a table that hits
autovacuum_freeze_max_age. There'll be continuous VACUUMs of the table, all
only processing a small fraction of it.

And I have many times seen bulk loading / deletion scripts that do VACUUM on a
regular schedule, which also could easily trigger this.
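
For illustration, either of these is enough to get a table into that regime
(the table name is just a placeholder):

-- a per-table scale factor set very low:
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);

-- or a long-lived snapshot in another session, which keeps dead rows from
-- being removed; once the table also crosses autovacuum_freeze_max_age, you
-- get the continuous small VACUUMs described above:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;    -- takes the transaction snapshot, then the session just sits idle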


> And in the presence of some kind of highly variable tuple
> size, where it actually could matter to the planner at some point?

I don't see how a variable tuple size needs to be involved? As the EXPLAIN
ANALYZE above shows, we'll end up with wrong row count estimates etc.


> I would be willing to just avoid even these theoretical problems if
> there was some way to do so, that didn't also create new problems. I
> have my doubts that that is possible, within the constraints of
> updating pg_class. Or the present constraints, at least. I am not a
> miracle worker -- I can only do so much with the information that's
> available to vac_update_relstats (and/or the information that can
> easily be made available).

I'm worried they might cause new problems.

Greetings,

Andres Freund


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund <andres@anarazel.de> wrote:
> > You seem to be saying that it's a problem if we don't update reltuples
> > -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> > But why? Why can't we just do nothing sometimes? I mean in general,
> > leaving aside the heuristics I came up with for a moment?
>
> The problem isn't that we might apply the heuristic once, that'd be fine. But
> that there's nothing preventing it from applying until there basically are no
> tuples left, as long as the vacuum is frequent enough.
>
> As a demo: The attached sql script ends up with a table containing 10k rows,
> but reltuples being set to 1 million.

I saw that too. But then I checked again a few seconds later, and
autoanalyze had run, so reltuples was 10k. Just like it would have if
there were no VACUUM statements in your script.
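
For what it's worth, the before/after is easy to see with something like this
(all real columns, with "foo" standing in for the demo table):

SELECT c.reltuples, s.n_live_tup, s.last_vacuum, s.last_autoanalyze
FROM pg_class c
JOIN pg_stat_all_tables s ON s.relid = c.oid
WHERE c.oid = 'foo'::regclass;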

-- 
Peter Geoghegan



On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> pgstat_report_analyze() will totally override the
> tabentry->dead_tuples information that drives autovacuum.c, based on
> an estimate derived from a random sample -- which seems to me to be an
> approach that just doesn't have any sound theoretical basis.

In other words, ANALYZE sometimes (but not always) produces wrong answers.

On Wed, Jan 18, 2023 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:
> One complicating factor is that VACUUM sometimes computes an incrementally
> more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> computes something sane. I unintentionally encountered one when I was trying
> something while writing this email, reproducer attached.

In other words, VACUUM sometimes (but not always) produces wrong answers.

TL;DR: We're screwed.

I refuse to believe that any amount of math you can do on numbers that
can be arbitrarily inaccurate will result in an accurate answer
popping out the other end. Trying to update the reltuples estimate
incrementally based on an estimate derived from a non-random,
likely-to-be-skewed subset of the table is always going to produce
distortion that gets worse and worse the more times you do it. If
you could say, well, the existing estimate of let's say 100 tuples per
page is based on the density being 200 tuples per page in the pages I
just scanned and 50 tuples per page in the rest of the table, then you
could calculate a new estimate that keeps the value of 50 tuples per
page for the remainder of the table intact and just replaces the
estimate for the part you just scanned. But we have no way of doing
that, so we just make some linear combination of the old estimate with
the new one. That overweights the repeatedly-sampled portions of the
table more and more, making the estimate wronger and wronger.
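
To put rough numbers on that (this is my reading of the extrapolation that
vac_estimate_reltuples does when it doesn't just keep the old value): take a
10,000 page table with a stored estimate of 100 tuples/page, and a VACUUM that
scans 500 pages which actually hold 200 tuples/page.

SELECT (1000000.0 / 10000) * (10000 - 500)   -- old density * unscanned pages
       + 500 * 200                           -- tuples counted on scanned pages
       AS next_reltuples;
-- => 1050000, versus the 9500 * 50 + 500 * 200 = 575000 we would want if the
-- unscanned remainder really holds 50 tuples/page; feeding each new estimate
-- back in on the next VACUUM just compounds the drift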

Now, that is already quite bad. But if we accept the premise that
neither VACUUM nor ANALYZE is guaranteed to ever produce a new
actually-reliable estimate, then not only will we go progressively
more wrong as time goes by, but we have no way of ever fixing
anything. If you get a series of unreliable data points followed by a
reliable data point, you can at least get back on track when the
reliable data shows up. But it sounds like you guys are saying that
there's no guarantee that will ever happen, which is a bit like
discovering that not only do you have a hole in your gas tank but
there is no guarantee that you will arrive at a gas station ever again
regardless of distance travelled.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2023-01-19 15:12:12 -0500, Robert Haas wrote:
> On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > pgstat_report_analyze() will totally override the
> > tabentry->dead_tuples information that drives autovacuum.c, based on
> > an estimate derived from a random sample -- which seems to me to be an
> > approach that just doesn't have any sound theoretical basis.
> 
> In other words, ANALYZE sometimes (but not always) produces wrong answers.

For dead tuples, but not live tuples.


> On Wed, Jan 18, 2023 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:
> > One complicating factor is that VACUUM sometimes computes an incrementally
> > more bogus n_live_tup when it skips pages due to the VM, whereas ANALYZE
> > computes something sane. I unintentionally encountered one when I was trying
> > something while writing this email, reproducer attached.
> 
> In other words, VACUUM sometimes (but not always) produces wrong answers.

For live tuples, but not badly so for dead tuples.


> TL;DR: We're screwed.

We are, but perhaps not too badly so, because we can choose to believe analyze
more for live tuples, and vacuum for dead tuples. Analyze doesn't compute
reltuples incrementally and vacuum doesn't compute deadtuples incrementally.



> I refuse to believe that any amount of math you can do on numbers that
> can be arbitrarily inaccurate will result in an accurate answer
> popping out the other end. Trying to update the reltuples estimate
> incrementally based on an estimate derived from a non-random,
> likely-to-be-skewed subset of the table is always going to produce
> distortion that gets worse and worse the more times you do it. If
> could say, well, the existing estimate of let's say 100 tuples per
> page is based on the density being 200 tuples per page in the pages I
> just scanned and 50 tuples per page in the rest of the table, then you
> could calculate a new estimate that keeps the value of 50 tuples per
> page for the remainder of the table intact and just replaces the
> estimate for the part you just scanned. But we have no way of doing
> that, so we just make some linear combination of the old estimate with
> the new one. That overweights the repeatedly-sampled portions of the
> table more and more, making the estimate wronger and wronger.

Perhaps we should, at least occasionally, make vacuum do a cheaper version of
analyze's sampling to compute an updated reltuples. This could even happen
during the heap scan phase.

I don't like relying on analyze to fix vacuum's bogus reltuples, because
there's nothing forcing an analyze run soon after vacuum [incrementally]
screwed it up. Vacuum can be forced to run a lot of times due to xid horizons
preventing cleanup, after which there isn't anything forcing analyze to run
again.

But in contrast to dead_tuples, where I think we can just stop analyze from
updating it unless we crashed recently, I do think we need to update reltuples
in vacuum. So computing an accurate value seems like the least unreasonable
thing I can see.

Greetings,

Andres Freund



Hi,

On 2023-01-18 19:26:22 -0800, Peter Geoghegan wrote:
> On Wed, Jan 18, 2023 at 7:04 PM Andres Freund <andres@anarazel.de> wrote:
> > > You seem to be saying that it's a problem if we don't update reltuples
> > > -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> > > But why? Why can't we just do nothing sometimes? I mean in general,
> > > leaving aside the heuristics I came up with for a moment?
> >
> > The problem isn't that we might apply the heuristic once, that'd be fine. But
> > that there's nothing preventing it from applying until there basically are no
> > tuples left, as long as the vacuum is frequent enough.
> >
> > As a demo: The attached sql script ends up with a table containing 10k rows,
> > but reltuples being set to 1 million.
> 
> I saw that too. But then I checked again a few seconds later, and
> autoanalyze had run, so reltuples was 10k. Just like it would have if
> there was no VACUUM statements in your script.

There's absolutely no guarantee that autoanalyze is triggered
there. Particularly with repeated vacuums triggered due to a relfrozenxid age
that can't be advanced, that very well might not happen within days on a large
relation.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 12:56 PM Andres Freund <andres@anarazel.de> wrote:
> > In other words, ANALYZE sometimes (but not always) produces wrong answers.
>
> For dead tuples, but not live tuples.

> > In other words, VACUUM sometimes (but not always) produces wrong answers.
>
> For live tuples, but not badly so for dead tuples.

Agreed. More generally, there is a need to think about the whole table
in some cases (like for regular optimizer statistics including
reltuples/live tuples), and the subset of pages that will be scanned
by VACUUM in other cases (for dead tuples accounting). Random sampling
at the table level is appropriate and works well enough for the
former, though not for the latter.

> We are, but perhaps not too badly so, because we can choose to believe analyze
> more for live tuples, and vacuum for dead tuples. Analyze doesn't compute
> reltuples incrementally and vacuum doesn't compute deadtuples incrementally.

Good points.

> But in contrast to dead_tuples, where I think we can just stop analyze from
> updating it unless we crashed recently, I do think we need to update reltuples
> in vacuum. So computing an accurate value seems like the least unreasonable
> thing I can see.

I agree, but there is no reasonable basis for treating scanned_pages
as a random sample, especially if it's only a small fraction of all of
rel_pages -- treating it as a random sample is completely
unjustifiable. And so it seems to me that the only thing that can be
done is to either make VACUUM behave somewhat like ANALYZE in at least
some cases, or to have it invoke ANALYZE directly (or indirectly) in
those same cases.

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 12:58 PM Andres Freund <andres@anarazel.de> wrote:
> There's absolutely no guarantee that autoanalyze is triggered
> there. Particularly with repeated vacuums triggered due to a relfrozenxid age
> that can't be advanced, that very well might not happen within days on a large
> relation.

Arguments like that work far better as arguments in favor of the
vac_estimate_reltuples heuristics.

That doesn't mean that the heuristics are good in any absolute sense,
of course. They were just a band aid intended to ameliorate some of
the negative impact that came from treating scanned_pages as a random
sample. I think that we both agree that the real problem is that
scanned_pages just isn't a random sample, at least not as far as
reltuples/live tuples is concerned (for dead tuples it kinda isn't a
sample, but is rather something close to an exact count).

I now understand that you're in favor of addressing the root problem
directly. I am also in favor of that approach. I'd be more than happy
to get rid of the band aid as part of that whole effort.

-- 
Peter Geoghegan



Hi,

On 2023-01-19 13:36:41 -0800, Peter Geoghegan wrote:
> On Thu, Jan 19, 2023 at 12:58 PM Andres Freund <andres@anarazel.de> wrote:
> > There's absolutely no guarantee that autoanalyze is triggered
> > there. Particularly with repeated vacuums triggered due to a relfrozenxid age
> > that can't be advanced, that very well might not happen within days on a large
> > relation.
> 
> Arguments like that work far better as arguments in favor of the
> vac_estimate_reltuples heuristics.

I don't agree. But mainly my issue is that the devil you know (how this has
worked for a while) is preferable to introducing an unknown quantity (your
patch that hasn't yet seen real world exposure).

Greetings,

Andres Freund



Hi,

On 2023-01-19 13:22:28 -0800, Peter Geoghegan wrote:
> On Thu, Jan 19, 2023 at 12:56 PM Andres Freund <andres@anarazel.de> wrote:
> > But in contrast to dead_tuples, where I think we can just stop analyze from
> > updating it unless we crashed recently, I do think we need to update reltuples
> > in vacuum. So computing an accurate value seems like the least unreasonable
> > thing I can see.
> 
> I agree, but there is no reasonable basis for treating scanned_pages
> as a random sample, especially if it's only a small fraction of all of
> rel_pages -- treating it as a random sample is completely
> unjustifiable.

Agreed.


> And so it seems to me that the only thing that can be done is to either make
> VACUUM behave somewhat like ANALYZE in at least some cases, or to have it
> invoke ANALYZE directly (or indirectly) in those same cases.

Yea. Hence my musing about potentially addressing this by choosing to visit
additional blocks during the heap scan using vacuum's block sampling logic.

IME most of the time in analyze isn't spent doing IO for the sample blocks
themselves, but CPU time and IO for toasted columns. A trimmed down version
that just computes relallvisible should be a good bit faster.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:
> Yea. Hence my musing about potentially addressing this by choosing to visit
> additional blocks during the heap scan using vacuum's block sampling logic.

I'd rather just invent a way for vacuumlazy.c to tell the top-level
vacuum.c caller "I didn't update reltuples, but you ought to go
ANALYZE the table now that I'm done, even if you weren't already
planning to do so". This wouldn't have to happen every time, but it
would happen fairly often.

> IME most of the time in analyze isn't spent doing IO for the sample blocks
> themselves, but CPU time and IO for toasted columns. A trimmed down version
> that just computes relallvisible should be a good bit faster.

I worry about that from a code maintainability point of view. I'm
concerned that it won't be very cut down at all, in the end.

Presumably you'll want to add the same I/O prefetching logic to this
cut-down version, just for example. Since without that there will be
no competition between it and ANALYZE proper. Besides which, isn't it
kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
detoasting overhead that way. But even still.

-- 
Peter Geoghegan



Hi,

On 2023-01-19 15:10:38 -0800, Peter Geoghegan wrote:
> On Thu, Jan 19, 2023 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea. Hence my musing about potentially addressing this by choosing to visit
> > additional blocks during the heap scan using vacuum's block sampling logic.
> 
> I'd rather just invent a way for vacuumlazy.c to tell the top-level
> vacuum.c caller "I didn't update reltuples, but you ought to go
> ANALYZE the table now that I'm done, even if you weren't already
> planning to do so".

I'm worried about increasing the number of analyzes that much - on a subset of
workloads it's really quite slow.


Another version of this could be to integrate analyze.c's scan more closely
with vacuum all the time. It's a bit bonkers that we often sequentially read
blocks, evict them from shared buffers if we read them, just to then
afterwards do random IO for blocks we've already read. That's imo what we
eventually should do, but clearly it's not a small project.


> This wouldn't have to happen every time, but it would happen fairly often.

Do you have a mechanism for that in mind? Just something like vacuum_count %
10 == 0? Or remember scanned_pages in pgstats and re-compute based on that?


> > IME most of the time in analyze isn't spent doing IO for the sample blocks
> > themselves, but CPU time and IO for toasted columns. A trimmed down version
> > that just computes relallvisible should be a good bit faster.
> 
> I worry about that from a code maintainability point of view. I'm
> concerned that it won't be very cut down at all, in the end.

I think it'd be fine to just use analyze.c and pass in an option to not
compute column and inheritance stats.


> Presumably you'll want to add the same I/O prefetching logic to this
> cut-down version, just for example. Since without that there will be
> no competition between it and ANALYZE proper. Besides which, isn't it
> kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
> detoasting overhead that way. But even still.

It's not just that analyze is expensive, I think it'll also be confusing if
the column stats change after a manual VACUUM without ANALYZE.

It shouldn't be too hard to figure out whether we're going to do an analyze
anyway, and not do the rowcount-estimate version when doing VACUUM ANALYZE or
when autovacuum scheduled an analyze as well.

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Thu, Jan 19, 2023 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> Another version of this could be to integrate analyze.c's scan more closely
> with vacuum all the time. It's a bit bonkers that we often sequentially read
> blocks, evict them from shared buffers if we read them, just to then
> afterwards do random IO for blocks we've already read. That's imo what we
> eventually should do, but clearly it's not a small project.

Very often, what we're really missing in VACUUM is high level context.
That's true of what you say here, about analyze.c, as well as
complaints like your vac_estimate_reltuples complaint. The problem
scenarios involving vac_estimate_reltuples all involve repeating the
same silly action again and again, never realizing that that's what's
going on. I've found it very useful to think of one VACUUM as picking
up where the last one left off for my work on freezing.

This seems related to pre-autovacuum historical details. VACUUM
shouldn't really be a command in the same way that CREATE INDEX is a
command. I do think that we need to retain a VACUUM command in some
form, but it should be something pretty close to a command that just
enqueues off-schedule autovacuums. That can do things like coalesce
duplicate requests into one.

Anyway, I am generally in favor of a design that makes VACUUM and
ANALYZE things that are more or less owned by autovacuum. It should be
less and less of a problem to blur the distinction between VACUUM and
ANALYZE under this model, in each successive release. These
distinctions are quite unhelpful, overall, because they make it hard
for autovacuum scheduling to work at the whole-system level.

> > This wouldn't have to happen every time, but it would happen fairly often.
>
> Do you have a mechanism for that in mind? Just something like vacuum_count %
> 10 == 0? Or remember scanned_pages in pgstats and re-compute based on that?

I was thinking of something very simple like that, yes.

> I think it'd be fine to just use analyze.c and pass in an option to not
> compute column and inheritance stats.

That could be fine. Just as long as it's not duplicative in an obvious way.

> > Presumably you'll want to add the same I/O prefetching logic to this
> > cut-down version, just for example. Since without that there will be
> > no competition between it and ANALYZE proper. Besides which, isn't it
> > kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
> > detoasting overhead that way. But even still.
>
> It's not just that analyze is expensive, I think it'll also be confusing if
> the column stats change after a manual VACUUM without ANALYZE.

Possibly, but it doesn't have to happen there. It's not like the rules
aren't a bit different compared to autovacuum already. For example,
the way TOAST tables are handled by the VACUUM command versus
autovacuum.

Even if it's valuable to maintain this kind of VACUUM/autovacuum
parity (which I tend to doubt), doesn't the same argument work almost
as well with whatever stripped down version you come up with? It's
also confusing that a manual VACUUM command will be doing an
ANALYZE-like thing. Especially in cases where it's really expensive
relative to the work of VACUUM, because VACUUM scanned so few pages.
You just have to make some kind of trade-off.

-- 
Peter Geoghegan



On Thu, Jan 19, 2023 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:
> I don't agree. But mainly my issue is that the devil you know (how this has
> worked for a while) is preferable to introducing an unknown quantity (your
> patch that hasn't yet seen real world exposure).

Yeah, this is a major reason why I'm very leery about changes in this
area. A lot of autovacuum behavior is emergent, in the sense that it
wasn't directly intended by whoever wrote the code. It's just a
consequence of other decisions that probably seemed very reasonable at
the time they were made but turned out to have surprising and
unpleasant consequences.

In this particular case, I think that there is a large risk that
postponing auto-cancellation will make things significantly worse,
possibly drastically worse, for a certain class of users -
specifically, those whose vacuums often get auto-cancelled. I think
that it's actually pretty common for people to have workloads where
something pretty close to all of the autovacuums get auto-cancelled on
certain tables, and those people are always hard up against
autovacuum_freeze_max_age because they *have* to hit that in order to
get any vacuuming done on the affected tables. If the default
threshold for auto-cancellation goes up, those people will be
vacuuming even less often than they are now.
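
Those tables are easy to spot, since age(relfrozenxid) hovers at or above
autovacuum_freeze_max_age; something like this (ignoring any per-table
autovacuum_freeze_max_age reloption) shows them:

SELECT c.oid::regclass AS "table",
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;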

That's why I really liked your idea of decoupling auto-cancellation
from XID age. Such an approach can still avoid disabling
auto-cancellation just because autovacuum_freeze_max_age has been hit,
but it can also disable it much earlier when it detects that doing so
is necessary to make progress.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 20, 2023 at 5:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, this is a major reason why I'm very leery about changes in this
> area. A lot of autovacuum behavior is emergent, in the sense that it
> wasn't directly intended by whoever wrote the code. It's just a
> consequence of other decisions that probably seemed very reasonable at
> the time they were made but turned out to have surprising and
> unpleasant consequences.

I certainly agree with your general description of the ways things
are. To a large degree we're living in a world where DBAs have already
compensated for some of the autovacuum shortcomings discussed on this
thread. For example, by setting autovacuum_vacuum_scale_factor (and
even autovacuum_vacuum_insert_scale_factor) to very low values, to
compensate for the issues with random sampling of dead tuples by
analyze, and to compensate for the way that VACUUM doesn't reason
correctly about how the number of dead tuples changes as VACUUM runs.
They might not have thought of it that way -- it could have happened
as a byproduct of tuning a production system through trial and error
-- but it still counts as compensating for a defect in autovacuum
scheduling IMV.

It's actually quite likely that even a strict improvement to (say)
autovacuum scheduling will cause some number of regressions, since now
what were effectively mitigations become unnecessary. This is somewhat
similar to the dynamic with optimizer improvements, where (say) a
selectivity estimate function that's better by every available metric
can still easily cause regressions that really cannot be blamed on the
improvement itself. I personally believe that it's a price worth
paying when it comes to the issues with autovacuum statistics,
particularly the dead tuple count issues. Since much of the behavior
that we sometimes see is just absurdly bad. We have both watertight
theoretical arguments and practical experiences pointing in that
direction.

> In this particular case, I think that there is a large risk that
> postponing auto-cancellation will make things significantly worse,
> possibly drastically worse, for a certain class of users -
> specifically, those whose vacuums often get auto-cancelled.

I agree that that's a real concern for the autocancellation side of
things. That seems quite different to the dead tuples accounting
issues, in that nobody would claim that the existing behavior is
flagrantly wrong (just that it sometimes causes problems instead of
preventing them).

> That's why I really liked your idea of decoupling auto-cancellation
> from XID age. Such an approach can still avoid disabling
> auto-cancellation just because autovacuum_freeze_max_age has been hit,
> but it can also disable it much earlier when it detects that doing so
> is necessary to make progress.

To be clear, I didn't think that that's what Andres was proposing, and
my recent v5 doesn't actually do that. Even in v5, it's still
fundamentally impossible for autovacuums that are triggered by the
tuples inserted or dead tuples thresholds to not be autocancellable.

ISTM that it doesn't make sense to condition the autocancellation
behavior on table XID age in the context of dead tuple VACUUMs. It
could either be way too early or way too late at that point. I was
rather hoping to not have to build the infrastructure required for
fully decoupling the autocancellation behavior from the triggering
condition (dead tuples vs table XID age) in the scope of this
thread/patch, though I can see the appeal of that.

The only reason why I'm using table age at all is because that's how
it works already, rightly or wrongly. If nothing else, it's pretty
clear that there is no theoretical or practical reason why it has to
be exactly the same table age as the one for launching autovacuums to
advance relfrozenxid/relminmxid. In v5 of the patch, the default is to
use 1.8x of the threshold that initially makes autovacuum.c want to
launch autovacuums to deal with table age. That's based on a
suggestion from Andres, but I'd be almost as happy with a default as
low as 1.1x or even 1.05x. That's going to make very little difference
to those users that really rely on the no-auto-cancellation behavior,
while at the same time making things a lot safer for scenarios like
the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any
means, but meaningfully safer).
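
Concretely, with autovacuum_freeze_max_age at its stock 200 million, those
multipliers work out as follows:

SELECT multiplier,
       (200000000 * multiplier)::bigint AS table_age_at_no_auto_cancel
FROM (VALUES (1.05), (1.1), (1.8)) AS v(multiplier);
-- => 210, 220 and 360 million XIDs of table age, respectively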

-- 
Peter Geoghegan



On Fri, Jan 20, 2023 at 3:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The only reason why I'm using table age at all is because that's how
> it works already, rightly or wrongly. If nothing else, it's pretty
> clear that there is no theoretical or practical reason why it has to
> be exactly the same table age as the one for launching autovacuums to
> advance relfrozenxid/relminmxid. In v5 of the patch, the default is to
> use 1.8x of the threshold that initially makes autovacuum.c want to
> launch autovacuums to deal with table age. That's based on a
> suggestion from Andres, but I'd be almost as happy with a default as
> low as 1.1x or even 1.05x. That's going to make very little difference
> to those users that really rely on the no-auto-cancellation behavior,
> while at the same time making things a lot safer for scenarios like
> the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any
> means, but meaningfully safer).

It doesn't seem that way to me. What am I missing? In that case, the
problem was a DROP TRIGGER command waiting behind autovacuum's lock
and thus causing all new table locks to wait behind DROP TRIGGER's
lock request. But it does not sound like that was a one-off event. It
sounds like they used DROP TRIGGER pretty regularly. So I think this
sounds like exactly the kind of case I was talking about, where
autovacuums keep getting cancelled until we decide to stop cancelling
them.

If so, then they were going to have a problem whenever that happened.
Delaying the point at which we stop cancelling them would not help at
all, as your patch would do. What about stopping cancelling them
sooner, as with the proposal to switch to that behavior after a
certain number of auto-cancels? That doesn't prevent the problem
either. If it's aggressive enough, it has some chance of making the
problem visible in a well-run test environment, which could
conceivably prevent you from hitting it in production, but certainly
there are no guarantees.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Fri, Jan 20, 2023 at 12:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> It doesn't seem that way to me. What am I missing? In that case, the
> problem was a DROP TRIGGER command waiting behind autovacuum's lock
> and thus causing all new table locks to wait behind DROP TRIGGER's
> lock request. But it does not sound like that was a one-off event.

It's true that I cannot categorically state that it would have made
the crucial difference in this particular case. It comes down to two
factors:

1. How many attempts would any given amount of additional XID space
head room have bought them in practice? We can be all but certain that
the smallest possible number is 1, which is something.

2. Would that have been enough for relfrozenxid to be advanced in practice?

I think that it's likely that the answer to 2 is yes, since there was
no mention of bloat as a relevant factor at any point in the
postmortem. It was all about locking characteristics of antiwraparound
autovacuuming in particular, and its interaction with their
application. I think that they were perfectly okay with the autovacuum
cancellation behavior most of the time. In fact, I don't think that
there was any bloat in the table at all -- it was a really huge table
(likely an events table), and those tend to be append-only.

Even if I'm wrong about this specific case (we'll never know for
sure), the patch as written would be virtually guaranteed to make the
crucial differences in cases that I have seen up close. For example, a
case with TRUNCATE.

> It sounds like they used DROP TRIGGER pretty regularly. So I think this
> sounds like exactly the kind of case I was talking about, where
> autovacuums keep getting cancelled until we decide to stop cancelling
> them.

I don't know how you can reach that conclusion. The chances of a
non-aggressive VACUUM advancing relfrozenxid right now are virtually
zero, at least for a big table like this one. It seems quite likely
that plenty of non-aggressive autovacuums completed, or would have had
the insert-driven autovacuum feature been available.

The whole article was about how this DROP TRIGGER pattern worked just
fine most of the time, because most of the time autovacuum was just
autocancelled. They say this at one point:

"The normal autovacuum mechanism is skipped when locks are held in
order to minimize service disruption. However, because transaction
wraparound is such a severe problem, if the system gets too close to
wraparound, an autovacuum is launched that does not back off under
lock contention."

At another point:

"When the outage was resolved, we still had a number of questions: is
a wraparound autovacuum always so disruptive? Given that it was
blocking all table operations, why does it throttle itself?"

ISTM that it was a combination of aggressive vacuuming taking far
longer than usual (especially likely because this was pre freeze map),
and the no-auto-cancel behavior. Aggressive/antiwraparound VACUUMs are
naturally much more likely to coincide with periodic DDL, just because
they take so much longer. That is a dangerous combination.

-- 
Peter Geoghegan



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> I think with a bit of polish "Add autovacuum trigger instrumentation." ought
> to be quickly mergeable.

Any thoughts on v5-0001-*?

It would be nice to get the uncontentious part of all this (which is
the instrumentation patch) out of the way soon.

-- 
Peter Geoghegan



Hi,

On 2023-01-21 18:37:59 -0800, Peter Geoghegan wrote:
> On Tue, Jan 17, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> > I think with a bit of polish "Add autovacuum trigger instrumentation." ought
> > to be quickly mergeable.
> 
> Any thoughts on v5-0001-*?
> 
> It would be nice to get the uncontentious part of all this (which is
> the instrumentation patch) out of the way soon.

Is https://www.postgresql.org/message-id/CAH2-WzmytCuSpaMEhv8H-jt8x_9whTi0T5bjNbH2gvaR0an2Pw%40mail.gmail.com
the last / relevant version of the patch to look at?

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Sat, Jan 21, 2023 at 6:54 PM Andres Freund <andres@anarazel.de> wrote:
> Is https://www.postgresql.org/message-id/CAH2-WzmytCuSpaMEhv8H-jt8x_9whTi0T5bjNbH2gvaR0an2Pw%40mail.gmail.com
> the last / relevant version of the patch to look at?

Yes. I'm mostly just asking about v5-0001-* right now.

-- 
Peter Geoghegan



Hi,

On 2023-01-18 17:54:27 -0800, Peter Geoghegan wrote:
> From 0afaf310255a068d3c1ca9d2ce6f00118cbff890 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Fri, 25 Nov 2022 11:23:20 -0800
> Subject: [PATCH v5 1/2] Add autovacuum trigger instrumentation.
> 
> Add new instrumentation that lists a triggering condition in the server
> log whenever an autovacuum is logged.  This reports "table age" as the
> triggering criteria when antiwraparound autovacuum runs (the XID age
> trigger case and the MXID age trigger case are represented separately).
> Other cases are reported as autovacuum trigger when the tuple insert
> thresholds or the dead tuple thresholds were crossed.
> 
> Author: Peter Geoghegan <pg@bowt.ie>
> Reviewed-By: Andres Freund <andres@anarazel.de>
> Reviewed-By: Jeff Davis <pgsql@j-davis.com>
> Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com
> ---
>  src/include/commands/vacuum.h        |  19 +++-
>  src/backend/access/heap/vacuumlazy.c |   5 ++
>  src/backend/commands/vacuum.c        |  31 ++++++-
>  src/backend/postmaster/autovacuum.c  | 124 ++++++++++++++++++---------
>  4 files changed, 137 insertions(+), 42 deletions(-)
> 
> diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
> index 689dbb770..13f70a1f6 100644
> --- a/src/include/commands/vacuum.h
> +++ b/src/include/commands/vacuum.h
> @@ -191,6 +191,21 @@ typedef struct VacAttrStats
>  #define VACOPT_SKIP_DATABASE_STATS 0x100    /* skip vac_update_datfrozenxid() */
>  #define VACOPT_ONLY_DATABASE_STATS 0x200    /* only vac_update_datfrozenxid() */
>  
> +/*
> + * Values used by autovacuum.c to tell vacuumlazy.c about the specific
> + * threshold type that triggered an autovacuum worker.
> + *
> + * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker.
> + */
> +typedef enum AutoVacType
> +{
> +    AUTOVACUUM_NONE = 0,
> +    AUTOVACUUM_TABLE_XID_AGE,
> +    AUTOVACUUM_TABLE_MXID_AGE,
> +    AUTOVACUUM_DEAD_TUPLES,
> +    AUTOVACUUM_INSERTED_TUPLES,
> +} AutoVacType;

Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not
AUTOVACUUM_DEAD_TUPLES? Both are on tables.


What do you think about naming this VacuumTriggerType and adding an
VAC_TRIG_MANUAL or such?


>  /*
>   * Values used by index_cleanup and truncate params.
>   *
> @@ -222,7 +237,8 @@ typedef struct VacuumParams
>                                               * use default */
>      int            multixact_freeze_table_age; /* multixact age at which to scan
>                                               * whole table */
> -    bool        is_wraparound;    /* force a for-wraparound vacuum */
> +    bool        is_wraparound;    /* antiwraparound autovacuum? */
> +    AutoVacType trigger;        /* autovacuum trigger condition, if any */

The comment change for is_wraparound seems a bit pointless, but whatever.


> @@ -2978,7 +2995,10 @@ relation_needs_vacanalyze(Oid relid,
>                            bool *doanalyze,
>                            bool *wraparound)
>  {

The changes here are still bigger than I'd like, but ...


> -    bool        force_vacuum;
> +    TransactionId relfrozenxid = classForm->relfrozenxid;
> +    MultiXactId relminmxid = classForm->relminmxid;
> +    AutoVacType trigger = AUTOVACUUM_NONE;
> +    bool        tableagevac;

Here + below we end up with three booleans that just represent the choices in
our fancy new enum. That seems awkward to me.



> @@ -3169,14 +3212,15 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
>  static void
>  autovac_report_activity(autovac_table *tab)
>  {
> -#define MAX_AUTOVAC_ACTIV_LEN (NAMEDATALEN * 2 + 56)
> +#define MAX_AUTOVAC_ACTIV_LEN (NAMEDATALEN * 2 + 100)
>      char        activity[MAX_AUTOVAC_ACTIV_LEN];
>      int            len;
>  
>      /* Report the command and possible options */
>      if (tab->at_params.options & VACOPT_VACUUM)
>          snprintf(activity, MAX_AUTOVAC_ACTIV_LEN,
> -                 "autovacuum: VACUUM%s",
> +                 "autovacuum for %s: VACUUM%s",
> +                 vac_autovacuum_trigger_msg(tab->at_params.trigger),
>                   tab->at_params.options & VACOPT_ANALYZE ? " ANALYZE" : "");
>      else
>          snprintf(activity, MAX_AUTOVAC_ACTIV_LEN,

Somehow the added "for ..." sounds a bit awkward. "autovacuum for table XID
age". Maybe "autovacuum due to ..."?

Greetings,

Andres Freund



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Mon, Jan 23, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
> Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not
> AUTOVACUUM_DEAD_TUPLES? Both are on tables.

Why does vacuum_freeze_table_age contain the word "table", while
autovacuum_vacuum_scale_factor does not?

To me, "table XID age" is a less awkward term for "relfrozenxid
advancing", useful in contexts where it's probably more important to
be understood by non-experts than it is to be unambiguous. Besides,
relfrozenxid works at the level of the pg_class metadata. Nothing
whatsoever needs to have changed about the table itself, nor will
anything necessarily be changed by VACUUM (except the relfrozenxid
field from pg_class).

> What do you think about naming this VacuumTriggerType and adding an
> VAC_TRIG_MANUAL or such?

But we're not doing anything with it in the context of manual VACUUMs.
I'd prefer to keep this about what autovacuum.c thought needed to
happen, at least for as long as manual VACUUMs are something that
autovacuum.c knows nothing about.

> > -     bool            force_vacuum;
> > +     TransactionId relfrozenxid = classForm->relfrozenxid;
> > +     MultiXactId relminmxid = classForm->relminmxid;
> > +     AutoVacType trigger = AUTOVACUUM_NONE;
> > +     bool            tableagevac;
>
> Here + below we end up with three booleans that just represent the choices in
> our fancy new enum. That seems awkward to me.

I don't follow. It's true that "wraparound" is still a synonym of
"tableagevac" in 0001, but that changes in 0002. And even if you
assume that 0002 won't get in, I think that it still makes sense to
structure it in a way that shows that in principle the "wraparound"
behaviors don't necessarily have to be used whenever "tableagevac" is
in use.

> > @@ -3169,14 +3212,15 @@ autovacuum_do_vac_analyze(autovac_table *tab, BufferAccessStrategy bstrategy)
> >  static void
> >  autovac_report_activity(autovac_table *tab)

> Somehow the added "for ..." sounds a bit awkward. "autovacuum for table XID
> age". Maybe "autovacuum due to ..."?

That works just as well IMV. I'll change it to that.

Anything else for 0001? Would be nice to get it committed tomorrow.

-- 
Peter Geoghegan



On Fri, Jan 20, 2023 at 4:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > It sounds like they used DROP TRIGGER pretty regularly. So I think this
> > sounds like exactly the kind of case I was talking about, where
> > autovacuums keep getting cancelled until we decide to stop cancelling
> > them.
>
> I don't know how you can reach that conclusion.

I can accept that there might be some way I'm wrong about this in
theory, but your email then seems to go on to say that I'm right just
a few sentences later:

> The whole article was about how this DROP TRIGGER pattern worked just
> fine most of the time, because most of the time autovacuum was just
> autocancelled. They say this at one point:
>
> "The normal autovacuum mechanism is skipped when locks are held in
> order to minimize service disruption. However, because transaction
> wraparound is such a severe problem, if the system gets too close to
> wraparound, an autovacuum is launched that does not back off under
> lock contention."

If this isn't arguing in favor of exactly what I'm saying, I don't
know what that would look like.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Tue, Jan 24, 2023 at 11:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > The whole article was about how this DROP TRIGGER pattern worked just
> > fine most of the time, because most of the time autovacuum was just
> > autocancelled. They say this at one point:
> >
> > "The normal autovacuum mechanism is skipped when locks are held in
> > order to minimize service disruption. However, because transaction
> > wraparound is such a severe problem, if the system gets too close to
> > wraparound, an autovacuum is launched that does not back off under
> > lock contention."
>
> If this isn't arguing in favor of exactly what I'm saying, I don't
> know what that would look like.

I'm happy to clear that up. What you said was:

"So I think this sounds like exactly the kind of case I was talking about, where
autovacuums keep getting cancelled until we decide to stop cancelling them.
If so, then they were going to have a problem whenever that happened."

Just because *some* autovacuums get cancelled doesn't mean they *all*
get cancelled. And, even if the rate is quite high, that may not be
much of a problem in itself (especially now that we have the freeze
map). 200 million XIDs usually amounts to a lot of wall clock time.
Even if it is rather difficult to finish up, we only have to get lucky
once.

The fact that autovacuum eventually got to the point of requiring an
antiwraparound autovacuum on the problematic table does indeed
strongly suggest that any other, earlier autovacuums were relatively
unlikely to have advanced relfrozenxid in the end -- or at least
couldn't on this one occasion. But that in itself is just not relevant
to our current discussion, since even the tiniest perturbation would
have been enough to prevent a non-aggressive VACUUM from being able to
advance relfrozenxid. Before 15, non-aggressive VACUUMs would throw
away the opportunity to do so just because they couldn't immediately
get a cleanup lock on one single heap page.

It's quite possible that most or all prior aggressive VACUUMs were not
antiwraparound autovacuums, because the dead tuples accounting was
enough to launch an autovacuum at some point after age(relfrozenxid)
exceeded vacuum_freeze_table_age that was still before it could reach
autovacuum_freeze_max_age. That would give you a cancellable
aggressive VACUUM -- a VACUUM that actually has a non-zero chance of
advancing relfrozenxid.
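
With stock settings that window is 150-200 million XIDs of table age:

SHOW vacuum_freeze_table_age;       -- 150000000 by default
SHOW autovacuum_freeze_max_age;     -- 200000000 by default
-- a dead-tuples-triggered autovacuum that starts inside that window runs as
-- an aggressive VACUUM, yet it is still subject to auto-cancellation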

Sure, it's possible that such a cancellable aggressive autovacuum was
indeed cancelled, and that that factor made the crucial difference.
But I find it far easier to believe that there simply was no such
aggressive autovacuum in the first place (not this time), since it
could have only happened when autovacuum thinks that there are
sufficiently many dead tuples to justify launching an autovacuum in
the first place. Which, as we now all accept, is based on highly
dubious sampling by ANALYZE. So I think it's much more likely to be
that factor (dead tuple accounting is bad), as well as the absurd
false dichotomy between aggressive and non-aggressive -- plus the
issue at hand, the auto-cancellation behavior.

I don't claim to know what is inevitable, or what is guaranteed to
work or not work. I only claim that we can meaningfully reduce the
absolute risk by using a fairly simple approach, principally by not
needlessly coupling the auto-cancellation behavior to *all*
autovacuums that are specifically triggered by age(relfrozenxid). As
Andres said at one point, doing those two things at exactly the same
time is just arbitrary.

--
Peter Geoghegan



Hi,

On 2023-01-23 19:22:18 -0800, Peter Geoghegan wrote:
> On Mon, Jan 23, 2023 at 6:56 PM Andres Freund <andres@anarazel.de> wrote:
> > Why is there TABLE_ in AUTOVACUUM_TABLE_XID_AGE but not
> > AUTOVACUUM_DEAD_TUPLES? Both are on tables.
> 
> Why does vacuum_freeze_table_age contain the word "table", while
> autovacuum_vacuum_scale_factor does not?

I don't know. But that's not really a reason to introduce more oddities.


> To me, "table XID age" is a less awkward term for "relfrozenxid
> advancing", useful in contexts where it's probably more important to
> be understood by non-experts than it is to be unambiguous. Besides,
> relfrozenxid works at the level of the pg_class metadata. Nothing
> whatsoever needs to have changed about the table itself, nor will
> anything necessarily be changed by VACUUM (except the relfrozenxid
> field from pg_class).

I'd just go for "xid age", I don't see a point in adding 'table', particularly
when you don't for dead tuples.


> > What do you think about naming this VacuumTriggerType and adding an
> > VAC_TRIG_MANUAL or such?
> 
> But we're not doing anything with it in the context of manual VACUUMs.

It's a member of a struct passed to the routines handling both manual VACUUM
and autovacuum.  And we could e.g. eventually start replacing
IsAutoVacuumWorkerProcess() checks with it - checks which aren't going to work
well if we add parallel index vacuuming support to autovacuum.



> I'd prefer to keep this about what autovacuum.c thought needed to
> happen, at least for as long as manual VACUUMs are something that
> autovacuum.c knows nothing about.

It's an enum defined in a general header, not something in autovacuum.c - so I
don't really buy this.


> > > -     bool            force_vacuum;
> > > +     TransactionId relfrozenxid = classForm->relfrozenxid;
> > > +     MultiXactId relminmxid = classForm->relminmxid;
> > > +     AutoVacType trigger = AUTOVACUUM_NONE;
> > > +     bool            tableagevac;
> >
> > Here + below we end up with three booleans that just represent the choices in
> > our fancy new enum. That seems awkward to me.
> 
> I don't follow. It's true that "wraparound" is still a synonym of
> "tableagevac" in 0001, but that changes in 0002. And even if you
> assume that 0002 won't get in, I think that it still makes sense to
> structure it in a way that shows that in principle the "wraparound"
> behaviors don't necessarily have to be used whenever "tableagevac" is
> in use.

You have booleans tableagevac, deadtupvac, inserttupvac. Particularly the
latter ones really are just a rephrasing of the trigger:

+    tableagevac = true;
+    *wraparound = false;
+    /* See header comments about trigger precedence */
+    if (TransactionIdIsNormal(relfrozenxid) &&
+        TransactionIdPrecedes(relfrozenxid, xidForceLimit))
+        trigger = AUTOVACUUM_TABLE_XID_AGE;
+    else if (MultiXactIdIsValid(relminmxid) &&
+             MultiXactIdPrecedes(relminmxid, multiForceLimit))
+        trigger = AUTOVACUUM_TABLE_MXID_AGE;
+    else
+        tableagevac = false;
+
+    /* User disabled non-table-age autovacuums in pg_class.reloptions? */
+    if (!av_enabled && !tableagevac)

...

+        deadtupvac = (vactuples > vacthresh);
+        inserttupvac = (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
+        /* See header comments about trigger precedence */
+        if (!tableagevac)
+        {
+            if (deadtupvac)
+                trigger = AUTOVACUUM_DEAD_TUPLES;
+            else if (inserttupvac)
+                trigger = AUTOVACUUM_INSERTED_TUPLES;
+        }
+
         /* Determine if this table needs vacuum or analyze. */
-        *dovacuum = force_vacuum || (vactuples > vacthresh) ||
-            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
+        *dovacuum = (tableagevac || deadtupvac || inserttupvac);


I find this to be awkward code. The booleans are kinda pointless, and the
tableagevac case is hard to follow because trigger is set elsewhere.

I can give reformulating it a go. Need to make some food first.


I suspect that the code would look better if we didn't continue to have
"bool *dovacuum" and the trigger. They're redundant.


> Anything else for 0001? Would be nice to get it committed tomorrow.

Sorry, today was busy with meetings and bashing my head against AIX.

Greetings,

Andres Freund



Hi,

On 2023-01-24 20:59:04 -0800, Andres Freund wrote:
> I find this to be awkward code. The booleans are kinda pointless, and the
> tableagevac case is hard to follow because trigger is set elsewhere.
>
> I can give reformulating it a go. Need to make some food first.

Here's a draft of what I am thinking of. Not perfect yet, but I think it looks
better.

The pg_stat_activity output looks like this right now:

autovacuum due to table XID age: VACUUM public.pgbench_accounts (to prevent wraparound)

Why don't we drop the "(to prevent wraparound)" now?

And I still think removing the "table " bit would be an improvement.

Greetings,

Andres Freund

On Tue, Jan 24, 2023 at 3:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Sure, it's possible that such a cancellable aggressive autovacuum was
> indeed cancelled, and that that factor made the crucial difference.
> But I find it far easier to believe that there simply was no such
> aggressive autovacuum in the first place (not this time), since it
> could have only happened when autovacuum thinks that there are
> sufficiently many dead tuples to justify launching an autovacuum in
> the first place. Which, as we now all accept, is based on highly
> dubious sampling by ANALYZE. So I think it's much more likely to be
> that factor (dead tuple accounting is bad), as well as the absurd
> false dichotomy between aggressive and non-aggressive -- plus the
> issue at hand, the auto-cancellation behavior.

In my opinion, this is too speculative to justify making changes to
the behavior. I'm not here to defend the way we do dead tuple
accounting. I think it's a hot mess. But whether or not it played any
role in this catastrophe is hard to say. The problems with dead tuple
accounting are, as I understand it, all about statistical
independence. That is, we make assumptions that what's true of the
sample is likely to be true of the whole table when in reality it may
not be true at all. Perhaps it's even unlikely to be true. But the
kinds of problems you get from assuming statistical independence tend
to hit users very unevenly. We make similar assumptions about
selectivity estimation: unless there are extended statistics, we take
P(a=1 and b=1) = P(a=1)*P(b=1), which can be vastly and dramatically
wrong. People can and do get extremely bad query plans as a result of
that assumption. However, other people run PostgreSQL for years and
years and never really have big problems with it. I'd say that it's
completely fair to describe this as a big problem, but we can't
therefore conclude that some particular user has this problem, not
even if we know that they have a slow query and we know that it's due
to a bad plan. And similarly here, I don't see a particular reason to
think that your theory about what happened is more likely than mine. I
freely admit that yours could be right, just as you admitted that mine
could be right. But I think we really just don't know.

It feels unlikely to me that there was ONE cancellable aggressive
autovacuum and that it got cancelled. I think that it's probably
either ZERO or LOTS, depending on whether the dead tuple threshold was
ever reached. If it wasn't, then it must have been zero. But if it
was, and the first autovacuum worker to visit that table got
cancelled, then the next one would try again. And
autovacuum_naptime=1m, so if we're not out of autovacuum workers,
we're going to retry that table every minute. If we do run out of
autovacuum workers, which is pretty likely, we'll still launch new
workers in that database as often as we can given when other workers
exit. If the system is very tight on autovacuum capacity, probably
because the cost limit is too low, then you could have a situation
where only one try gets made before we hit autovacuum_freeze_max_age.
Otherwise, though, a single failed try would probably lead to trying a
whole lot more times after that, and you only hit
autovacuum_freeze_max_age if all those attempts fail.
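
(For reference, the knobs behind that retry cadence, at their stock values:)

SHOW autovacuum_naptime;              -- 1min
SHOW autovacuum_max_workers;          -- 3
SHOW autovacuum_vacuum_cost_limit;    -- -1, i.e. fall back to vacuum_cost_limit (200)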

At the risk of repeating myself, here's what bugs me. If we suppose
that your intuition is right and no aggressive autovacuum happened
before autovacuum_freeze_max_age was reached, then what you are
proposing will make things better. But if we suppose that my intuition
is right and many aggressive autovacuums happened before
autovacuum_freeze_max_age was reached, then what you are proposing
will make things worse, because if we've been auto-cancelling
repeatedly we're probably going to keep doing so until we shut that
behavior off, and we want a vacuum to succeed sooner rather than
later. So it doesn't feel obvious to me that we should change
anything. Even if we knew what had happened for certain in this
particular case, I don't know how we can possibly know what is typical
in similar cases.

My personal support experience has been that cases where autovacuum
runs a lot but doesn't solve the problem for some reason are a lot
more common than cases where it doesn't run when it should have done.
That probably accounts for my intuition about what is likely to have
happened here. But as my colleagues are often at pains to point out to
me, my experiences aren't representative of what happens to PostgreSQL
users generally for all kinds of reasons, and therefore sometimes my
intuition is wrong. But since I have nobody else's experiences to use
in lieu of my own, I don't know what else I can use to judge anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

From
Peter Geoghegan
Date:
On Wed, Jan 25, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote:
> Here's a draft of what I am thinking of. Not perfect yet, but I think it looks
> better.

I'm afraid that I will be unable to do any more work on this project.
I have withdrawn it from the CF app.

If you would like to complete some or all of the patches yourself, in
part or in full, I have no objections.

-- 
Peter Geoghegan