Thread: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

I wonder why we're counting the number of dead tuples (or LP_DEAD stub
items) in the relation as a whole in ANALYZE's acquire_sample_rows()
function. Wouldn't it make more sense to focus on the "live vs dead
tuple properties" of heap pages that are not known to be all-visible
when we generate statistics for our pgstat_report_analyze() report?
These statistics collector stats are only for the benefit of autovacuum
scheduling -- and so they're *consumed* in a way that is totally
different to the nearby pg_statistic stats.

There is no good reason for the pgstat_report_analyze() stats to be
based on the same pg_class.relpages "denominator" as the pg_statistic
stats (it's just slightly easier to do it that way in
acquire_sample_rows(), I suppose). On the other hand, an alternative
behavior -- counting totaldeadrows only against sampled
not-all-visible pages -- has a big benefit: doing so
would remove any risk that older/earlier PageIsAllVisible() pages will
bias ANALYZE in the direction of underestimating the count. This isn't
a theoretical benefit -- I have tied it to an issue with the
BenchmarkSQL TPC-C implementation [1].

This approach just seems natural to me. VACUUM intrinsically only
expects dead tuples/line pointers in not-all-visible pages. So
PageIsAllVisible() pages should not be counted here -- they are simply
irrelevant, because these stats are for autovacuum, and autovacuum
thinks they're irrelevant. What's more, VACUUM currently uses
vac_estimate_reltuples() to compensate for the fact that it skips some
pages using the visibility map -- pgstat_report_vacuum() expects a
whole-relation estimate. But if
pgstat_report_vacuum()/pgstat_report_analyze() expected statistics
about the general properties of live vs dead tuples (or LP_DEAD items)
on not-all-visible pages in the first place, then we wouldn't need to
compensate like this.

This new approach also buys us the ability to extrapolate a new
estimated number of dead tuples using old, stale stats. The stats can
be combined with the authoritative/known number of not-all-visible
pages right this second, since it's cheap enough to *accurately*
determine the total number of not-all-visible pages for a heap
relation by calling visibilitymap_count(). My guess is that this would
be much more accurate in practice: provided the original average
number of dead/live tuples (tuples per not-all-visible block) was
still reasonably accurate, the extrapolated "total dead tuples right
now" values would also be accurate.

I'm glossing over some obvious wrinkles here, such as: what happens to
totaldeadrows when 100% of all the pages ANALYZE samples are
PageIsAllVisible() pages? I think that it shouldn't be too hard to
come up with solutions to those problems (the extrapolation idea
already hints at a solution), but for now I'd like to keep the
discussion high level.

[1] https://postgr.es/m/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
-- 
Peter Geoghegan



On Sun, Dec 5, 2021 at 12:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder why we're counting the number of dead tuples (or LP_DEAD stub
> items) in the relation as a whole in ANALYZE's acquire_sample_rows()
> function. Wouldn't it make more sense to focus on the "live vs dead
> tuple properties" of heap pages that are not known to be all-visible
> when we generate statistics for our pgstat_report_analyze() report?
> These statistics collector stats are only for the benefit of autovacuum
> scheduling -- and so they're *consumed* in a way that is totally
> different to the nearby pg_statistic stats.

I think this could be the right idea. I'm not certain that it is, but
it does sound believable.

> This new approach also buys us the ability to extrapolate a new
> estimated number of dead tuples using old, stale stats. The stats can
> be combined with the authoritative/known number of not-all-visible
> pages right this second, since it's cheap enough to *accurately*
> determine the total number of not-all-visible pages for a heap
> relation by calling visibilitymap_count(). My guess is that this would
> be much more accurate in practice: provided the original average
> number of dead/live tuples (tuples per not-all-visible block) was
> still reasonably accurate, the extrapolated "total dead tuples right
> now" values would also be accurate.

So does this. If some of the table is now all-visible when it wasn't
before, it's certainly a good guess that the portions that still
aren't have about the same distribution of dead tuples that they did
before ... although the other direction is less clear: it seems
possible that newly not-all-visible pages have fewer dead tuples than
ones which have been not-all-visible for a while. But you have to make
some guess.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Dec 6, 2021 at 12:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So does this. If some of the table is now all-visible when it wasn't
> before, it's certainly a good guess that the portions that still
> aren't have about the same distribution of dead tuples that they did
> before ... although the other direction is less clear: it seems
> possible that newly not-all-visible pages have fewer dead tuples than
> ones which have been not-all-visible for a while. But you have to make
> some guess.

To me, it seems natural to accept and even embrace the inherent
uncertainty about the number of dead tuples. We should model our
current belief about how many dead tuples are in the table as a
probability density function (or something along the same lines).
There is a true "sample space" here. Once we focus on not-all-visible
pages, using authoritative VM info, many kinds of misestimations are
clearly impossible. For example, there are only so many
not-all-visible heap pages, and they can only hold so many dead tuples
(up to MaxHeapTuplesPerPage). This is a certainty.

The number of dead tuples in the table is an inherently dynamic thing,
which makes it totally dissimilar to the pg_statistics-based stats.
And so a single snapshot of a point in time is inherently much less
useful -- we ought to keep a few sets of old statistics within our new
pgstat_report_analyze() -- maybe 3 or 5. Each set of statistics
includes the total number of relpages at the time, the total number of
not-all-visible pages (i.e. interesting pages) at the time, and the
average number of live and dead tuples encountered. This is
interpreted (along with a current visibilitymap_count()) to get our
so-called probability density function (probably not really a PDF,
probably something simpler and only vaguely similar) within
autovacuum.c.
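
Roughly like this, as a sketch (invented names, not a real data
structure from the tree) -- each retained snapshot is tiny, and any
estimate we derive from the history can be clamped to what the
not-all-visible pages could physically hold:

#define ANALYZE_STATS_HISTORY    5    /* keep the last 3-5 ANALYZE snapshots */
#define MAX_HEAP_TUPLES_PER_PAGE 291  /* stand-in for MaxHeapTuplesPerPage */

typedef struct AnalyzeSnapshot
{
    double      relpages;               /* total pages at ANALYZE time */
    double      notallvisible_pages;    /* "interesting" pages at ANALYZE time */
    double      live_per_page;          /* avg live tuples per such page */
    double      dead_per_page;          /* avg dead tuples/LP_DEAD per such page */
} AnalyzeSnapshot;

typedef struct DeadTupleHistory
{
    int         nsnapshots;
    AnalyzeSnapshot snapshots[ANALYZE_STATS_HISTORY];
} DeadTupleHistory;

/* no estimate can exceed what the not-all-visible pages could physically hold */
static double
clamp_dead_tuple_estimate(double estimate, double notallvisible_pages_now)
{
    double      ceiling = notallvisible_pages_now * MAX_HEAP_TUPLES_PER_PAGE;

    return (estimate > ceiling) ? ceiling : estimate;
}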

It just occurred to me that it makes zero sense that
pgstat_report_vacuum() does approximately the same thing as
pgstat_report_analyze() -- we make no attempt to compensate for the
fact that the report is made by VACUUM specifically, and so reflects
the state of each page in the table immediately after it was processed
by VACUUM. ISTM that this makes it much more likely to appear as an
underestimate later on -- pgstat_report_vacuum() gets the furthest
possible thing from a random sample. Whereas if we had more context
(specifically that there are very few or even 0 all-visible pages), it
wouldn't hurt us at all, and we wouldn't need to have special
pgstat_report_vacuum()-only heuristics.

-- 
Peter Geoghegan



On Mon, Dec 6, 2021 at 2:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Dec 6, 2021 at 12:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > So does this. If some of the table is now all-visible when it wasn't
> > before, it's certainly a good guess that the portions that still
> > aren't have about the same distribution of dead tuples that they did
> > before ... although the other direction is less clear: it seems
> > possible that newly not-all-visible pages have fewer dead tuples than
> > ones which have been not-all-visible for a while. But you have to make
> > some guess.
>
> To me, it seems natural to accept and even embrace the inherent
> uncertainty about the number of dead tuples.

> The number of dead tuples in the table is an inherently dynamic thing,
> which makes it totally dissimilar to the pg_statistics-based stats.
> And so a single snapshot of a point in time is inherently much less
> useful -- we ought to keep a few sets of old statistics within our new
> pgstat_report_analyze() -- maybe 3 or 5.

I just realized that I didn't really get around to explicitly
connecting this to your point about newly not-all-visible pages being
quite different to older ones that ANALYZE has seen -- which is
definitely an important consideration. I'll do so now:

Keeping some history makes the algorithm "less gullible" (a more
useful goal than making it "smarter", at least IMV). Suppose that our
starting point is 2 pieces of authoritative information, which are
current as of the instant we want to estimate the number of dead
tuples for VACUUM: 1. total relation size (relpages), and 2. current
not-all-visible-pages count (interesting/countable pages, calculated
by taking the "complement" of the visibilitymap_count() value). Further
suppose we store the same 2 pieces of information in our ANALYZE
stats, reported via pgstat_report_analyze() -- the same 2 pieces of
information are stored alongside the actual count of dead tuples and
live tuples found on not-all-visible pages.

The algorithm avoids believing silly things about dead tuples by
considering the delta between each piece of information, particularly
the difference between "right now" and "the last time ANALYZE ran and
called pgstat_report_analyze()". For example, if item 1/relpages
increased by exactly the same number of blocks as item
2/not-all-visible pages (or close enough to it), that is recognized as
a pretty strong signal. The algorithm should consider the newly
not-all-visible pages as likely to have very few dead tuples. At the
same time, the algorithm should not change its beliefs about the
concentration of dead tuples in remaining, older not-all-visible
pages.
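
As a sketch (invented names, and a made-up 90% cutoff, just to show
the shape of the idea):

/*
 * prev_* values come from the last pgstat_report_analyze() report;
 * relpages_now and notallvisible_now are authoritative, current values.
 */
static double
estimate_dead_tuples(double prev_relpages, double prev_notallvisible,
                     double prev_dead_per_page,
                     double relpages_now, double notallvisible_now)
{
    double      new_pages = relpages_now - prev_relpages;
    double      new_notallvisible = notallvisible_now - prev_notallvisible;
    double      older_notallvisible;

    if (new_pages > 0 && new_notallvisible >= 0.9 * new_pages)
    {
        /*
         * Relation growth explains (nearly) all of the growth in
         * not-all-visible pages: treat the newly appended pages as having
         * very few dead tuples, but keep the old per-page average for the
         * pages that were already not-all-visible last time.
         */
        older_notallvisible = notallvisible_now - new_pages;
        if (older_notallvisible < 0)
            older_notallvisible = 0;
        return older_notallvisible * prev_dead_per_page;
    }

    /* no strong signal -- apply the old per-page average everywhere */
    return notallvisible_now * prev_dead_per_page;
}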

This kind of thing will still have problems, no doubt. But I'd much
rather err in the direction of over-counting dead tuples like this.
The impact of the problem on the workload/autovacuum is a big part of
the picture here.

Suppose we believe that not-all-visible pages have 20 LP_DEAD items on
average, and they turn out to only have 3 or 5. Theoretically we've
done the wrong thing by launching autovacuum workers sooner -- we
introduce bias. But we also have lower variance over time, which might
make it worth it. I also think that it might not really matter at all.
It's no great tragedy if we clean up and set pages all-visible in the
visibility map a little earlier on average. It might even be a
positive thing.

The fact that the user expresses the dead-tuple-wise threshold using
autovacuum_vacuum_scale_factor is already somewhat arbitrary -- it is
based on some pretty iffy assumptions. Even if we greatly overestimate
dead tuples with the new algorithm, we're only doing so under
circumstances that might have caused
autovacuum_vacuum_insert_scale_factor to launch an autovacuum worker
anyway. Just setting the visibility map bit has considerable value.

--
Peter Geoghegan



On Mon, Dec 6, 2021 at 8:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Suppose we believe that not-all-visible pages have 20 LP_DEAD items on
> average, and they turn out to only have 3 or 5. Theoretically we've
> done the wrong thing by launching autovacuum workers sooner -- we
> introduce bias. But we also have lower variance over time, which might
> make it worth it. I also think that it might not really matter at all.
> It's no great tragedy if we clean up and set pages all-visible in the
> visibility map a little earlier on average. It might even be a
> positive thing.

This doesn't seem convincing. Launching autovacuum too soon surely has
costs that someone might not want to pay. Clearly in the degenerate
case where we always autovacuum every table every time an autovacuum
worker is launched, we have gone insane. So arbitrarily large moves in
that direction can't be viewed as unproblematic.

> The fact that the user expresses the dead-tuple-wise threshold using
> autovacuum_vacuum_scale_factor is already somewhat arbitrary -- it is
> based on some pretty iffy assumptions. Even if we greatly overestimate
> dead tuples with the new algorithm, we're only doing so under
> circumstances that might have caused
> autovacuum_vacuum_insert_scale_factor to launch an autovacuum worker
> anyway. Just setting the visibility map bit has considerable value.

Now, on the other hand, I *most definitely* think
autovacuum_vacuum_scale_factor is hogwash. Everything I've seen
indicates that, while you do want to wait for a larger number of dead
tuples in a large table than in a small one, it's sublinear. I don't
know whether it's proportional to sqrt(table_size) or table_size^(1/3)
or lg(table_size) or table_size^(0.729837166538), but just plain old
table_size is definitely not it.
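
Just to illustrate how far apart those candidates get (the constants
here are arbitrary, chosen only so that all of the formulas agree at a
million tuples):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      sizes[] = {1e4, 1e6, 1e8, 1e10};

    for (int i = 0; i < 4; i++)
    {
        double      n = sizes[i];

        printf("n=%.0e  linear=%.3g  sqrt=%.3g  cbrt=%.3g  log=%.3g\n",
               n,
               0.2 * n,                          /* scale_factor-style: 20% of table */
               200000.0 * sqrt(n / 1e6),         /* ~ sqrt(table_size) */
               200000.0 * cbrt(n / 1e6),         /* ~ table_size^(1/3) */
               200000.0 * log(n) / log(1e6));    /* ~ lg(table_size) */
    }
    return 0;
}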

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Dec 6, 2021 at 6:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> This doesn't seem convincing. Launching autovacuum too soon surely has
> costs that someone might not want to pay. Clearly in the degenerate
> case where we always autovacuum every table every time an autovacuum
> worker is launched, we have gone insane.

Unfortunately we already sometimes behave insanely in exactly the way
that you describe:

https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com

That is, in addition to the problem that I'm highlighting on this
thread, we also have the opposite problem: autovacuum chases its tail
when it sees dead heap-only tuples that opportunistic pruning can take
care of on its own. I bet that both effects sometimes cancel each
other out, in weird and unpredictable ways. The cancellation might be
protective at first, and then become less so.

> So arbitrarily large moves in
> that direction can't be viewed as unproblematic.

I certainly wouldn't argue that they are. Just that the current
approach of simply counting dead tuples in the table (or trying to,
using sampling) and later launching autovacuum (when dead tuples
crosses a pretty arbitrary threshold) has many problems -- problems
that make us either run autovacuum too aggressively, or not
aggressively enough (relative to what the docs suggest is supposed to
happen).

> Now, on the other hand, I *most definitely* think
> autovacuum_vacuum_scale_factor is hogwash. Everything I've seen
> indicates that, while you do want to wait for a larger number of dead
> tuples in a large table than in a small one, it's sublinear. I don't
> know whether it's proportional to sqrt(table_size) or table_size^(1/3)
> or lg(table_size) or table_size^(0.729837166538), but just plain old
> table_size is definitely not it.

I think that it'll vary considerably, based on many factors. Making
the precise threshold "open to interpretation" to some degree (by
autovacuum.c) seems like it might help us with future optimizations.

It's hard to define a break-even point for launching an autovacuum
worker. I think it would be more productive to come up with a design
that at least doesn't go completely off the rails in various specific
ways. I also think that our problem is not so much that we don't have
accurate statistics about dead tuples (though we surely don't have
much accuracy). The main problem seems to be that there are various
specific, important ways in which the distribution of dead tuples may
matter (leading to various harmful biases). And so it seems reasonable
to fudge how we interpret dead tuples with the intention of capturing
some of that, as a medium term solution. Something that considers the
concentration of dead tuples in heap pages seems promising.

--
Peter Geoghegan



On Mon, Dec 6, 2021 at 9:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Dec 6, 2021 at 6:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > This doesn't seem convincing. Launching autovacuum too soon surely has
> > costs that someone might not want to pay. Clearly in the degenerate
> > case where we always autovacuum every table every time an autovacuum
> > worker is launched, we have gone insane.
>
> Unfortunately we already sometimes behave insanely in exactly the way
> that you describe:
>
> https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com

In the same ballpark is http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

> It's hard to define a break-even point for launching an autovacuum
> worker. I think it would be more productive to come up with a design
> that at least doesn't go completely off the rails in various specific
> ways.

I think that's a good observation. I think the current autovacuum
algorithm works pretty well when things are normal. But in extreme
scenarios it does not degrade gracefully. The
vacuuming-over-and-over-without-doing-anything phenomenon is an
example of that. Another example is the user who creates 10,000
databases and we're happy to divide the 60s-autovacuum_naptime by
10,000 and try to launch a worker every 6 ms. A third example is
vacuuming the tables from pg_class in physical order on disk, so that
a table that is 1 XID past the wraparound limit can result in a long
delay vacuuming a table that is bloating quickly, or conversely a
table that is bloating very slowly but has just crawled over the
threshold for a regular vacuum gets processed before one that is
threatening an imminent wraparound shutdown. I think these are all
pathological cases that a well-informed human can easily recognize and
handle in an intelligent manner, and it doesn't seem crazy to program
those responses into the computer in some way.

> I also think that our problem is not so much that we don't have
> accurate statistics about dead tuples (though we surely don't have
> much accuracy). The main problem seems to be that there are various
> specific, important ways in which the distribution of dead tuples may
> matter (leading to various harmful biases). And so it seems reasonable
> to fudge how we interpret dead tuples with the intention of capturing
> some of that, as a medium term solution. Something that considers the
> concentration of dead tuples in heap pages seems promising.

I am less convinced about this part. It sort of sounds like you're
saying - it doesn't really matter whether the numbers we gather are
accurate, just that they produce the correct results. If the
information we currently gather wouldn't produce the right results
even if it were fully accurate, that to me suggests that we're
gathering the wrong information, and we should gather something else.
For example, we could attack the useless-vacuuming problem by having
each vacuum figure out - and store - the oldest XID that could
potentially be worth using as a cutoff for vacuuming that table, and
refusing to autovacuum that table again until we'd be able to use a
cutoff >= that value. I suppose this would be the oldest of (a) any
XID that exists in the table on a tuple that we judged recently dead,
(b) any XID that is currently-running, and (c) the next XID.
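
A rough sketch of what I mean (hypothetical names, and real code would
of course compare XIDs with TransactionIdPrecedes() rather than plain
"<", because of wraparound):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * After each vacuum, store the oldest XID that could have been worth using
 * as a cutoff: the oldest of (a) any XID on a recently-dead tuple, (b) any
 * currently-running XID, and (c) the next XID.
 */
static TransactionId
min_useful_cutoff(TransactionId oldest_recently_dead,
                  TransactionId oldest_running,
                  TransactionId next_xid)
{
    TransactionId result = next_xid;

    if (oldest_recently_dead < result)
        result = oldest_recently_dead;
    if (oldest_running < result)
        result = oldest_running;
    return result;
}

/* refuse to autovacuum the table again until we could actually do better */
static bool
vacuum_could_help(TransactionId stored_min_useful_cutoff,
                  TransactionId achievable_cutoff_now)
{
    return achievable_cutoff_now >= stored_min_useful_cutoff;
}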

I also accept that knowing the concentration of dead tuples on pages
that contain at least 1 dead tuple could be interesting. I've felt for
a while that it's a mistake to know how many dead tuples there are but
not how many pages contain them, because that affects both the amount
of I/O required to vacuum and also how much need we have to set VM
bits. I'm not sure I would have approached gathering that information
in the way that you're proposing here, but I'm not deeply against it,
either. I do think that we should try to keep it as a principle that
whatever we do gather, we should try our best to make accurate. If
that doesn't work well, then gather different stuff instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Dec 7, 2021 at 7:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think that's a good observation. I think the current autovacuum
> algorithm works pretty well when things are normal. But in extreme
> scenarios it does not degrade gracefully.

+1 to all of the specific examples you go on to describe.

> > I also think that our problem is not so much that we don't have
> > accurate statistics about dead tuples (though we surely don't have
> > much accuracy). The main problem seems to be that there are various
> > specific, important ways in which the distribution of dead tuples may
> > matter (leading to various harmful biases). And so it seems reasonable
> > to fudge how we interpret dead tuples with the intention of capturing
> > some of that, as a medium term solution. Something that considers the
> > concentration of dead tuples in heap pages seems promising.
>
> I am less convinced about this part. It sort of sounds like you're
> saying - it doesn't really matter whether the numbers we gather are
> accurate, just that they produce the correct results.

That's not quite it. More like: I think that it would be reasonable to
define dead tuples abstractly, in a way that's more useful but
nevertheless cannot diverge too much from the current definition. We
should try to capture "directionality", with clear error bounds. This
won't represent the literal truth, but it will be true in spirit (or
much closer).

For example, why should we count dead heap-only tuples from earlier in
a HOT chain, even when we see no evidence that opportunistic HOT
pruning can't keep up on that page? Since we actually care about the
direction of things, not just the present state of things, we'd be
justified in completely ignoring those dead tuples. Similarly, it
might well make sense to give more weight to concentrations of LP_DEAD
items on a page -- that is a signal that things are not going well *at
the level of the page*. Not so much when you have a few LP_DEAD stubs,
but certainly when you have dozens of them on one page, or even
hundreds. And so ISTM that the conditions of the page should influence
how we interpret/count that page's dead tuples, in both directions
(interpret the page as having more dead tuples, or fewer).

We all know that there isn't a sensible answer to the question "if the
abstract cost units used in the optimizer say that one sequential page
access is 4x cheaper than one random page access, then what's the
difference between accessing 10 random pages sequentially in close
succession, versus accessing the same 10 pages randomly?". But at the
same time, we can say for sure that random is more expensive to *some*
degree, but certainly never by multiple orders of magnitude. The model
is imperfect, to be sure, but it is better than many alternative
models that are also technically possible. We need *some* cost model
for a cost-based optimizer, and it is better to be approximately
correct than exactly wrong. Again, the truly important thing is to not
be totally wrong about any one thing.

Another way of putting it is that I am willing to accept a more biased
definition of dead tuples, if that lowers the variance:

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

We have some certainty about what is possible in a world in which we
use the visibility map directly, and so our final estimate should
never be wildly unreasonable -- the abstract idea of dead tuples can
still be anchored to the physical reality.

As with the optimizer and its cost units, we don't have that many
degrees of freedom when autovacuum.c launches a worker, and I don't
see that changing -- we must ultimately decide to launch or not launch
an autovacuum worker (for any given table) each time the autovacuum
launcher wakes up. So we're practically forced to come up with a model
that has some abstract idea of one kind of bloat/dead tuple. I would
say that we already have one, in fact -- it's just not a very good one
because it doesn't take account of obviously-relevant factors like
HOT. It could quite easily be less wrong.

> If the
> information we currently gather wouldn't produce the right results
> even if it were fully accurate, that to me suggests that we're
> gathering the wrong information, and we should gather something else.

I think that counting dead tuples is useful, but not quite sufficient
on its own. At the same time, we still need something that works like
a threshold -- because that's just how the autovacuum.c scheduling
works. It's a go/no-go decision, regardless of how the decision is
made.

> For example, we could attack the useless-vacuuming problem by having
> each vacuum figure out - and store - the oldest XID that could
> potentially be worth using as a cutoff for vacuuming that table, and
> refusing to autovacuum that table again until we'd be able to use a
> cutoff >= that value. I suppose this would be the oldest of (a) any
> XID that exists in the table on a tuple that we judged recently dead,
> (b) any XID that is currently-running, and (c) the next XID.

I agree that that's a good idea, but it seems like something that only
augments what I'm talking about. I suppose that it might become
necessary if we get something along the lines of what we've been
discussing, though.

> I also accept that knowing the concentration of dead tuples on pages
> that contain at least 1 dead tuple could be interesting. I've felt for
> a while that it's a mistake to know how many dead tuples there are but
> not how many pages contain them, because that affects both the amount
> of I/O required to vacuum and also how much need we have to set VM
> bits.

Right. And as I keep saying, the truly important thing is to not
*completely* ignore any relevant dimension of cost. I just don't want
to ever be wildly wrong -- not even once. We can tolerate being
somewhat less accurate all the time (not that we necessarily have to
make a trade-off), but we cannot tolerate pathological behavior. Of
course I include new/theoretical pathological behaviors here (not just
the ones we know about today).

> I'm not sure I would have approached gathering that information
> in the way that you're proposing here, but I'm not deeply against it,
> either. I do think that we should try to keep it as a principle that
> whatever we do gather, we should try our best to make accurate. If
> that doesn't work well, then gather different stuff instead.

It's important to be accurate, but it's also important to be tolerant
of model error, which is inevitable. We should make pragmatic
decisions about what kinds of errors our new model will have. And it
should have at least a rudimentary ability to learn from its mistakes.

-- 
Peter Geoghegan



On Tue, Dec 7, 2021 at 2:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> For example, why should we count dead heap-only tuples from earlier in
> a HOT chain, even when we see no evidence that opportunistic HOT
> pruning can't keep up on that page? Since we actually care about the
> direction of things, not just the present state of things, we'd be
> justified in completely ignoring those dead tuples. Similarly, it
> might well make sense to give more weight to concentrations of LP_DEAD
> items on a page -- that is a signal that things are not going well *at
> the level of the page*. Not so much when you have a few LP_DEAD stubs,
> but certainly when you have dozens of them on one page, or even
> hundreds. And so ISTM that the conditions of the page should influence
> how we interpret/count that page's dead tuples, in both directions
> (interpret the page as having more dead tuples, or fewer).

Well... I mean, I think we're almost saying the same thing, then, but
I think you're saying it more confusingly. I have no objection to
counting the number of dead HOT chains rather than the number of dead
tuples, because that's what affects the index contents, but there's no
need to characterize that as "not the literal truth." There's nothing
fuzzy or untrue about it if we simply say that's what we're doing.

> Right. And as I keep saying, the truly important thing is to not
> *completely* ignore any relevant dimension of cost. I just don't want
> to ever be wildly wrong -- not even once. We can tolerate being
> somewhat less accurate all the time (not that we necessarily have to
> make a trade-off), but we cannot tolerate pathological behavior. Of
> course I include new/theoretical pathological behaviors here (not just
> the ones we know about today).

Sure, but we don't *need* to be less accurate, and I don't think we
even *benefit* from being less accurate. If we do something like count
dead HOT chains instead of dead tuples, let's not call that a
less-accurate count of dead tuples. Let's call it an accurate count of
dead HOT chains.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Dec 7, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Well... I mean, I think we're almost saying the same thing, then, but
> I think you're saying it more confusingly. I have no objection to
> counting the number of dead HOT chains rather than the number of dead
> tuples, because that's what affects the index contents, but there's no
> need to characterize that as "not the literal truth."

Works for me!

> Sure, but we don't *need* to be less accurate, and I don't think we
> even *benefit* from being less accurate. If we do something like count
> dead HOT chains instead of dead tuples, let's not call that a
> less-accurate count of dead tuples. Let's call it an accurate count of
> dead HOT chains.

Fair enough, but even then we still ultimately have to generate a
final number that represents how close we are to a configurable "do an
autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
threshold) -- the autovacuum.c side of this (the consumer side)
fundamentally needs the model to reduce everything to a one
dimensional number (even though the reality is that there isn't just
one dimension). This single number (abstract bloat units, abstract
dead tuples, whatever) is a function of things like the count of dead
HOT chains, perhaps the concentration of dead tuples on heap pages,
whatever -- but it's not the same thing as any one of those things we
count.

I think that this final number needs to be denominated in abstract
units -- we need to call those abstract units *something*. I don't
care what that name ends up being, as long as it reflects reality.
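
For instance -- and this is only a sketch, with invented inputs,
weights, and names:

typedef struct BloatInputs
{
    double      dead_hot_chains;        /* dead HOT chains seen by ANALYZE */
    double      lpdead_items;           /* LP_DEAD stub line pointers */
    double      pages_with_lpdead;      /* how concentrated those items are */
} BloatInputs;

/*
 * Reduce several observed quantities to one scalar, in units that the
 * existing autovacuum_vacuum_scale_factor-style threshold can consume.
 */
static double
abstract_dead_tuples(const BloatInputs *in)
{
    double      per_page = (in->pages_with_lpdead > 0) ?
        in->lpdead_items / in->pages_with_lpdead : 0.0;

    /* weight LP_DEAD items more heavily when they are concentrated */
    double      concentration_boost = (per_page > 10.0) ? 1.5 : 1.0;

    return in->dead_hot_chains + in->lpdead_items * concentration_boost;
}

The particular weights are beside the point; what matters is that the
output is a single number, in units we can name.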

-- 
Peter Geoghegan



On Tue, Dec 7, 2021 at 3:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Fair enough, but even then we still ultimately have to generate a
> final number that represents how close we are to a configurable "do an
> autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
> threshold) -- the autovacuum.c side of this (the consumer side)
> fundamentally needs the model to reduce everything to a one
> dimensional number (even though the reality is that there isn't just
> one dimension). This single number (abstract bloat units, abstract
> dead tuples, whatever) is a function of things like the count of dead
> HOT chains, perhaps the concentration of dead tuples on heap pages,
> whatever -- but it's not the same thing as any one of those things we
> count.
>
> I think that this final number needs to be denominated in abstract
> units -- we need to call those abstract units *something*. I don't
> care what that name ends up being, as long as it reflects reality.

If we're only trying to decide whether or not to vacuum a table, we
don't need units: the output is a Boolean. If we're trying to decide
on an order in which to vacuum tables, then we need units. But such
units can't be anything related to dead tuples, because vacuum can be
needed based on XID age, or MXID age, or dead tuples. The units would
have to be something like abstract vacuum-urgency units (if higher is
more urgent) or abstract remaining-headroom-before-catastrophe units
(if lower is more urgent).

Ignoring wraparound considerations for a moment, I think that we might
want to consider having multiple thresholds and vacuuming the table if
any one of them are met. For example, suppose a table qualifies for
vacuum when %-of-not-all-visible-pages > some-threshold, or
alternatively when %-of-index-tuples-thought-to-be-dead >
some-other-threshold.
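
For example, something along these lines (made-up names and numbers,
and ignoring wraparound-driven vacuums, which would still need to be
handled separately):

#include <stdbool.h>

typedef struct TableVacuumSignals
{
    double      pct_notallvisible_pages;        /* from visibilitymap_count() */
    double      pct_index_tuples_thought_dead;  /* from whatever stats we have */
} TableVacuumSignals;

static bool
needs_autovacuum(const TableVacuumSignals *s)
{
    const double notallvisible_threshold = 20.0;    /* % of heap pages */
    const double dead_index_tuple_threshold = 10.0; /* % of index tuples */

    /* vacuum if any one threshold is crossed */
    return s->pct_notallvisible_pages > notallvisible_threshold ||
           s->pct_index_tuples_thought_dead > dead_index_tuple_threshold;
}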

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Dec 7, 2021 at 1:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
> If we're only trying to decide whether or not to vacuum a table, we
> don't need units: the output is a Boolean.

I was imagining a world in which we preserve the
autovacuum_vacuum_scale_factor design, but interpret it creatively
(but never too creatively) -- an incremental approach seems best to
me. We can even sanity check our abstract bloat unit calculation, in
case the page-level sampling aggregates into a totally wild number of
dead tuples (based in part on the current number of not-all-visible
heap pages) -- so the abstract units are always anchored to the old
idea of dead tuples. Maybe this isn't the best approach, but at least
it addresses compatibility.

*Any* approach based on sampling relatively few random blocks (to look
for signs of bloat) is inherently prone to hugely underestimating the
extent of bloat (which is what we see in TPC-C). I am primarily
concerned about compensating for the inherent limitations that go with
that. To me it seems inappropriate to make statistical inferences
about dead tuples based on a random snapshot of random blocks (usually
only a tiny minority). It is not only possible for the picture to
change utterly -- it is routine, expected, and the entire point.

The entire intellectual justification for statistical sampling (that
mostly works for optimizer stats) just doesn't carry over to
autovacuum stats, for many reasons. At the same time, I don't have any
fundamentally better starting point. That's how I arrived at the idea
of probabilistic modeling based on several recent snapshots from
ANALYZE. The statistics are often rubbish, whether or not we like it,
and regardless of how we decide to count things on each page. And so
it's entirely reasonable to not limit the algorithm to concerns about
the state of things -- the actual exposure of the system to harm (from
overlooking harmful bloat) is also relevant.

> If we're trying to decide
> on an order in which to vacuum tables, then we need units. But such
> units can't be anything related to dead tuples, because vacuum can be
> needed based on XID age, or MXID age, or dead tuples. The units would
> have to be something like abstract vacuum-urgency units (if higher is
> more urgent) or abstract remaining-headroom-before-catastrophe units
> (if lower is more urgent).

I like that idea. But I wonder if the two kinds of threshold should be
totally unrelated. If we're close to the "emergency" XID threshold, and
also close to the
"bloat units" threshold, then it seems reasonable to put our finger on
the scales, and do an autovacuum before either threshold is crossed.
I'm not sure how that should work, but I find the idea of interpreting
the "bloat units" creatively/probabilistically appealing.

We're not actually making things up by erring in the direction of
launching an autovacuum worker, because we don't actually know the
number of dead tuples (or whatever) anyway -- we're just recognizing
the very real role of chance and noise. That is, if the "bloat units"
threshold might well not have been crossed due to random chance
(noise, the phase of the moon), why should we defer to random chance?
If we have better information to go on, like the thing with the XID
threshold, why not prefer that? Similarly, if we see that the system
as a whole is not very busy right now, why not consider that, just a
little, if the only downside is that we'll ignore a demonstrably
noise-level signal (from the stats)?

That's the high level intuition behind making "bloat units" a
probability density function, and not just a simple expected value.
Teaching the autovacuum.c scheduler to distinguish signal from noise
could be very valuable, if it enables opportunistic batching of work,
or off-hours work. We don't have to respect noise. The devil is in the
details, of course.

-- 
Peter Geoghegan