Thread: Re: Maybe we should reduce SKIP_PAGES_THRESHOLD a bit?
On Sun, Dec 15, 2024 at 4:47 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> What's happening is that if the fraction of updated rows is sufficiently
> high, the modified pages are within the SKIP_PAGES_THRESHOLD distance,
> and vacuum switches to seqscan. But with flash storage that happens to
> be harmful - I built master with different SKIP_PAGES_THRESHOLD (1, 2,
> 4, 8 and 16), and collected results for comparison.
>
> <--snip-->
>
> These are timings in seconds. Timings relative to master branch (with
> SKIP_PAGES_THRESHOLD=32) are
>
>     frac       master-1  master-2  master-4  master-8  master-16
>     -----------------------------------------------------------------
>     0.000001     100%       99%       96%       99%       89%
>     0.00001       85%       84%       87%       89%       94%
>     0.000025      63%       66%       65%       71%       80%
>     0.00005       45%       46%       46%       57%       65%
>     0.000075      37%       38%       39%       50%       69%
>     0.0001        34%       35%       36%       51%       70%
>     0.0002        36%       39%       44%       65%       86%
>     0.0005        88%      108%      132%      157%      144%
>     0.001        219%      274%      269%      203%      129%
>     0.01         114%      103%       99%      101%       99%
>
> So most of this shows significant speedups if we simply disable the
> skipping logic (by setting the threshold to 1). There are some
> regressions when scanning a substantial fraction of the table (0.001
> means ~10% pages are modified).

Thanks for running these benchmarks! I know you used explicit vacuum, but
I was just thinking that with the default autovacuum_vacuum_scale_factor
you'd have to modify 20% of the tuples to trigger an autovacuum, so
depending on the fill factor and tuple size, it may be normal for 10% of
the pages to need scanning by vacuum. However, we do tell users that 20%
is too high anyway.

On a related note, the other day I noticed another negative effect caused
in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts with the
opportunistic freeze heuristic [1], causing lots of all-frozen pages to be
scanned when checksums are enabled. You can easily end up with a table
that has very fragmented ranges of frozen, all-visible, and modified
pages. In this case, the opportunistic freeze heuristic bears most of the
blame. However, we are not close to coming up with a replacement
heuristic, so removing SKIP_PAGES_THRESHOLD would help. This wouldn't have
affected your results, but it is worth considering more generally.

> The results from the bigger machine with NVMe seem very similar, except
> that there are no regressions at all. It's always faster to not switch
> to sequential scans.

And we are suggesting to users with bigger, faster machines and SSDs to
tune autovacuum to run more often.

> Obviously, it may not be a good idea to make conclusions based on a
> single (and fairly simplistic) test. But I think it's reasonable to
> conclude the current SKIP_PAGES_THRESHOLD value (i.e. 32) may be a bit
> too high for modern storage. Based on "my" results it seems a value in
> the 4-8 range might be more appropriate.

A range of 4-8 would address some of the negative effects I saw with the
interaction of SKIP_PAGES_THRESHOLD and the opportunistic freeze heuristic
causing all-frozen pages to be read.

> Or perhaps it should be user-tunable? I'm very reluctant to add more and
> more knobs like this, but if we have effective_io_concurrency, isn't
> this a very similar thing? Or maybe it'd be possible to adjust this
> automatically based on existing parameters (esp. random_page_cost and
> effective_io_concurrency)?

I don't think we want to expose it to the user -- given that the plan in
the long run is to get rid of it.
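To make the mechanism under discussion concrete, the skipping decision
boils down to roughly the following -- a simplified sketch of what the
heap_vac_scan_next_block() logic in vacuumlazy.c effectively does today,
with the type as a stand-in and all of the VM and buffer bookkeeping
elided:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

    #define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)

    /*
     * A run of skippable (all-visible) blocks is only actually skipped if
     * it is at least SKIP_PAGES_THRESHOLD blocks long; shorter runs are
     * scanned anyway, in the hope of getting sequential I/O and of
     * occasionally being able to advance relfrozenxid.
     */
    static bool
    skip_this_range(BlockNumber current_block,
                    BlockNumber next_unskippable_block)
    {
        return (next_unskippable_block - current_block) >= SKIP_PAGES_THRESHOLD;
    }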
> I do believe most of this applies even to vacuum with stream API patches
> [2], which leave SKIP_PAGES_THRESHOLD in place. In fact, isn't a similar
> logic built into the read stream itself? Isn't vacuumlazy.c the wrong
> level to do this anyway, as it's about I/O pattern - why should vacuum
> worry about that?

I agree about vacuumlazy.c being an unfortunate place to have this logic.
It makes the code there MUCH more complicated.

The read stream API isn't a drop-in replacement for SKIP_PAGES_THRESHOLD
because it only does read combining and prefetching -- not ensuring
sequential ranges are requested. It still relies on kernel readahead
logic. However, with AIO, Andres has pointed out SKIP_PAGES_THRESHOLD no
longer makes sense. With AIO, multiple reads will be initiated at the
same time, so the kernel won't be getting sequential range requests
anyway (even when using buffered IO). And with direct IO, there is no
kernel readahead logic to confuse.

However, I think your benchmarks show that SKIP_PAGES_THRESHOLD may not
be operating in the way it was intended anymore anyway.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_ZvkxDRtMAhYRTK9N60jfMMVkiQgnEP5m_asYnuvgmQOg%40mail.gmail.com
On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On a related note, the other day I noticed another negative effect > caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts > with the opportunistic freeze heuristic [1] causing lots of all-frozen > pages to be scanned when checksums are enabled. You can easily end up > with a table that has very fragmented ranges of frozen, all-visible, > and modified pages. In this case, the opportunistic freeze heuristic > bears most of the blame. Bears most of the blame for what? Significantly reducing the total amount of WAL written? > However, we are not close to coming up with a > replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > This wouldn't have affected your results, but it is worth considering > more generally. One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes it more likely that non-aggressive VACUUMs will advance relfrozenxid. Granted, it's probably not doing a particularly good job at that right now. But any effort to replace it should account for that. This is possible by making VACUUM consider the cost of scanning extra heap pages up-front. If the number of "extra heap pages to be scanned" to advance relfrozenxid happens to not be very high (or not so high *relative to the current age(relfrozenxid)*), then pay that cost now, in the current VACUUM operation. Even if age(relfrozenxid) is pretty far from the threshold for aggressive mode, if the added cost of advancing relfrozenxid is still not too high, why wouldn't we just do it? I think that aggressive mode is a bad idea more generally. The behavior around waiting for a cleanup lock (the second aggressive-mode-influenced behavior) is also a lot more brittle than it really needs to be, simply because we're not weighing costs and benefits. There's a bunch of relevant information that could be applied when deciding what to do (at the level of each individual heap page that cannot be cleanup locked right away), but we make no effort to apply that information -- we only care about the static choice of aggressive vs. non-aggressive there. -- Peter Geoghegan
On Mon, Dec 16, 2024 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On a related note, the other day I noticed another negative effect > > caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts > > with the opportunistic freeze heuristic [1] causing lots of all-frozen > > pages to be scanned when checksums are enabled. You can easily end up > > with a table that has very fragmented ranges of frozen, all-visible, > > and modified pages. In this case, the opportunistic freeze heuristic > > bears most of the blame. > > Bears most of the blame for what? Significantly reducing the total > amount of WAL written? No, I'm talking about the behavior of causing small pockets of all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD and are then scanned (even though they are already frozen). What I describe in that email I cited is that because we freeze opportunistically when we have or will emit an FPI, and bgwriter will write out blocks in clocksweep order, we end up with random pockets of pages getting frozen during/after a checkpoint. Then in the next vacuum, we end up scanning those all-frozen pages again because the ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is mostly going to happen for an insert-only workload. I'm not saying freezing the pages is bad, I'm saying that causing these pockets of frozen pages leads to scanning all-frozen pages on future vacuums. > > However, we are not close to coming up with a > > replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > > This wouldn't have affected your results, but it is worth considering > > more generally. > > One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes > it more likely that non-aggressive VACUUMs will advance relfrozenxid. > Granted, it's probably not doing a particularly good job at that right > now. But any effort to replace it should account for that. > > This is possible by making VACUUM consider the cost of scanning extra > heap pages up-front. If the number of "extra heap pages to be scanned" > to advance relfrozenxid happens to not be very high (or not so high > *relative to the current age(relfrozenxid)*), then pay that cost now, > in the current VACUUM operation. Even if age(relfrozenxid) is pretty > far from the threshold for aggressive mode, if the added cost of > advancing relfrozenxid is still not too high, why wouldn't we just do > it? That's an interesting idea. And it seems like a much more effective way of getting some relfrozenxid advancement than hoping that the pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have scanned all unfrozen tuples. > I think that aggressive mode is a bad idea more generally. The > behavior around waiting for a cleanup lock (the second > aggressive-mode-influenced behavior) is also a lot more brittle than > it really needs to be, simply because we're not weighing costs and > benefits. There's a bunch of relevant information that could be > applied when deciding what to do (at the level of each individual heap > page that cannot be cleanup locked right away), but we make no effort > to apply that information -- we only care about the static choice of > aggressive vs. non-aggressive there. What kind of information? Could you say more? Andres mentioned the other day that we could set pages all-visible in the VM even if we don't get the cleanup lock (lazy_scan_noprune()) case. 
That seems like a good idea. - Melanie
On Mon, Dec 16, 2024 at 1:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Then in the next vacuum, we end up scanning those all-frozen pages
> again because the ranges of frozen pages are smaller than
> SKIP_PAGES_THRESHOLD. This is mostly going to happen for an insert-only
> workload. I'm not saying freezing the pages is bad, I'm saying that
> causing these pockets of frozen pages leads to scanning all-frozen
> pages on future vacuums.

I guess that it doesn't necessarily matter what component you define as
being at fault here, but FWIW I think that the problem is primarily
SKIP_PAGES_THRESHOLD itself. After all, SKIP_PAGES_THRESHOLD tries to
encourage certain desirable outcomes (namely readahead and non-aggressive
relfrozenxid advancement) without ever really verifying that that's
working out. If VACUUM doesn't truly get readahead (quite likely), it
still pays quite a high cost in CPU cycles. Similarly, if VACUUM skips
even one all-visible page, we can't expect much of any benefit for having
not skipped any other all-visible pages (whether or not we can safely
advance relfrozenxid at all is an all or nothing thing).

> That's an interesting idea. And it seems like a much more effective
> way of getting some relfrozenxid advancement than hoping that the
> pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> scanned all unfrozen tuples.

Sometimes this can be truly perverse. It's possible for a SELECT FOR
SHARE to unset the all-frozen bit, without unsetting the all-visible bit.
So all that it takes is one such SELECT FOR SHARE against one random
tuple in an originally-all-frozen heap page inside a large grouping of
all-frozen heap pages. That's almost certainly enough to obstruct
non-aggressive relfrozenxid advancement _for the entire table_. ISTM that
it's just criminal to miss out on non-aggressive relfrozenxid advancement
because of some tiny issue such as that. And so SKIP_PAGES_THRESHOLD
should be replaced by something that specifically has relfrozenxid as a
goal of reading all-visible pages.

Just how well this works out in practice will be very workload dependent.
But there are workloads/tables where it works quite well:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

> > I think that aggressive mode is a bad idea more generally. The
> > behavior around waiting for a cleanup lock (the second
> > aggressive-mode-influenced behavior) is also a lot more brittle than
> > it really needs to be, simply because we're not weighing costs and
> > benefits. There's a bunch of relevant information that could be
> > applied when deciding what to do (at the level of each individual heap
> > page that cannot be cleanup locked right away), but we make no effort
> > to apply that information -- we only care about the static choice of
> > aggressive vs. non-aggressive there.
>
> What kind of information? Could you say more?

Waiting or not waiting doesn't have to be an either/or choice. For
example, we could retry a few times, with a backoff, if and only if it
seemed to make sense. Testing that I did a few years ago showed that that
worked rather well.

Why should we necessarily have to advance relfrozenxid exactly up to
FreezeLimit during every aggressive VACUUM? Surely the picture over time
and across multiple VACUUM operations is what matters most? At the very
least, we should have an independent XID cutoff for "must advance
relfrozenxid up to here, no matter what" -- we shouldn't just reuse
FreezeLimit to control that behavior. We might very well "try quite hard
to advance relfrozenxid to a value >= FreezeLimit" -- we just don't have
to do it no matter what the cost is. There is a huge practical difference
between "try quite hard" (e.g., retry the cleanup lock acquisition 3
times, with a sleep between each) and "try infinitely hard" (i.e., wait
for a cleanup lock indefinitely).

VACUUM is the only thing that freezes. Right now we'll sometimes wait for
a cleanup lock forever, because "we have to freeze". But waiting like
that necessarily implies that we cannot freeze any other page in the
table. That might actually be the least worst thing that VACUUM can do,
but only in the extreme case where we've really put it off for as long as
we possibly could (by not waiting forever in prior VACUUM operations).
There is actually a lot of potential benefit from "kicking the can down
the road", if we go about it intelligently. Just waiting forever is a
pretty horrible strategy, and so it should only be the strategy of last
resort (we might reasonably be able to dump a log report about it in a
world where it's truly the strategy of last resort).

> Andres mentioned the other day that we could set pages all-visible in
> the VM even if we don't get the cleanup lock (lazy_scan_noprune())
> case. That seems like a good idea.

Maybe it's a good idea, but right now it poses a similar risk to my
scenario involving a random, isolated SELECT FOR SHARE that happens to
affect some random tuple on a cold/frozen page. Of course, this wouldn't
be all that hard to fix.

--
Peter Geoghegan
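To illustrate the kind of "try quite hard, but not forever" behavior
described above -- a hypothetical sketch only; the three-attempt count,
the sleep durations, and the helper name are made up and not from any
posted patch:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Hypothetical helper: try "quite hard" (but not forever) to get a
     * cleanup lock on a heap page.  Returns true if the cleanup lock was
     * acquired.
     */
    static bool
    try_cleanup_lock_with_backoff(Buffer buf)
    {
        for (int attempt = 0; attempt < 3; attempt++)
        {
            if (ConditionalLockBufferForCleanup(buf))
                return true;

            /* back off a little before retrying: 10ms, then 20ms */
            if (attempt < 2)
                pg_usleep(10000L * (attempt + 1));
        }

        return false;
    }

The caller would fall back to the lazy_scan_noprune() path when this
returns false, much as it does today after a single failed
ConditionalLockBufferForCleanup() call.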
On Mon, Dec 16, 2024 at 2:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > Maybe it's a good idea, but right now it poses a similar risk to my > scenario involving a random, isolated SELECT FOR SHARE that happens to > affect some random tuple on a cold/frozen page. Of course, this > wouldn't be all that hard to fix. BTW, one alternative to a "retry cleanup lock acquisition 3 times" heuristic is a version of freezing/pruning that doesn't require a cleanup lock at all. The only reason that we require a cleanup lock is to make it safe to defragment the page, to free tuple space. Nothing stops you from inventing a variant of pruning/freezing that works just like regular pruning/freezing, but without attempting to free up tuple space -- thus obviating the need for a cleanup lock. This process could still remove dead tuples by (say) setting their xmin to InvalidTransactionId -- that ought to still be safe. I don't think that problems with acquiring cleanup locks happen all that often -- they're probably very rare. But when they do happen they can be extremely serious -- they can even cause outages. And so there's a lot to be said for a design that removes those sorts of risks (or that at least significantly ameliorates them). It gives you the freedom to come up with better high-level designs that don't have to worry about not acquiring cleanup locks. -- Peter Geoghegan
On 12/16/24 19:49, Melanie Plageman wrote: > On Mon, Dec 16, 2024 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote: >> >> On Mon, Dec 16, 2024 at 10:37 AM Melanie Plageman >> <melanieplageman@gmail.com> wrote: >>> On a related note, the other day I noticed another negative effect >>> caused in part by SKIP_PAGES_THRESHOLD. SKIP_PAGES_THRESHOLD interacts >>> with the opportunistic freeze heuristic [1] causing lots of all-frozen >>> pages to be scanned when checksums are enabled. You can easily end up >>> with a table that has very fragmented ranges of frozen, all-visible, >>> and modified pages. In this case, the opportunistic freeze heuristic >>> bears most of the blame. >> >> Bears most of the blame for what? Significantly reducing the total >> amount of WAL written? > > No, I'm talking about the behavior of causing small pockets of > all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > and are then scanned (even though they are already frozen). What I > describe in that email I cited is that because we freeze > opportunistically when we have or will emit an FPI, and bgwriter will > write out blocks in clocksweep order, we end up with random pockets of > pages getting frozen during/after a checkpoint. Then in the next > vacuum, we end up scanning those all-frozen pages again because the > ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > mostly going to happen for an insert-only workload. I'm not saying > freezing the pages is bad, I'm saying that causing these pockets of > frozen pages leads to scanning all-frozen pages on future vacuums. > Yeah, this interaction between the components is not great :-( But can we think of a way to reduce the fragmentation? What would need to change? I don't think bgwriter can help much - it's mostly oblivious to the contents of the buffer, I don't think it could consider stuff like this when deciding what to evict. Maybe the freezing code could check how many of the nearby pages are frozen, and consider that together with the FPI write? >>> However, we are not close to coming up with a >>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. >>> This wouldn't have affected your results, but it is worth considering >>> more generally. >> >> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes >> it more likely that non-aggressive VACUUMs will advance relfrozenxid. >> Granted, it's probably not doing a particularly good job at that right >> now. But any effort to replace it should account for that. >> I don't follow. How could non-aggressive VACUUM advance relfrozenxid, ever? I mean, if it doesn't guarantee freezing all pages, how could it? >> This is possible by making VACUUM consider the cost of scanning extra >> heap pages up-front. If the number of "extra heap pages to be scanned" >> to advance relfrozenxid happens to not be very high (or not so high >> *relative to the current age(relfrozenxid)*), then pay that cost now, >> in the current VACUUM operation. Even if age(relfrozenxid) is pretty >> far from the threshold for aggressive mode, if the added cost of >> advancing relfrozenxid is still not too high, why wouldn't we just do >> it? > > That's an interesting idea. And it seems like a much more effective > way of getting some relfrozenxid advancement than hoping that the > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have > scanned all unfrozen tuples. 
I agree it might be useful to formulate this as a "costing" problem, not
just in the context of a single vacuum, but for the overall maintenance
overhead - essentially accepting the vacuum gets slower, in exchange for
lower cost of maintenance later.

But I think that (a) is going to be fairly complex, because how do you
cost the future vacuum?, and (b) somewhat misses my point that on modern
NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a win
*ever*.

So why shouldn't we reduce the SKIP_PAGES_THRESHOLD value (or perhaps
make it configurable)? We can still do the other stuff (decide how
aggressively to freeze stuff etc.) independently of that.

regards

--
Tomas Vondra
On Mon, Dec 16, 2024 at 2:38 PM Peter Geoghegan <pg@bowt.ie> wrote: > The only reason that we require a cleanup lock is to make it safe to > defragment the page, to free tuple space. Nothing stops you from > inventing a variant of pruning/freezing that works just like regular > pruning/freezing, but without attempting to free up tuple space -- > thus obviating the need for a cleanup lock. This process could still > remove dead tuples by (say) setting their xmin to InvalidTransactionId > -- that ought to still be safe. I think this amounts to inventing a new way to mark a tuple as dead, but I don't think we need to do that. We could just mark the line pointer LP_DEAD without actually touching the space that contains the tuple. At least, I think that would work. The only question in my mind is whether whatever we propose to do here would violate the locking rules documented in src/backend/storage/buffer/README. In my opinion, those rules are a bit vaguely worded in some places, but I interpret #1 and #2 as meaning that you can't look at the line pointer array without some kind of content lock, so an exclusive content lock should be good enough to mark a line pointer dead as long as you don't relocate any tuples. -- Robert Haas EDB: http://www.enterprisedb.com
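As a rough illustration of the idea (hypothetical code: WAL logging, HOT
chains, and the actual visibility checks are all elided, and
item_is_known_dead() is a made-up stand-in for the real dead-tuple test):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"
    #include "storage/off.h"

    /* stand-in for the real visibility / dead-tuple checks */
    static bool item_is_known_dead(Page page, OffsetNumber offnum);

    /*
     * Sketch: mark known-dead line pointers LP_DEAD while holding only an
     * exclusive content lock (no cleanup lock), leaving the tuple storage
     * in place -- i.e. no PageRepairFragmentation() call, so no tuples
     * are relocated.
     */
    static void
    mark_dead_items_without_cleanup_lock(Buffer buf)
    {
        Page        page;
        OffsetNumber offnum,
                    maxoff;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);

            if (ItemIdIsNormal(itemid) && item_is_known_dead(page, offnum))
                ItemIdSetDead(itemid);  /* tuple space is not reclaimed */
        }

        MarkBufferDirty(buf);           /* WAL record omitted in this sketch */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }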
On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > I don't follow. How could non-aggressive VACUUM advance relfrozenxid, > ever? I mean, if it doesn't guarantee freezing all pages, how could it? Although it's very workload dependent, it still happens all the time. Just look at the autovacuum log output from almost any autovacuum that runs when the regression tests run. Or look at the autovacuum output for the small pgbench tables. In general, relfrozenxid simply tracks the oldest possible extant XID in the table. VACUUM doesn't necessarily need to do any freezing to advance relfrozenxid/relminmxid. But VACUUM *must* exhaustively scan every heap page that could possibly contain an old XID in order to be able to advance relfrozenxid/relminmxid safely. > > That's an interesting idea. And it seems like a much more effective > > way of getting some relfrozenxid advancement than hoping that the > > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have > > scanned all unfrozen tuples. > But I think that (a) is going to be fairly complex, because how do you > cost the future vacuum?, and (b) is somewhat misses my point that on > modern NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a > win *ever*. I am not suggesting that the readahead argument for SKIP_PAGES_THRESHOLD is really valid. I think that the relfrozenxid argument is the only one that makes any sense. Clearly both arguments justified the introduction of SKIP_PAGES_THRESHOLD, after the earliest work on the visibility map back in 2009 -- see the commit message for bf136cf6. In short, I am envisaging a design that decides whether or not it'll advance relfrozenxid based on both the costs and the benefits/need. Under this scheme, VACUUM would either scan exactly all all-visible-not-all-frozen pages, or scan none at all. This decision would be almost completely independent of the decision to freeze or not freeze pages (it'd be loosely related because FreezeLimit can never be more than autovacuum_freeze_max_age/2 XIDs in age). Then we'd be free to just get rid of SKIP_PAGES_THRESHOLD, which presumably isn't doing much for readahead. -- Peter Geoghegan
On Tue, Dec 17, 2024 at 10:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think this amounts to inventing a new way to mark a tuple as dead, > but I don't think we need to do that. We could just mark the line > pointer LP_DEAD without actually touching the space that contains the > tuple. At least, I think that would work. You might well be right about that. HOT chains might make things more complicated (a heap-only tuple can only be set LP_UNUSED), but we could probably figure that out if we had to. I'm not entirely sure that this would be worth the effort. I believe that having to wait for a cleanup lock is rare in general. On the other hand, it'd be convenient if the high-level design of VACUUM/autovacuum scheduling could freely ignore the possibility of having to wait for a cleanup lock. > The only question in my mind > is whether whatever we propose to do here would violate the locking > rules documented in src/backend/storage/buffer/README. In my opinion, > those rules are a bit vaguely worded in some places, but I interpret > #1 and #2 as meaning that you can't look at the line pointer array > without some kind of content lock, so an exclusive content lock should > be good enough to mark a line pointer dead as long as you don't > relocate any tuples. That is also my interpretation. -- Peter Geoghegan
On Mon, Dec 16, 2024 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Dec 16, 2024 at 1:50 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Then in the next vacuum, we end up scanning those all-frozen pages
> > again because the ranges of frozen pages are smaller than
> > SKIP_PAGES_THRESHOLD. This is mostly going to happen for an
> > insert-only workload. I'm not saying freezing the pages is bad, I'm
> > saying that causing these pockets of frozen pages leads to scanning
> > all-frozen pages on future vacuums.
>
> I guess that it doesn't necessarily matter what component you define
> as being at fault here, but FWIW I think that the problem is primarily
> SKIP_PAGES_THRESHOLD itself. After all, SKIP_PAGES_THRESHOLD tries to
> encourage certain desirable outcomes (namely readahead and
> non-aggressive relfrozenxid advancement) without ever really verifying
> that that's working out. If VACUUM doesn't truly get readahead (quite
> likely), it still pays quite a high cost in CPU cycles. Similarly, if
> VACUUM skips even one all-visible page, we can't expect much of any
> benefit for having not skipped any other all-visible pages (whether or
> not we can safely advance relfrozenxid at all is an all or nothing
> thing).

Right. ISTM that coming up with a better strategy to enable relfrozenxid
advancement than SKIP_PAGES_THRESHOLD is something everyone agrees would
be worthwhile.

> > That's an interesting idea. And it seems like a much more effective
> > way of getting some relfrozenxid advancement than hoping that the
> > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> > scanned all unfrozen tuples.
>
> Sometimes this can be truly perverse. It's possible for a SELECT FOR
> SHARE to unset the all-frozen bit, without unsetting the all-visible
> bit. So all that it takes is one such SELECT FOR SHARE against one
> random tuple in an originally-all-frozen heap page inside a large
> grouping of all-frozen heap pages. That's almost certainly enough to
> obstruct non-aggressive relfrozenxid advancement _for the entire
> table_. ISTM that it's just criminal to miss out on non-aggressive
> relfrozenxid advancement because of some tiny issue such as that. And
> so SKIP_PAGES_THRESHOLD should be replaced by something that
> specifically has relfrozenxid as a goal of reading all-visible pages.
>
> Just how well this works out in practice will be very workload
> dependent. But there are workloads/tables where it works quite well:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

I did look at the wiki page a bit. But one thing I didn't quite grasp is
how you are proposing to measure the costs/benefits of scanning all
all-visible pages. When you first mentioned this, I imagined you would
use visibilitymap_count() at the beginning of the vacuum and consider
scanning all the all-visible pages if there aren't many (when compared to
the total number of pages needing scanning).

But then, I'm not sure I see how that lets you advance relfrozenxid more
often. It seems like the all-visible pages you would scan this way would
be younger and less likely to be required to freeze (per freeze limit),
so you'd end up just uselessly scanning them.

> Why should we necessarily have to advance relfrozenxid exactly up to
> FreezeLimit during every aggressive VACUUM? Surely the picture over
> time and across multiple VACUUM operations is what matters most? At
> the very least, we should have an independent XID cutoff for "must
> advance relfrozenxid up to here, no matter what" -- we shouldn't just
> reuse FreezeLimit to control that behavior. We might very well "try
> quite hard to advance relfrozenxid to a value >= FreezeLimit" -- we
> just don't have to do it no matter what the cost is. There is a huge
> practical difference between "try quite hard" (e.g., retry the cleanup
> lock acquisition 3 times, with a sleep between each) and "try
> infinitely hard" (i.e., wait for a cleanup lock indefinitely).

I got a bit confused here. Do you mean that because we call
lazy_scan_noprune() and visit tuples this way, we can still advance the
relfrozenxid to the oldest unfrozen xid value just based on what we see
in lazy_scan_noprune() (i.e. even if we don't get the cleanup lock)?

- Melanie
On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > > > > On 12/16/24 19:49, Melanie Plageman wrote: > > > No, I'm talking about the behavior of causing small pockets of > > all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > > and are then scanned (even though they are already frozen). What I > > describe in that email I cited is that because we freeze > > opportunistically when we have or will emit an FPI, and bgwriter will > > write out blocks in clocksweep order, we end up with random pockets of > > pages getting frozen during/after a checkpoint. Then in the next > > vacuum, we end up scanning those all-frozen pages again because the > > ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > > mostly going to happen for an insert-only workload. I'm not saying > > freezing the pages is bad, I'm saying that causing these pockets of > > frozen pages leads to scanning all-frozen pages on future vacuums. > > > > Yeah, this interaction between the components is not great :-( But can > we think of a way to reduce the fragmentation? What would need to change? Well reducing SKIP_PAGES_THRESHOLD would help. And unfortunately we do not know if the skippable pages are all-frozen without extra visibilitymap_get_status() calls -- so we can't decide to avoid scanning ranges of skippable pages because they are frozen. > I don't think bgwriter can help much - it's mostly oblivious to the > contents of the buffer, I don't think it could consider stuff like this > when deciding what to evict. Agreed. > Maybe the freezing code could check how many of the nearby pages are > frozen, and consider that together with the FPI write? That's an interesting idea. We wouldn't have any guaranteed info because we only have a lock on the page we are considering freezing. But we could keep track of the length of a run of pages we are freezing and opportunistically freeze pages that don't require freezing if they follow one or more pages requiring freezing. But I don't know how much more this buys us than removing SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps it makes larger future vacuum reads possible. But I wonder how much benefit it would be vs complexity. > >>> However, we are not close to coming up with a > >>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help. > >>> This wouldn't have affected your results, but it is worth considering > >>> more generally. > >> > >> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes > >> it more likely that non-aggressive VACUUMs will advance relfrozenxid. > >> Granted, it's probably not doing a particularly good job at that right > >> now. But any effort to replace it should account for that. > >> > > I don't follow. How could non-aggressive VACUUM advance relfrozenxid, > ever? I mean, if it doesn't guarantee freezing all pages, how could it? It may, coincidentally, not skip any all-visible pages. Peter points out that this happens all the time for small tables, but wouldn't the overhead of an aggressive vacuum be barely noticeable for small tables? It seems like there is little cost to waiting. > >> This is possible by making VACUUM consider the cost of scanning extra > >> heap pages up-front. If the number of "extra heap pages to be scanned" > >> to advance relfrozenxid happens to not be very high (or not so high > >> *relative to the current age(relfrozenxid)*), then pay that cost now, > >> in the current VACUUM operation. 
> >> Even if age(relfrozenxid) is pretty far from the threshold for
> >> aggressive mode, if the added cost of advancing relfrozenxid is still
> >> not too high, why wouldn't we just do it?
> >
> > That's an interesting idea. And it seems like a much more effective
> > way of getting some relfrozenxid advancement than hoping that the
> > pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
> > scanned all unfrozen tuples.
> >
>
> I agree it might be useful to formulate this as a "costing" problem, not
> just in the context of a single vacuum, but for the overall maintenance
> overhead - essentially accepting the vacuum gets slower, in exchange for
> lower cost of maintenance later.

Yes, that costing sounds like a big research and benchmarking project on
its own.

> But I think that (a) is going to be fairly complex, because how do you
> cost the future vacuum?, and (b) somewhat misses my point that on
> modern NVMe SSD storage (SKIP_PAGES_THRESHOLD > 1) doesn't seem to be a
> win *ever*.
>
> So why shouldn't we reduce the SKIP_PAGES_THRESHOLD value (or perhaps
> make it configurable)? We can still do the other stuff (decide how
> aggressively to freeze stuff etc.) independently of that.

I think your tests show SKIP_PAGES_THRESHOLD has dubious if any benefit
related to readahead. But the question is if we care about it for
advancing relfrozenxid for small tables.

- Melanie
On Tue, Dec 17, 2024 at 12:06 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I think your tests show SKIP_PAGES_THRESHOLD has dubious if any > benefit related to readahead. But the question is if we care about it > for advancing relfrozenxid for small tables. It seems like relfrozenxid advancement is really only a problem for big tables. If the table is small, the eventual aggressive vacuum doesn't cost that much. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/17/24 18:06, Melanie Plageman wrote: > On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: >> >> >> >> On 12/16/24 19:49, Melanie Plageman wrote: >> >>> No, I'm talking about the behavior of causing small pockets of >>> all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD >>> and are then scanned (even though they are already frozen). What I >>> describe in that email I cited is that because we freeze >>> opportunistically when we have or will emit an FPI, and bgwriter will >>> write out blocks in clocksweep order, we end up with random pockets of >>> pages getting frozen during/after a checkpoint. Then in the next >>> vacuum, we end up scanning those all-frozen pages again because the >>> ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is >>> mostly going to happen for an insert-only workload. I'm not saying >>> freezing the pages is bad, I'm saying that causing these pockets of >>> frozen pages leads to scanning all-frozen pages on future vacuums. >>> >> >> Yeah, this interaction between the components is not great :-( But can >> we think of a way to reduce the fragmentation? What would need to change? > > Well reducing SKIP_PAGES_THRESHOLD would help. How does SKIP_PAGES_THRESHOLD change the fragmentation? I think that's the side that's affected by the fragmentation, but it's really due to the eager freezing / bgwriter evictions, etc. If the threshold is set to 1 (i.e. to always skip), that just lowers the impact, but the relation is still as fragmented as before, no? > And unfortunately we do > not know if the skippable pages are all-frozen without extra > visibilitymap_get_status() calls -- so we can't decide to avoid > scanning ranges of skippable pages because they are frozen. > I may be missing something, but doesn't find_next_unskippable_block() already get those bits? In fact, it even checks VISIBILITYMAP_ALL_FROZEN but only for aggressive vacuum. But even if that wasn't the case, isn't checking the VM likely much cheaper than vacuuming the heap page? Also, I don't quite see why would this information help with reducing the fragmentation? Can you explain? >> I don't think bgwriter can help much - it's mostly oblivious to the >> contents of the buffer, I don't think it could consider stuff like this >> when deciding what to evict. > > Agreed. > >> Maybe the freezing code could check how many of the nearby pages are >> frozen, and consider that together with the FPI write? > > That's an interesting idea. We wouldn't have any guaranteed info > because we only have a lock on the page we are considering freezing. > But we could keep track of the length of a run of pages we are > freezing and opportunistically freeze pages that don't require > freezing if they follow one or more pages requiring freezing. I don't think we need a "guaranteed" information - a heuristics that's correct most of the time (say, >90%?) ought to be good enough. I mean, it has to be, because we'll never get a rule that's correct 100%. So even just looking at a batch of pages in VM should be enough, no? > But I don't know how much more this buys us than removing > SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps > it makes larger future vacuum reads possible. But I wonder how much > benefit it would be vs complexity. > I think that depends on which cost we're talking about. If we only talk about the efficiency of a single vacuum, then it probably does not help very much. 
I mean, if we assume the relation is already fragmented, then it seems to
be cheaper to vacuum just the pages that need it (as if with
SKIP_PAGES_THRESHOLD=1).

But if we're talking about long-term benefits, in reducing the amount of
freezing needed overall, maybe it'd be a win? I don't know.

>>>>> However, we are not close to coming up with a
>>>>> replacement heuristic, so removing SKIP_PAGES_THRESHOLD would help.
>>>>> This wouldn't have affected your results, but it is worth considering
>>>>> more generally.
>>>>
>>>> One of the reasons why we have SKIP_PAGES_THRESHOLD is that it makes
>>>> it more likely that non-aggressive VACUUMs will advance relfrozenxid.
>>>> Granted, it's probably not doing a particularly good job at that right
>>>> now. But any effort to replace it should account for that.
>>>>
>>
>> I don't follow. How could non-aggressive VACUUM advance relfrozenxid,
>> ever? I mean, if it doesn't guarantee freezing all pages, how could it?
>
> It may, coincidentally, not skip any all-visible pages. Peter points
> out that this happens all the time for small tables, but wouldn't the
> overhead of an aggressive vacuum be barely noticeable for small
> tables? It seems like there is little cost to waiting.
>

Yeah, that's kinda my point / confusion. For it to help we would have to
not skip any pages, but for large tables that seems quite unlikely
(because it only takes one page that gets skipped, and there are many
opportunities). And for small tables, I think it doesn't matter that
much, because even aggressive vacuum is cheap.

>>>> This is possible by making VACUUM consider the cost of scanning extra
>>>> heap pages up-front. If the number of "extra heap pages to be scanned"
>>>> to advance relfrozenxid happens to not be very high (or not so high
>>>> *relative to the current age(relfrozenxid)*), then pay that cost now,
>>>> in the current VACUUM operation. Even if age(relfrozenxid) is pretty
>>>> far from the threshold for aggressive mode, if the added cost of
>>>> advancing relfrozenxid is still not too high, why wouldn't we just do
>>>> it?
>>>
>>> That's an interesting idea. And it seems like a much more effective
>>> way of getting some relfrozenxid advancement than hoping that the
>>> pages you scan due to SKIP_PAGES_THRESHOLD end up being enough to have
>>> scanned all unfrozen tuples.
>>>
>>
>> I agree it might be useful to formulate this as a "costing" problem, not
>> just in the context of a single vacuum, but for the overall maintenance
>> overhead - essentially accepting the vacuum gets slower, in exchange for
>> lower cost of maintenance later.
>
> Yes, that costing sounds like a big research and benchmarking project
> on its own.
>

True. I don't know if it makes sense to try to construct a detailed /
accurate cost model for this. I meant more of a general model showing the
general relationship between the amount of work that has to happen in
different places.

regards

--
Tomas Vondra
On 12/17/24 18:52, Robert Haas wrote:
> On Tue, Dec 17, 2024 at 12:06 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> I think your tests show SKIP_PAGES_THRESHOLD has dubious if any
>> benefit related to readahead. But the question is if we care about it
>> for advancing relfrozenxid for small tables.

Yeah, although I'm a bit hesitant to draw such a clear conclusion from
one set of benchmarks on one machine (or two, but both with flash). I
would not be surprised if the results were less clear on other types of
storage (say, something like EBS).

>
> It seems like relfrozenxid advancement is really only a problem for
> big tables. If the table is small, the eventual aggressive vacuum
> doesn't cost that much.
>

Yeah, I agree with this.

regards

--
Tomas Vondra
On Tue, Dec 17, 2024 at 11:44 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I did look at the wiki page a bit. But one thing I didn't quite grasp > is how you are proposing to measure the costs/benefits of scanning all > all-visible pages. When you first mentioned this, I imagined you would > use visibilitymap_count() at the beginning of the vacuum and consider > scanning all the all-visible pages if there aren't many (when compared > to the total number of pages needing scanning). FWIW, the way that it actually worked in my abandoned patch set was that VACUUM would always work off of a copy of the VM, taken at the start -- a "VM snapshot" (I had an ambition that this VM snapshot concept would eventually be used to make VACUUM suspendable, since it was easy to serialize to disk in a temp file). Working off a VM snapshot gave VACUUM a fixed idea about which pages it needs to scan -- the cost model worked off of a count (of all-visible vs all-frozen pages) gathered when the VM snapshot was first established. There could never be any repeat VM accesses, since VACUUM just worked off of the snapshot for everything VM related other than setting VM bits. I'm not saying that you should do this exact thing. I'm just providing context. > But then, I'm not sure > I see how that lets you advance relfrozenxid more often. It seems like > the all-visible pages you would scan this way would be younger and > less likely to be required to freeze (per freeze limit), so you'd end > up just uselessly scanning them. I think that it would make sense to have a low "extra pages to scan" threshold of perhaps 5% of rel_pages, so that we automatically scan those pages regardless of the current value of age(relfrozenxid). It could be helpful with advancing relminmxid, in particular. There is a need to work out how that varies as age(relfrozenxid) gets closer to the cutoff for anti-wraparound autovacuum. The way I did that in the old patch set involved ramping up from 5% of rel_pages once age(relfrozenxid) reached the half-way point (half-way to requiring an anti-wraparound autovacuum). Maybe these heuristics aren't the best, but they seem like roughly the right idea to me. A conservative version might be to do something like this, without any of the ramp-up behavior (don't ramp up based on the current age(relfrozenxid)). > > Why should we necessarily have to advance relfrozenxid exactly up to > > FreezeLimit during every aggressive VACUUM? Surely the picture over > > time and across multiple VACUUM operations is what matters most? At > > the very least, we should have an independent XID cutoff for "must > > advance relfrozenxid up to here, no matter what" -- we should just > > reuse FreezeLimit to control that behavior. We might very well "try > > quite hard to advance relfrozenxid to a value >= FreezeLimit" -- we > > just don't have to do it no matter what the cost is. There is a huge > > practical difference between "try quite hard" (e.g., retry the cleanup > > lock acquisition 3 times, with a sleep between each) and "try > > infinitely hard" (i.e., wait for a cleanup lock indefinitely). > > I got a bit confused here. Do you mean that because we call > lazy_scan_noprune() and visit tuples this way, we can still advance > the relfrozenxid to the oldest unfrozen xid value just based on what > we see in lazy_scan_noprune() (i.e. even if we don't get the cleanup > lock)? 
What I meant is that the high-level rule that says that aggressive VACUUM must advance relfrozenxid to a value >= FreezeLimit is fairly arbitrary, and not very principled. It's an artefact of the way that FreezeLimit used to work -- nothing more. It does make sense to have a rule roughly like that, of course, but I think that it should be much looser. VACUUM could be more conservative about advancing relfrozenxid on average (i.e. it could advance relfrozenxid to a value more recent than FreezeLimit, and advance relfrozenxid in more individual VACUUM operations), while at the same time not making an absolute iron-clad guarantee about how much relfrozenxid has to be advanced within any given VACUUM. In short, VACUUM would promise less but deliver more. A part of that might be to teach lazy_scan_noprune to "try a little harder" when it made sense to wait a little while (but not forever) for a cleanup lock. Alternatively, as discussed with Robert today, it might be possible to freeze (and "prune") without requiring a cleanup lock in a way that was sufficient to always be able to advance relfrozenxid without hindrance from cursors with conflicting buffer pins and whatnot. -- Peter Geoghegan
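To put rough numbers on the ramp-up heuristic described a couple of
messages back (a 5% baseline that ramps up once age(relfrozenxid) passes
the half-way point to anti-wraparound autovacuum) -- a hypothetical
sketch, with the linear ramp and every constant chosen purely for
illustration:

    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

    /*
     * Budget of "extra" all-visible pages VACUUM is willing to scan for
     * the sake of relfrozenxid advancement: 5% of rel_pages by default,
     * ramping up linearly towards 100% once age(relfrozenxid) passes half
     * of the anti-wraparound limit (autovacuum_freeze_max_age).
     */
    static BlockNumber
    extra_scan_budget(BlockNumber rel_pages,
                      uint64_t relfrozenxid_age,
                      uint64_t antiwrap_age)
    {
        double      frac = 0.05;
        uint64_t    halfway = antiwrap_age / 2;

        if (halfway == 0)
            return rel_pages;       /* degenerate configuration: no cap */

        if (relfrozenxid_age > halfway)
        {
            /* linear ramp from 5% at the half-way point to 100% at the limit */
            double  p = (double) (relfrozenxid_age - halfway) / (double) halfway;

            frac = 0.05 + p * (1.0 - 0.05);
            if (frac > 1.0)
                frac = 1.0;
        }

        return (BlockNumber) (rel_pages * frac);
    }

The caller would compare this budget against the number of
all-visible-but-not-all-frozen pages counted up front (e.g. via
visibilitymap_count()) and then scan either all of those pages or none of
them.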
On Tue, Dec 17, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 12/17/24 18:06, Melanie Plageman wrote: > > On Tue, Dec 17, 2024 at 9:11 AM Tomas Vondra <tomas@vondra.me> wrote: > >> > >> > >> > >> On 12/16/24 19:49, Melanie Plageman wrote: > >> > >>> No, I'm talking about the behavior of causing small pockets of > >>> all-frozen pages which end up being smaller than SKIP_PAGES_THRESHOLD > >>> and are then scanned (even though they are already frozen). What I > >>> describe in that email I cited is that because we freeze > >>> opportunistically when we have or will emit an FPI, and bgwriter will > >>> write out blocks in clocksweep order, we end up with random pockets of > >>> pages getting frozen during/after a checkpoint. Then in the next > >>> vacuum, we end up scanning those all-frozen pages again because the > >>> ranges of frozen pages are smaller than SKIP_PAGES_THRESHOLD. This is > >>> mostly going to happen for an insert-only workload. I'm not saying > >>> freezing the pages is bad, I'm saying that causing these pockets of > >>> frozen pages leads to scanning all-frozen pages on future vacuums. > >>> > >> > >> Yeah, this interaction between the components is not great :-( But can > >> we think of a way to reduce the fragmentation? What would need to change? > > > > Well reducing SKIP_PAGES_THRESHOLD would help. > > How does SKIP_PAGES_THRESHOLD change the fragmentation? I think that's > the side that's affected by the fragmentation, but it's really due to > the eager freezing / bgwriter evictions, etc. If the threshold is set to > 1 (i.e. to always skip), that just lowers the impact, but the relation > is still as fragmented as before, no? Yep, exactly. It doesn't help with fragmentation. It just helps us not scan all-frozen pages. The question is whether or not the fragmentation on its own matters. I think it would be better if we didn't have it -- we could potentially do larger reads, for example, if we have one continuous block of pages that are not all-frozen (most likely when using the read stream API). > > And unfortunately we do > > not know if the skippable pages are all-frozen without extra > > visibilitymap_get_status() calls -- so we can't decide to avoid > > scanning ranges of skippable pages because they are frozen. > > > > I may be missing something, but doesn't find_next_unskippable_block() > already get those bits? In fact, it even checks VISIBILITYMAP_ALL_FROZEN > but only for aggressive vacuum. But even if that wasn't the case, isn't > checking the VM likely much cheaper than vacuuming the heap page? find_next_unskippable_block() has them, but then if we do decide to skip a range of pages, back in heap_vac_scan_next_block() where we decide whether or not to skip a range using SKIP_PAGES_THRESHOLD, we know that the pages in the range are all-visible (otherwise they wouldn't be skippable) but we no longer know which of them were all-frozen. > >> Maybe the freezing code could check how many of the nearby pages are > >> frozen, and consider that together with the FPI write? > > > > That's an interesting idea. We wouldn't have any guaranteed info > > because we only have a lock on the page we are considering freezing. > > But we could keep track of the length of a run of pages we are > > freezing and opportunistically freeze pages that don't require > > freezing if they follow one or more pages requiring freezing. > > I don't think we need a "guaranteed" information - a heuristics that's > correct most of the time (say, >90%?) ought to be good enough. 
> I mean, it has to be, because we'll never get a rule that's correct
> 100%. So even just looking at a batch of pages in VM should be enough,
> no?
>
> > But I don't know how much more this buys us than removing
> > SKIP_PAGES_THRESHOLD. Since it would "fix" the fragmentation, perhaps
> > it makes larger future vacuum reads possible. But I wonder how much
> > benefit it would be vs complexity.
> >
>
> I think that depends on which cost we're talking about. If we only talk
> about the efficiency of a single vacuum, then it probably does not help
> very much. I mean, if we assume the relation is already fragmented, then
> it seems to be cheaper to vacuum just the pages that need it (as if with
> SKIP_PAGES_THRESHOLD=1).
>
> But if we're talking about long-term benefits, in reducing the amount of
> freezing needed overall, maybe it'd be a win? I don't know.

Yea, it just depends on whether or not the pages we freeze for this
reason are likely to stay frozen. I think I misspoke in saying we want to
freeze pages next to pages requiring freezing. What we really want to do
is freeze pages next to pages that are being opportunistically frozen --
because those are the ones that are creating the fragmentation. But, then
where do you draw the line? You won't know if you are creating lots of
random holes until after you've skipped opportunistically freezing some
pages -- and by then it's too late.

- Melanie
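One hypothetical way to "draw the line" would be for the
opportunistic-freeze path to peek at the VM bits of a few neighboring
blocks and lean toward freezing when skipping would punch a small hole
into an otherwise-frozen range. A sketch only, with the window size and
the majority rule chosen arbitrarily:

    #include "postgres.h"
    #include "access/visibilitymap.h"
    #include "utils/relcache.h"

    /*
     * Return true if most of the blocks immediately surrounding blkno are
     * already all-frozen according to the VM, in which case freezing blkno
     * opportunistically avoids leaving an isolated unfrozen page behind.
     */
    static bool
    neighbours_mostly_frozen(Relation rel, BlockNumber blkno,
                             BlockNumber rel_pages, Buffer *vmbuf)
    {
        int         frozen = 0,
                    checked = 0;

        for (int delta = -2; delta <= 2; delta++)
        {
            BlockNumber nb;

            if (delta == 0)
                continue;
            if (delta < 0 && blkno < (BlockNumber) (-delta))
                continue;
            nb = blkno + delta;
            if (nb >= rel_pages)
                continue;

            checked++;
            if (visibilitymap_get_status(rel, nb, vmbuf) & VISIBILITYMAP_ALL_FROZEN)
                frozen++;
        }

        return checked > 0 && frozen * 2 > checked;     /* majority frozen */
    }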