Thread: vacuum_cost_page_miss default value and modern hardware
vacuum_cost_page_miss has a default value of 10, while vacuum_cost_page_dirty has a default value of 20. This has been the case since cost-based delays were introduced by commit f425b605f4e back in 2004. The obvious implication is that dirtying a page is on average only twice as expensive as a single shared_buffers miss (one that doesn't dirty the page). While that might have made sense back in 2004, when magnetic disks were the norm, it seems questionable now.

The trend these days is that the total number of dirty pages is the limiting factor for OLTP workloads. This is a broad trend among all disk-based RDBMSs with an ARIES-style design, and is more or less a result of long-term trends in main memory size scaling and flash storage. It's rare for OLTP workloads to be truly I/O bound, and yet the backpressure from page cleaning/checkpointing becomes a bottleneck. In short, writes are far more expensive than reads -- the *relative* cost of writes has increased significantly (our use of the OS FS cache makes this even worse).

I suspect that this trend will become even more important for Postgres in the coming years, but that's not what I want to talk about right now. I just want to talk about vacuum_cost_page_miss on this thread.

Simply decreasing vacuum_cost_page_dirty seems like a low risk way of making the VACUUM costing more useful within autovacuum workers. Halving vacuum_cost_page_dirty to 5 would be a good start, though I think that a value as low as 2 would be better. That would make it only 2x vacuum_cost_page_hit's default (i.e. 2x the cost of processing a page that is in shared_buffers but did not need to be dirtied), which seems sensible to me when considered in the context in which the value is actually applied (and not some abstract theoretical context).
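To make the cost arithmetic concrete, here is a simplified Python model of the cost-based delay accounting (loosely modeled on vacuum_delay_point() in the server -- an illustrative sketch, not the actual C implementation). It assumes the documented defaults of vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 2ms:

```python
# A simplified model of PostgreSQL's cost-based vacuum delay accounting.
# VACUUM accumulates cost per page processed and sleeps for cost_delay
# each time the accumulated cost reaches cost_limit.

COST_LIMIT = 200      # vacuum_cost_limit default
COST_DELAY_MS = 2.0   # autovacuum_vacuum_cost_delay default since v12
BLCKSZ = 8192         # default page size in bytes

def pages_between_sleeps(page_cost):
    """Pages of one kind processed per cost_delay sleep."""
    return COST_LIMIT / page_cost

def throughput_ceiling_mb_s(page_cost):
    """Upper bound on VACUUM throughput if every page has this cost."""
    bytes_per_sec = pages_between_sleeps(page_cost) * BLCKSZ / (COST_DELAY_MS / 1000.0)
    return bytes_per_sec / (1024 * 1024)

# Defaults: vacuum_cost_page_hit=1, vacuum_cost_page_miss=10,
# vacuum_cost_page_dirty=20
for name, cost in [("hit", 1), ("miss", 10), ("dirty", 20)]:
    print(f"{name:5s} cost={cost:2d}: {pages_between_sleeps(cost):5.0f} "
          f"pages/sleep, ~{throughput_ceiling_mb_s(cost):.0f} MB/s ceiling")
```

The model shows why the relative costs matter so much: with the defaults, an all-miss scan is throttled to a fifth of the rate of an all-hit scan, and an all-dirty scan to a tenth.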
There are a few reasons why this seems like a good idea now:

* Throttling/delaying VACUUM is only useful as a way of smoothing the impact on production queries, which is an important goal, but currently we don't discriminate very well against the cost that we really should keep under control (dirtying new pages within VACUUM).

This is due to the aforementioned trends, the use of a strategy ring buffer by VACUUM, the fact that indexes are usually vacuumed in sequential physical order these days, and many other things that were not a factor in 2004.

* There is a real downside to throttling VACUUM unnecessarily, and the effects are *non-linear*. On a large table, the oldest xmin cutoff may become very old by the time we're only (say) half way through the initial table scan in lazy_scan_heap(). There may be relatively little work to do because most of the dead tuples won't be before the oldest xmin cutoff by that time (VACUUM just cannot keep up).

Excessive throttling for simple page misses may actually *increase* the amount of I/O that VACUUM has to do over time -- we should try to get quickly to the pages that actually need to be vacuumed, which are probably already dirty anyway (and thus are probably going to add little to the cost delay limit in practice). Everything is connected to everything else.

* vacuum_cost_page_miss is very much not like random_page_cost, and the similar names confuse the issue -- this is not an optimization problem. Thinking about VACUUM as unrelated to the workload itself is obviously wrong. Changing the default is also an opportunity to clear that up.

Even if I am wrong to suggest that a miss within VACUUM should only be thought of as 2x as expensive as a hit in some *general* sense, I am concerned about *specific* consequences. There is no question about picking the best access path here -- we're still going to have to access the same blocks in the same way sooner or later.
In general I think that we should move in the direction of more frequent, cheaper VACUUM operations [1], though we've already done a lot of work in that direction (e.g. the freeze map work).

* Some impact from VACUUM on query performance may in fact be a good thing.

Deferring the cost of vacuuming can only make sense if we'll eventually be able to catch up because we're experiencing a surge in demand, which seems kind of optimistic -- it seems more likely that the GC debt will just grow and grow. Why should the DBA not expect to experience some kind of impact, which could be viewed as a natural kind of backpressure? The most important thing is consistent performance.

* Other recent work such as the vacuum_cleanup_index_scale_factor patch has increased the relative cost of index vacuuming in some important cases: we don't have a visibility/freeze map for indexes, but index vacuuming that doesn't dirty any pages and has a TID kill list that's concentrated at the end of the heap/table is pretty cheap (the TID binary search is cache efficient/cheap). This change will help these workloads by better reflecting the way in which index vacuuming can be cheap for append-only tables with a small amount of garbage for recently inserted tuples that also got updated/deleted.

* Lowering vacuum_cost_page_miss's default (as opposed to changing something else) is a simple and less disruptive way of achieving these goals.

This approach seems unlikely to break existing VACUUM-related custom settings from current versions that get reused on upgrade. I expect little impact on small installations.

[1] https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com

--
Peter Geoghegan
On Mon, Dec 28, 2020 at 5:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Simply decreasing vacuum_cost_page_dirty seems like a low risk way of making the VACUUM costing more useful within autovacuum workers. Halving vacuum_cost_page_dirty to 5 would be a good start, though I think that a value as low as 2 would be better. That would make it only 2x vacuum_cost_page_hit's default (i.e. 2x the cost of processing a page that is in shared_buffers but did not need to be dirtied), which seems sensible to me when considered in the context in which the value is actually applied (and not some abstract theoretical context).

Perhaps you meant to decrease vacuum_cost_page_miss instead of vacuum_cost_page_dirty?

> There are a few reasons why this seems like a good idea now:
>
> * Throttling/delaying VACUUM is only useful as a way of smoothing the impact on production queries, which is an important goal, but currently we don't discriminate against the cost that we really should keep under control (dirtying new pages within VACUUM) very well.
>
> This is due to the aforementioned trends, the use of a strategy ring buffer by VACUUM, the fact that indexes are usually vacuumed in sequential physical order these days, and many other things that were not a factor in 2004.
>
> * There is a real downside to throttling VACUUM unnecessarily, and the effects are *non-linear*. On a large table, the oldest xmin cutoff may become very old by the time we're only (say) half way through the initial table scan in lazy_scan_heap(). There may be relatively little work to do because most of the dead tuples won't be before the oldest xmin cutoff by that time (VACUUM just cannot keep up).
> Excessive throttling for simple page misses may actually *increase* the amount of I/O that VACUUM has to do over time -- we should try to get to the pages that actually need to be vacuumed quickly, which are probably already dirty anyway (and thus are probably going to add little to the cost delay limit in practice). Everything is connected to everything else.
>
> * vacuum_cost_page_miss is very much not like random_page_cost, and the similar names confuse the issue -- this is not an optimization problem. Thinking about VACUUM as unrelated to the workload itself is obviously wrong. Changing the default is also an opportunity to clear that up.
>
> Even if I am wrong to suggest that a miss within VACUUM should only be thought of as 2x as expensive as a hit in some *general* sense, I am concerned about *specific* consequences. There is no question about picking the best access path here -- we're still going to have to access the same blocks in the same way sooner or later. In general I think that we should move in the direction of more frequent, cheaper VACUUM operations [1], though we've already done a lot of work in that direction (e.g. freeze map work).

I agree with that direction.

> * Some impact from VACUUM on query performance may in fact be a good thing.
>
> Deferring the cost of vacuuming can only make sense if we'll eventually be able to catch up because we're experiencing a surge in demand, which seems kind of optimistic -- it seems more likely that the GC debt will just grow and grow. Why should the DBA not expect to experience some kind of impact, which could be viewed as a natural kind of backpressure? The most important thing is consistent performance.
> * Other recent work such as the vacuum_cleanup_index_scale_factor patch has increased the relative cost of index vacuuming in some important cases: we don't have a visibility/freeze map for indexes, but index vacuuming that doesn't dirty any pages and has a TID kill list that's concentrated at the end of the heap/table is pretty cheap (the TID binary search is cache efficient/cheap). This change will help these workloads by better reflecting the way in which index vacuuming can be cheap for append-only tables with a small amount of garbage for recently inserted tuples that also got updated/deleted.
>
> * Lowering vacuum_cost_page_miss's default (as opposed to changing something else) is a simple and less disruptive way of achieving these goals.
>
> This approach seems unlikely to break existing VACUUM-related custom settings from current versions that get reused on upgrade. I expect little impact on small installations.

I recall the discussion about decreasing the default value of autovacuum_cost_delay from 20ms to 2ms for PostgreSQL 12. I re-read that discussion, but there was no discussion of changing the hit/miss/dirty values.

Whereas the change we made to autovacuum_cost_delay affects every installation, lowering vacuum_cost_page_miss would have a different impact depending on the workload, database size, and so on. For example, the user could see a larger I/O spike when the database doesn't fit in the server's RAM and we're vacuuming cold tables/indexes -- for example, during an anti-wraparound vacuum.

Lowering vacuum_cost_page_miss basically makes sense to me. But I'm a bit concerned that cheaper hardware with a small amount of RAM would be more likely to be affected by this change. Since the database doesn't fit in the server's RAM, pages are unlikely to be in either shared buffers or the OS page cache.
Since PostgreSQL's default values seem conservative to me (which is okay), I think there might be an argument that this change could lead to trouble in the kind of cheaper environment that PostgreSQL's default values are taking care of.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Wed, Jan 6, 2021 at 5:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Perhaps you meant to decrease vacuum_cost_page_miss instead of vacuum_cost_page_dirty?

You're right. Evidently I didn't write this email very carefully. Sorry about that.

To say it again: I think that a miss (without dirtying the page) should be cheaper than dirtying a page. This thread began because I wanted to discuss the relative cost of different kinds of I/O operations within VACUUM, without necessarily discussing the absolute costs/time delays.

--
Peter Geoghegan
On Wed, Jan 6, 2021 at 5:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...]
>
> Lowering vacuum_cost_page_miss basically makes sense to me. But I'm a bit concerned that cheaper hardware with a small amount of RAM would be more likely to be affected by this change. Since the database doesn't fit in the server's RAM, pages are unlikely to be in either shared buffers or the OS page cache.
> Since PostgreSQL's default values seem conservative to me (which is okay), I think there might be an argument that this change could lead to trouble in the kind of cheaper environment that PostgreSQL's default values are taking care of.

More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It has had 10 as its default since 2004 (just like vacuum_cost_page_dirty, whose default has also not been changed since the start). These defaults were decided at a time when nbtree VACUUM could do lots of random I/O, there was no visibility map, etc. So this refresh is not just about hardware.

A useful consequence of halving vacuum_cost_page_miss's default setting is that we still get a "built-in delay" when an I/O miss is naturally slow (which is really hard to predict). On the other hand, dirtying lots of pages within VACUUM is dangerous -- VACUUM cannot be allowed to create "I/O debt" very quickly, which is quite possible due to the lack of any "natural backpressure".

--
Peter Geoghegan
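The "built-in delay" point can be illustrated with some assumed storage latencies (the latency figures below are hypothetical, not from the thread): when a miss is genuinely slow, the real read latency dwarfs the amortized throttle sleep regardless of the GUC, so lowering the miss cost mostly affects the case where reads are cheap anyway.

```python
# Sketch: effective time VACUUM spends per missed page, modeled as the
# real read latency plus the amortized share of the cost-based sleep.
# Assumes vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay =
# 2ms (the documented defaults); latencies are illustrative only.

def per_miss_ms(read_latency_ms, cost_miss, cost_limit=200, cost_delay_ms=2.0):
    throttle_ms = cost_delay_ms * cost_miss / cost_limit  # amortized sleep per miss
    return read_latency_ms + throttle_ms

# A fast sequential read served from the OS cache vs. a slow random read:
for latency in (0.05, 5.0):
    old = per_miss_ms(latency, cost_miss=10)
    new = per_miss_ms(latency, cost_miss=5)
    print(f"{latency}ms read: {old:.3f}ms/page (miss=10) vs {new:.3f}ms/page (miss=5)")
```

With a slow read the storage itself dominates either way, which is the sense in which slow I/O "naturally provides" throttling.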
On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It has had 10 as its default since 2004 (just like vacuum_cost_page_dirty, whose default has also not been changed since the start). These defaults were decided at a time when nbtree VACUUM could do lots of random I/O, there was no visibility map, etc. So this refresh is not just about hardware.

Attached patch lowers vacuum_cost_page_miss to 3. I think that this change in the default is both likely to be helpful in medium to large installations, and unlikely to cause harm in small installations. If I/O for reads made by VACUUM is naturally very slow (even in the common case where it's entirely sequential), then that will naturally provide additional throttling.

I will add this to the final CF for Postgres 14.

Thanks

--
Peter Geoghegan
On Thu, Jan 14, 2021 at 1:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this change in the default is both likely to be helpful in medium to large installations, and unlikely to cause harm in small installations. If I/O for reads made by VACUUM is naturally very slow (even in the common case where it's entirely sequential), then that will naturally provide additional throttling.

+1 for this in principle.

Do you have any actual metrics behind specifically choosing the value 3? Or is that off a gut feeling?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Thu, Jan 14, 2021 at 12:29 PM Magnus Hagander <magnus@hagander.net> wrote:
> +1 for this in principle.

I'm not opposed to this change, and I agree that the relative expense of dirtying a page is higher than what the current defaults suggest. So I also think Peter is going in the right general direction, though like you I am not sure about the specifics.

In practice, most users get hosed by not vacuuming aggressively enough, rather than by vacuuming too aggressively. For instance, suppose you have a table and workload such that the table needs to be vacuumed once per hour to maintain good performance. As you make the table bigger and bigger, you will eventually reach a size where the configured cost limits aren't high enough to permit this to happen. The system has no option to disregard the configured limit, even for an emergency autovacuum. Eventually the user is forced into an outage, either by the table becoming so bloated that VACUUM FULL is required, or by running out of XIDs.

It seems bad that we ship a set of default settings that are guaranteed to hose any database with a reasonable number of updates once the database size exceeds some limit. The fact that we decreased autovacuum_cost_delay by 10x increased the limit by 10x, which is good, but the problem remains. I don't know exactly how to do better, and any proposal in that area would be much more complicated than what Peter is proposing here, but it's something to think about.

There are also users I've seen get hosed by vacuuming too aggressively. I have seen this happen in two ways. One is too much dirty data. The other is too much read I/O, pushing hot data out of the cache, leading to a storm of random I/O later when the foreground workload needs to get that stuff back, basically killing the system, sometimes for hours, while it tries to get back the stuff it lost.
That might seem like an argument against further raising the possible I/O rate, which would be the effect of the change Peter is proposing, but that's not really my position. I think the bigger problem with all this is that it's too hard to configure; almost nobody can work out what a given set of configuration parameters actually means in MB/s or GB/hour. In the past I've proposed that maybe we should redesign this whole system to work in those kinds of units, which people actually understand, but I don't know if that's the right idea.

Still another approach would be to try to reduce the degree to which the cache gets trashed, or to make it have a less harmful effect on future performance by reading things back in more efficiently. I don't really know.

TL;DR: This change is fine with me, but this whole system has much deeper issues.

--
Robert Haas
EDB: http://www.enterprisedb.com
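The MB/s framing can be made concrete with a small back-of-the-envelope helper. This is a hypothetical calculator based only on the documented accounting (VACUUM sleeps for cost_delay once the accumulated cost reaches cost_limit), not anything proposed in the thread, and it ignores the fact that real throughput is also bounded by the storage itself:

```python
# Hypothetical calculator turning the vacuum cost GUCs into an
# approximate throughput ceiling for a given mix of page outcomes.

BLCKSZ = 8192  # default page size in bytes

def vacuum_mb_per_s(hit_frac, miss_frac, dirty_frac,
                    cost_hit=1, cost_miss=10, cost_dirty=20,
                    cost_limit=200, cost_delay_ms=2.0):
    # Average cost charged per page for this mix of outcomes.
    avg_cost = hit_frac * cost_hit + miss_frac * cost_miss + dirty_frac * cost_dirty
    # One cost_delay sleep per cost_limit of accumulated cost.
    pages_per_sleep = cost_limit / avg_cost
    return pages_per_sleep * BLCKSZ / (cost_delay_ms / 1000.0) / (1024 * 1024)

# A cold anti-wraparound scan (mostly misses, little dirtying), before
# and after lowering vacuum_cost_page_miss from 10 to 3:
before = vacuum_mb_per_s(0.1, 0.8, 0.1)
after = vacuum_mb_per_s(0.1, 0.8, 0.1, cost_miss=3)
print(f"before: {before:.0f} MB/s ceiling, after: {after:.0f} MB/s ceiling")
```

With the page mix fractions chosen here (an assumption for illustration), the change roughly doubles the throttling ceiling for a read-heavy scan while leaving a dirty-heavy scan's ceiling essentially unchanged.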
On Thu, Jan 14, 2021 at 9:29 AM Magnus Hagander <magnus@hagander.net> wrote:
> Do you have any actual metrics behind specifically choosing the value 3? Or is that off a gut feeling?

I have no metrics, exactly, but I'm sure that the trend I mentioned about page cleaning/dirtying being the bottleneck more and more these days is true. This trend seems very apparent to everyone, so I am sure that I basically have the right idea here. If anything, I'm a little concerned that the default should actually be lowered to 2.

With that said, I don't actually accept what seems to be the original premise of these GUCs, so I am not interested in using that to justify changing the vacuum_cost_page_miss default. The premise seems to be: VACUUM's behavior is determined by treating it as an optimization problem, so all you as the DBA need to do is characterize the cost of each kind of elementary operation using the GUCs -- the dynamic algorithm will do the rest. What algorithm might that be, though? This is not the optimizer, and there is no scope to come up with a cheaper plan for VACUUM. Why not throttle longer running queries instead, or as well?

More on the first principles of the costing stuff in a bit, when I respond to Robert...

--
Peter Geoghegan
On Thu, Jan 14, 2021 at 10:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> There are also users I've seen get hosed by vacuuming too aggressively. I have seen this happen in two ways. One is too much dirty data. The other is too much read I/O, pushing hot data out of the cache, leading to a storm of random I/O later when the foreground workload needs to get that stuff back, basically killing the system, sometimes for hours, while it tries to get back the stuff it lost.

> TL;DR: This change is fine with me, but this whole system has much deeper issues.

It seems like there is a good chance that the customers of yours who complained about the read I/O (which was not accompanied by dirtying) were really bothered by all of their indexes being read by VACUUM. The freeze map is probably quite effective as far as that goes, but the burden of index vacuuming is what they tend to notice. This is perhaps made worse by the sudden, infrequent nature of the index vacuuming against a big append-only or append-mostly table.

I imagine that the problem here is that we're doing index vacuuming when we shouldn't be -- these customers basically had it right. Their intuition that this is unnecessary is in fact the right one. How can it be okay to vacuum an index when the table only has 10 dead tuples (just to freeze some pages at the end of the table)? That's ridiculous. And it has nothing to do with these settings. (Even if I'm wrong to suggest that that was what it was, I think that the details and nuance of what actually happened are likely to be important.)

We should be avoiding index vacuuming in many more cases. If there are only a tiny number of garbage index tuples, then we really shouldn't bother (once again, please feel free to weigh in on Masahiko's patch over on the "New IndexAM API controlling index vacuum strategies" thread -- that's very interesting work IMV). Bottom-up index deletion creates far more opportunities for this kind of stuff to naturally occur.
It will now do ~99.9% of garbage tuple cleanup in indexes that naturally use it all the time. We can expect that intuitions held by DBAs with experience of other RDBMSs will start to have more predictive power when they think about Postgres and VACUUM, which seems like a very good thing (and something that we can and should continue to build on). Roughly speaking, we ought to look for more and more ways to make the physical representation of the data closer to the logical contents of the database (that's what these DBAs start with, that's where the intuitions seem to come from, which actually makes perfect sense).

Now back to the main topic, the GUC's default value. I believe that your experiences here (the experiences in both directions) are representative -- I think I've heard of all that myself. Like you, I think that insufficient vacuuming is much more common than excessive vacuuming. You do still have some cases where an excessive amount of I/O from VACUUM (without any dirtying) is the issue (or at least *seems* to be the issue, per my aside). I think that I have a high level theory that is consistent with what you say and may suggest a better direction for us, but it's tricky. I'll try to resolve the apparent contradictions in my own arguments as I go (I'm a little burnt out at the moment, so please indulge me).

I think that The Real Problem is *not* that it's too hard to tune this stuff as a practical matter, exactly. The entire premise of these costing parameters is that the DBA can and should make a trade-off between query response time/workload throughput and vacuuming, as if these two things were separate constituents that are simply unrelated. That sounds completely wrong to me. It sounds so wrong that I can't go down that mental path for more than 5 seconds without giving up on it. Are we really expected to believe that in general VACUUM probably has all the time in the world, and so should proceed at a leisurely pace?
It's almost as if the original designer imagined that the IT department should be made to wait on the result of one of those "VACUUM bigtable;" reports that they seem to be so keen on (other queries are running that deliver real business value, after all). I'm only half-joking here -- anybody reading this should now take a moment to actively consider just how little sense any of this makes. It's so completely and implausibly wrong that it seems likely to actually be slightly right, if only by mistake.

There seems to be one important way in which the cost parameter design is accidentally useful: the page dirtying stuff probably works reasonably well. It really does make sense to throttle VACUUM in response to dirtying pages, optimistically assuming that VACUUM will eventually catch up. That part makes sense precisely because it treats VACUUM as a thing that is directly tied to the workload (an accidental happy exception to the bogus general rule for the costing stuff).

Of course, this optimism does not come from any belief that these situations will always work themselves out naturally (that's definitely not something we should *truly* assume). There is a deeper way in which it works as a heuristic, though: it covers all the possibilities at once (kind of), without truly predicting what will happen with the workload. We need to be adaptive whenever we think about GC/vacuuming, I believe -- worrying about the general or common case (for any value of common) is often a waste of time IME.

I'll now go through two different scenarios in which this same assumption works in our favor. I'll start with the simple case here:

It's possible that VACUUM is throttled consistently by dirtying pages, slowing it down considerably. The situation may naturally resolve itself in an uncomplicated way (e.g.
the bursty application calms down following a spike), in which case the original foundational design assumption I described works out in the obvious way. Easy.

Now for the more complicated case: VACUUM is throttled consistently by dirtying pages (if anything), but the database load does not and will not go down. We *still* end up doing approximately the right thing here (again, this is assuming you reduce vacuum_cost_page_miss to something like 3). The fact that we're dirtying new pages (when throttled) suggests that the garbage in these pages really is quite old (it's "floating garbage" in GC theory terms). So we don't want to slow everything down by dirtying ever-more pages. At the same time we want to get through already-dirty pages as quickly as possible, and also quickly move past pages not in shared buffers and not dirtied (those are relatively cheap on modern hardware). Opportunistic heap pruning eventually runs enough that VACUUM won't have to dirty most heap pages in a big table (in its first pass).

So dirty pages are debt that VACUUM can easily create, whereas buffer misses are paid directly by VACUUM. It is its own backpressure, for the most part. Making the costing stuff highly sensitive to dirtying pages (but not sensitive to much else) works out because it either avoids making a bad situation worse, or has no real additional downside when the system is completely overwhelmed (i.e. bottlenecked on cleaning dirty pages).

If it's just impossible for buffer cleaning to really keep up, then things will naturally slow down. This might even be seen as a natural thing -- workloads must live within their means. We should do our best to make it less likely that things slow down, by improving various important subsystems, whatever they might be (e.g. the io_uring stuff from Andres). But ultimately there is no way to reliably avoid having the system get overwhelmed by dirty pages.
The fact that such an outcome is always possible is no reason to double down on dubious top-down interventions from VACUUM. You cannot get out of debt by taking out another loan to pay off your existing creditors. -- Peter Geoghegan
On Thu, Jan 14, 2021 at 9:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It
> > has had 10 as its default since 2004 (just like vacuum_cost_page_dirty,
> > whose default has also not been changed since the start). These defaults
> > were decided at a time when nbtree VACUUM could do lots of random I/O,
> > there was no visibility map, etc. So this refresh is not just about
> > hardware.
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this
> change in the default is both likely to be helpful in medium to large
> installations, and unlikely to cause harm in small installations. If
> I/O for reads made by VACUUM is naturally very slow (even in the
> common case where it's entirely sequential), then that will naturally
> provide additional throttling.

+1 for this change. Lowering to 2 also looks good to me.

Regards,
-- Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Thu, Jan 14, 2021 at 8:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> +1 for this change. Lowering to 2 also looks good to me.

I'm going to go ahead with committing my patch to lower the default next week. If anybody has any objections to that plan, please speak up.

It doesn't really need to be said again, but just to be clear: I share the concerns held by Magnus and Robert. It's certainly true that this change alone is no fix for the general problem I described. At the same time, I believe that the patch makes incremental progress in the right direction, without much risk.

-- Peter Geoghegan
On Thu, Jan 21, 2021 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm going to go ahead with committing my patch to lower the default
> next week. If anybody has any objections to that plan, please speak
> up.

Just pushed a commit that reduced the default for vacuum_cost_page_miss to 2.

One more thing on this: I noticed that the docs for these settings say "There are many situations where it is not important that maintenance commands like VACUUM and ANALYZE finish quickly". That now seems a little dubious to me -- I wonder if we should refresh it. There are not that many problems that can be solved by making VACUUM slower. This is due to the visibility map, and to a lesser degree certain key improvements in index AMs. The text that I quoted was written in 2005, a time when the delay stuff was still very new, and even the earliest visibility map design was still a few years away.

Thanks
-- Peter Geoghegan
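[Editor's note: for reference, the cost-delay settings discussed in this thread, with their stock values after this commit; this is an illustrative postgresql.conf fragment and the per-line comments are editorial.]

```ini
vacuum_cost_page_hit = 1            # page found in shared_buffers
vacuum_cost_page_miss = 2           # was 10 from 2004 until this commit
vacuum_cost_page_dirty = 20         # unchanged; dirtying remains the dominant cost
vacuum_cost_limit = 200             # nap once accumulated cost reaches this
autovacuum_vacuum_cost_delay = 2ms  # length of each nap in autovacuum workers
```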
On Thu, Jan 14, 2021 at 8:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> So dirty pages are debt that VACUUM can easily create, whereas buffer
> misses are paid directly by VACUUM. It is its own backpressure, for
> the most part. Making the costing stuff highly sensitive to dirtying
> pages (but not sensitive to much else) works out because it either
> avoids making a bad situation worse, or has no real additional
> downside when the system is completely overwhelmed (i.e. bottlenecked
> on cleaning dirty pages).

This isn't really true. The cost of a buffer miss is not limited to the cost of reading the replacement buffer, a cost which is paid by VACUUM. It is also very often the cost of rereading the evicted buffer, which VACUUM does nothing about. Customers get hosed by VACUUM reading a lot of rarely-used data overnight and evicting all of the actually-hot data from cache. Then in the morning, when the workload picks up, the system starts trying to pull the stuff they actually need into memory one block at a time. Such a customer can go from a 99% hit rate on Monday morning to, say, 50% on Tuesday morning, which results in a fifty-fold increase in I/O, all of which is random I/O. The system basically grinds to a halt for hours.

It is fair to argue that perhaps such customers should invest in more and better hardware. In some cases, a customer who can fit 1% of their database in cache is relying on a 99% cache hit ratio, which is precarious at best. But they can sometimes get away with it until a large batch job like VACUUM gets involved.

-- Robert Haas
EDB: http://www.enterprisedb.com
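[Editor's note: the fifty-fold figure above follows from the fact that physical I/O scales with the miss rate (1 - hit rate), not the hit rate itself. A back-of-the-envelope check, with the read count made up purely for illustration:]

```python
# Physical reads are proportional to the miss rate: dropping from a 99%
# to a 50% buffer hit rate turns 1% misses into 50% misses -- a 50x jump.
reads = 1_000_000          # logical buffer reads in a morning (illustrative)
io_monday = reads // 100   # 99% hit rate -> 1% miss  ->  10,000 physical reads
io_tuesday = reads // 2    # 50% hit rate -> 50% miss -> 500,000 physical reads
print(io_tuesday / io_monday)  # 50.0
```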
On Thu, Jan 28, 2021 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 14, 2021 at 8:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > So dirty pages are debt that VACUUM can easily create, whereas buffer
> > misses are paid directly by VACUUM. It is its own backpressure, for
> > the most part. Making the costing stuff highly sensitive to dirtying
> > pages (but not sensitive to much else) works out because it either
> > avoids making a bad situation worse, or has no real additional
> > downside when the system is completely overwhelmed (i.e. bottlenecked
> > on cleaning dirty pages).
>
> This isn't really true. The cost of a buffer miss is not limited to
> the cost of reading the replacement buffer, a cost which is paid by
> VACUUM. It is also very often the cost of rereading the evicted
> buffer, which VACUUM does nothing about. Customers get hosed by VACUUM
> reading a lot of rarely-used data overnight and evicting all of the
> actually-hot data from cache.

Well, I did say "for the most part". In any case there is not much reason to think that throttling VACUUM on shared_buffers page misses can make very much difference in this scenario.

> It is fair to argue that perhaps such customers should invest in more
> and better hardware. In some cases, a customer who can fit 1% of their
> database in cache is relying on a 99% cache hit ratio, which is
> precarious at best. But, they can sometimes get away with it until a
> large batch job like VACUUM gets involved.

Actually, my first observation here is that VACUUM probably shouldn't do this at all. In other words, I agree with what I suspect your customer's intuition was, in a rather straightforward way: VACUUM really shouldn't be reading several large indexes in full when they have barely been modified in months or years -- that's the real problem. It ought to be possible to make big improvements in that area without changing the fundamental invariants. I am once again referring to the pending work on VACUUM from Masahiko.

-- Peter Geoghegan