Re: Eager page freeze criteria clarification - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Eager page freeze criteria clarification
Date
Msg-id CAH2-WznLpXJg-aoUjo9ewWgJVqWTzYSTFUT_BBRKHt0iSjxvrA@mail.gmail.com
In response to Re: Eager page freeze criteria clarification  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Even if you're willing to assume that vacuum_freeze_min_age isn't just
> > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age
> > is applied by VACUUM, at the point that it scans pages. If VACUUM were
> > infinitely fast, and new VACUUMs were launched constantly, then
> > vacuum_freeze_min_age (and this bucketing scheme) might make more
> > sense. But, you know, they're not. So whether or not VACUUM (with
> > Andres' algorithm) deems a page that it has frozen to have been
> > opportunistically frozen or not is greatly influenced by factors that
> > couldn't possibly be relevant.
>
> I'm not totally sure that I'm understanding what you're concerned
> about here, but I *think* that the issue you're worried about here is:
> if we have various rules that can cause freezing, let's say X Y and Z,
> and we adjust the aggressiveness of rule X based on the performance of
> rule Y, that would be stupid and might suck.

Your summary is pretty close. There are a couple of specific nuances
to it, though:

1. Anything that uses XID age or even LSN age necessarily depends on
when VACUUM shows up, which itself depends on many other random
things.

With small- to medium-sized tables that don't really grow, it's perhaps
reasonable to expect this not to matter. But with tables like the
TPC-C order/order lines tables, or even pgbench_history, each
VACUUM operation will reliably be significantly longer and more
expensive than the last one, forever (ignoring the influence of
aggressive mode, and assuming typical autovacuum settings). So VACUUMs
get bigger and less frequent as the table grows.
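
To put rough numbers on "bigger and less frequent", here is a minimal
sketch of the insert-driven autovacuum trigger for an append-only
table. It assumes the stock settings
(autovacuum_vacuum_insert_threshold = 1000,
autovacuum_vacuum_insert_scale_factor = 0.2), an accurate reltuples,
and an autovacuum that starts the moment the trigger condition is met.
The function names are made up for illustration; none of this is
PostgreSQL code.

```python
# Sketch only: models the insert-driven autovacuum trigger,
#   inserts since last VACUUM > insert_threshold + insert_scale_factor * reltuples
# for a table that only ever grows. Function names are hypothetical.

INSERT_THRESHOLD = 1000      # assumed: stock autovacuum_vacuum_insert_threshold
INSERT_SCALE_FACTOR = 0.2    # assumed: stock autovacuum_vacuum_insert_scale_factor


def inserts_until_next_vacuum(reltuples):
    """Rows that must be inserted before the next autovacuum is triggered."""
    return round(INSERT_THRESHOLD + INSERT_SCALE_FACTOR * reltuples)


def vacuum_gaps(initial_rows, rounds):
    """Return the number of inserted rows between successive autovacuums."""
    rows = initial_rows
    gaps = []
    for _ in range(rounds):
        gap = inserts_until_next_vacuum(rows)
        gaps.append(gap)
        rows += gap  # the table has grown by the time VACUUM shows up
    return gaps


if __name__ == "__main__":
    for n, gap in enumerate(vacuum_gaps(1_000_000, 10), start=1):
        print(f"VACUUM #{n}: triggered after {gap:,} newly inserted rows")
```

Under these assumptions each gap is about 20% larger than the previous
one, so at a steady insert rate the time between VACUUMs (and the
amount of new heap each one has to scan) grows geometrically.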

As the table continues to grow, at some point we reach a stage where
many XIDs encountered by VACUUM will be significantly older than
vacuum_freeze_min_age, while others will be significantly younger. And
so whether we apply the vacuum_freeze_min_age rule (or some other age
based rule) is increasingly a matter of random happenstance (i.e. is
more and more due to when VACUUM happens to show up), and has less and
less to do with what the workload signals we should do. This is a
moving target, but (if I'm not mistaken) under the scheme described by
Andres we're not even trying to compensate for that.
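
A back-of-the-envelope sketch of that effect, under deliberately crude
assumptions: every inserted row consumes exactly one XID (roughly the
pgbench_history pattern), inserts arrive at a uniform rate, and VACUUM
only looks at rows added since the previous VACUUM. The helper name is
hypothetical, and 50 million is the stock vacuum_freeze_min_age.

```python
# Sketch only: what share of the rows added since the last VACUUM is already
# older than vacuum_freeze_min_age when the next VACUUM arrives?
# Assumes one XID per inserted row and a uniform insert rate.

VACUUM_FREEZE_MIN_AGE = 50_000_000  # assumed: stock default


def fraction_older_than_cutoff(xids_since_last_vacuum,
                               freeze_min_age=VACUUM_FREEZE_MIN_AGE):
    """Fraction of newly inserted rows whose XID age exceeds freeze_min_age."""
    if xids_since_last_vacuum <= 0:
        return 0.0
    older = max(0, xids_since_last_vacuum - freeze_min_age)
    return older / xids_since_last_vacuum


if __name__ == "__main__":
    for gap in (10_000_000, 60_000_000, 100_000_000, 500_000_000):
        share = fraction_older_than_cutoff(gap)
        print(f"{gap:>12,} XIDs between VACUUMs -> {share:.0%} past the cutoff")
```

The workload is identical in every case; only the arrival time of
VACUUM differs, yet the share of rows that qualify under the age-based
rule moves from none to most of them.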

Separately, I have a practical concern:

2. It'll be very hard to independently track the effectiveness of
rules X, Y, and Z as a practical matter, because the application of
each rule quite naturally influences the application of every other
rule over time. They simply aren't independent things in any practical
sense.

Even if this weren't an issue, I can't think of a reasonable cost
model. Is it good or bad if "opportunistic freezing" results in
unfreezing 50% of the time? AFAICT that's an *extremely* complicated
question. You cannot just interpolate from the 0% case (definitely
good) and the 100% case (definitely bad) and expect to get a sensible
answer. You can't split the difference -- even if we allow ourselves
to ignore tricky value-judgement-type questions.

> Assuming that the previous sentence is a correct framing, let's take X
> to be "freezing based on the page LSN age" and Y to be "freezing based
> on vacuum_freeze_min_age". I think the problem scenario here would be
> if it turns out that, under some set of circumstances, Y freezes more
> aggressively than X. For example, suppose the user runs VACUUM FREEZE,
> effectively setting vacuum_freeze_min_age=0 for that operation. If the
> table is being modified at all, it's likely to suffer a bunch of
> unfreezing right afterward, which could cause us to decide to make
> future vacuums freeze less aggressively. That's not necessarily what
> we want, because evidently the user, at least at that moment in time,
> thought that previous freezing hadn't been aggressive enough. They
> might be surprised to find that flash-freezing the table inhibited
> future automatic freezing.

I didn't think of that one myself, but it's a great example.

> Or suppose that they just have a very high XID consumption rate
> compared to the rate of modifications to this particular table, such
> that criteria related to vacuum_freeze_min_age tend to be satisfied a
> lot, and thus vacuums tend to freeze a lot no matter what the page LSN
> age is. This scenario actually doesn't seem like a problem, though. In
> this case the freezing criterion based on page LSN age is already not
> getting used, so it doesn't really matter whether we tune it up or
> down or whatever.

It would have to be a smaller table, which I'm relatively unconcerned about.

> > Okay then. I guess it's more accurate to say that we'll have a strong
> > bias in the direction of freezing when an FPI won't result, though not
> > an infinitely strong bias. We'll at least have something that can be
> > thought of as an improved version of the FPI thing for 17, I think --
> > which is definitely significant progress.
>
> I do kind of wonder whether we're going to care about the FPI thing in
> the end. I don't mind if we do. But I wonder if it will prove
> necessary, or even desirable. Andres's algorithm requires a greater
> LSN age to trigger freezing when an FPI is required than when one
> isn't. But Melanie's test results seem to me to show that using a
> small LSN distance freezes too much on pgbench_accounts-type workloads
> and using a large one freezes too little on insert-only workloads. So
> I'm currently feeling a lot of skepticism about how useful it is to
> vary the LSN-distance threshold as a way of controlling the behavior.

I'm skeptical of varying the LSN distance, but I'm not skeptical of
the idea of caring about FPIs in general.

I wonder how much truly useful work VACUUM performed for
pgbench_accounts during Melanie's performance evaluation -- leaving
freezing aside. For the "too much freezing for pgbench_accounts" case,
where master performed better than the patch, would it have been
possible to do even better than that by simply turning off autovacuum?
Or at least increasing the scale factor that triggers autovacuuming?
(The answer will depend to some extent on heap fill factor.)
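
For a sense of scale on the "increase the scale factor" option, here
is a small sketch of the dead-tuple trigger for pgbench_accounts. It
assumes 100,000 rows per pgbench scale unit and the stock
autovacuum_vacuum_threshold of 50; the function name is hypothetical,
and the scale-factor values tried are arbitrary examples rather than
recommendations.

```python
# Sketch only: dead tuples needed before autovacuum processes pgbench_accounts,
#   dead tuples > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
# Function name and the example scale factors are illustrative assumptions.

ROWS_PER_PGBENCH_SCALE = 100_000  # pgbench_accounts rows per scale unit
VACUUM_THRESHOLD = 50             # assumed: stock autovacuum_vacuum_threshold


def dead_tuples_to_trigger(pgbench_scale, scale_factor):
    """Dead tuples that must accumulate before autovacuum is triggered."""
    reltuples = pgbench_scale * ROWS_PER_PGBENCH_SCALE
    return round(VACUUM_THRESHOLD + scale_factor * reltuples)


if __name__ == "__main__":
    for scale_factor in (0.2, 0.5, 1.0):
        needed = dead_tuples_to_trigger(100, scale_factor)
        print(f"scale_factor={scale_factor}: {needed:,} dead tuples")
```

Whether that many dead tuples ever accumulate in the heap depends on
how many updates are HOT and get pruned opportunistically, which is
where heap fill factor comes in.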

--
Peter Geoghegan


