Re: Eager page freeze criteria clarification - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Eager page freeze criteria clarification
Msg-id: CAH2-Wzme6k4f3U7dk7DU0OR9H3gsSRtDF-BfQijSF9tXWPup5w@mail.gmail.com
In response to: Re: Eager page freeze criteria clarification (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Mon, Sep 25, 2023 at 11:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > The reason I was thinking of using the "lsn at the end of the last vacuum", is
> > that it seems to be more adaptive to the frequency of vacuuming.
>
> Yes, but I think it's *too* adaptive. The frequency of vacuuming can
> plausibly be multiple times per minute or not even annually. That's
> too big a range of variation.

+1. The risk of VACUUM chasing its own tail seems very real. We want
VACUUM to be adaptive to the workload, not adaptive to itself.

> Yeah, I don't know if that's exactly the right idea, but I think it's
> in the direction that I was thinking about. I'd even be happy with
> 100% of the time-between-recent checkpoints, maybe even 200% of
> time-between-recent checkpoints. But I think there probably should be
> some threshold beyond which we say "look, this doesn't look like it
> gets touched that much, let's just freeze it so we don't have to come
> back to it again later."

The sole justification for any strategy that freezes lazily is that it
can avoid freezing that later turns out to have been unnecessary --
that's it. So I find it more natural to think of freezing as the
default action, and *not freezing* as the thing that requires
justification. Thinking about it "backwards" like that just seems
simpler to me. There is only one possible reason not to freeze, but
several reasons to freeze.
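
Just to make the shape of that concrete, here's a rough standalone
sketch (made-up names and types, not the actual VACUUM code; it
measures page age as WAL distance relative to recent checkpoint
spacing, though the same shape works with wall-clock time, and the
200% is just your number plugged in):

#include <stdbool.h>
#include <stdint.h>

/*
 * Toy decision function, not real VACUUM code: freezing is the
 * default action.  The only thing that can talk us out of it is
 * evidence that the page was modified very recently, measured as
 * WAL distance relative to recent checkpoint spacing.
 */
bool
should_freeze_page(uint64_t page_last_modified_lsn,
                   uint64_t current_insert_lsn,
                   uint64_t recent_checkpoint_lsn_distance)
{
    uint64_t page_age = current_insert_lsn - page_last_modified_lsn;

    /* e.g. 200% of the recent checkpoint-to-checkpoint WAL distance */
    uint64_t threshold = 2 * recent_checkpoint_lsn_distance;

    /*
     * Only a recently modified page justifies *not* freezing;
     * anything older looks like it "doesn't get touched that much".
     */
    return page_age >= threshold;
}

The important part isn't the exact multiplier. It's that "don't
freeze" is the branch that has to be justified.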

> I think part of the calculus here should probably be that when the
> freeze threshold is long, the potential gains from making it even
> longer are not that much. If I change the freeze threshold on a table
> from 1 minute to 1 hour, I can potentially save uselessly freezing
> that page 59 times per hour, every hour, forever, if the page always
> gets modified right after I touch it. If I change the freeze threshold
> on a table from 1 hour to 1 day, I can only save 23 unnecessary
> freezes per day.

I totally agree with you on this point. It seems related to my point
about "freezing being the conceptual default action" in VACUUM.
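
Spelling out your arithmetic as a toy worst-case calculation (assume
the page is modified immediately after every freeze, so every single
freeze is wasted):

#include <stdio.h>

int
main(void)
{
    const double seconds_per_day = 24 * 60 * 60;
    /* hypothetical freeze thresholds: 1 minute, 1 hour, 1 day */
    const double thresholds[] = {60, 3600, 86400};

    for (int i = 0; i < 3; i++)
    {
        /* at most one wasted freeze per threshold interval */
        double wasted = seconds_per_day / thresholds[i];

        printf("threshold %6.0fs: at most %4.0f wasted freezes/day\n",
               thresholds[i], wasted);
    }
    return 0;
}

Going from the first threshold to the second saves up to 1416 wasted
freezes per day (your 59/hour), while going from the second to the
third only saves 23 -- the absolute savings collapse even though the
relative overhead is unchanged.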

Generally speaking, over-freezing is a problem when we reach the same
wrong conclusion (freeze the page) about the same relatively few pages
over and over -- senselessly repeating those mistakes really adds up
when you're vacuuming the same table very frequently. On the other
hand, under-freezing is typically a problem when we reach the same
wrong conclusion (don't freeze the page) about lots of pages only once
in a very long while. I strongly suspect that there is very little
gray area between the two, across the full spectrum of application
characteristics.

Most individual pages have very little chance of being modified in the
short to medium term. In a perfect world, with a perfect algorithm,
we'd almost certainly be freezing most pages at the earliest
opportunity. It is nevertheless also true that a freezing policy that
is only somewhat more aggressive than this ideal oracle algorithm will
freeze far too aggressively (by at least some measures). There isn't
much of a paradox to resolve here: it all comes down to the cadence of
vacuuming, and to the cadence of the rows subject to constant churn.

As you point out, the "same policy" can produce dramatically different
outcomes once you consider what its consequences are over time, as
applied by VACUUM under a variety of different workload conditions. So
any freezing policy must be designed
with due consideration for those sorts of things. If VACUUM doesn't
freeze the page now, then when will it freeze it? For most individual
pages, that time will come (again, pages that benefit from lazy
vacuuming are the exception rather than the rule). Right now, VACUUM
almost behaves as if it thought "that's not my problem, it's a problem
for future me!".

Trying to differentiate between pages that we must not over-freeze and
pages that we must not under-freeze seems important. Generational
garbage collection (as used by managed VM runtimes) does something
that seems a little like this. It's based on the empirical observation
that "most objects die young". The precise definition of "young"
varies significantly, but that turns out to be less of a problem than
you might think -- it can be derived through feedback cycles. If you
look at memory lifetimes on a logarithmic scale, very different sorts
of applications tend to look like they have remarkably similar memory
allocation characteristics.

> Percentage-wise, the overhead of being wrong is the
> same in both cases: I can have as many extra freeze operations as I
> have page modifications, if I pick the worst possible times to freeze
> in every case. But in absolute terms, the savings in the second
> scenario are a lot less.

Very true.

I'm surprised that there hasn't been any discussion of the absolute
amount of system-wide freeze debt on this thread. If 90% of the pages
in the entire database are frozen, it'll generally be okay if we make
the wrong call by freezing lazily when we shouldn't. This is doubly
true within small to medium-sized tables, where the cost of catching
up on freezing cannot ever be too bad (concentrations of unfrozen
pages in one big table are what really hurt users).

--
Peter Geoghegan


