Re: Eager page freeze criteria clarification - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Eager page freeze criteria clarification
Date
Msg-id CA+TgmoYcWisxLBL-pXu13OevThLOXm20oJqjNRZtKkhXsY92XA@mail.gmail.com
In response to Re: Eager page freeze criteria clarification  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

When I first read this, I thought this sounded right, except for the
naming, which doesn't mention freezing. But on further thought, this
feels like it might not be the right thing. Why do I care how many
super-old pages (regardless of how exactly we define super-old) I
unfroze? I think what I care about is more like: how often do I have
to unfreeze a recently-frozen page in order to complete a DML
operation?

For example, consider a table with a million pages, 10 of which are
frozen. 1 was frozen a long time ago, 9 were frozen recently, and the
other 999,990 are unfrozen. I update every page in the table. Well,
now older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the
two is terrible: 90% of the frozen pages I encountered were
recently-frozen. But that's irrelevant. What matters is that if I had
frozen less aggressively, I could have saved myself 9 unfreeze
operations across a million page modifications. So actually I'm doing
great: only 0.0009% of my page modifications required an unfreeze that
I maybe should have avoided and didn't.

Likewise, if all of the pages had been frozen, but only 9 were frozen
recently, I'm doing exactly equally great. But if the number of
recently frozen pages were higher, then I'd be doing less great. So I
think comparing the number of young pages unfrozen to the number of
old pages unfrozen is the wrong test, and comparing it instead to the
number of pages modified is a better test.
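To put numbers on that comparison (a toy sketch, not real statistics
code):

```python
# Toy numbers from the example above: a million-page table in which we
# modified every page, un-freezing 10 frozen pages along the way; 9 of
# those had been frozen recently, 1 long ago.
pages_modified = 1_000_000
unfroze_young = 9   # un-frozen pages that had been frozen recently
unfroze_old = 1     # un-frozen pages that had been frozen long ago

# The young-vs-old ratio looks terrible ...
young_ratio = unfroze_young / (unfroze_young + unfroze_old)
assert young_ratio == 0.9  # 90% of the pages we un-froze were recent

# ... but measured against total page modifications it's negligible.
young_per_modification = unfroze_young / pages_modified
assert young_per_modification == 9e-06  # i.e. 0.0009%
```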

But I don't think it's completely right, either. Imagine a table with
100 recently frozen pages. I modify the table and unfreeze one of
them. So, 100% of the pages that I unfroze were recently-frozen. Does
this mean that we should back off and freeze less aggressively? Not
really. It seems like I actually got this case 99% right. So another
idea could be to look at what percentage of frozen pages I end up
un-freezing shortly thereafter. That gets this example right. But it
gets the earlier example with a million pages wrong.
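The same arithmetic for this second example (again just a toy):

```python
# 100 recently frozen pages, of which a modification un-froze just one.
frozen_recently = 100
unfrozen_soon_after = 1

# Fraction of the recent freezing that was undone shortly thereafter:
wasted = unfrozen_soon_after / frozen_recently
assert wasted == 0.01  # 99% of the freezing stuck, so it was worth doing
```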

Maybe there's something to combining these ideas. If the percentage of
page modifications (of clean pages, say, so that we don't skew the
stats as much when a page is modified many times in a row) that have
to un-freeze a recently-frozen page is low, then that means there's no
real performance penalty for whatever aggressive freezing we're doing.
Even if we un-freeze a zillion pages, if that happens over the course
of modifying 100 zillion pages, it's not material. On the flip side,
if the percentage of frozen pages that get unfrozen soon thereafter is
low, then it feels worth doing even if most page modifications end up
un-freezing a recently-frozen page, because on net we're still ending
up with a lot of extra stuff frozen.

Said differently, we're freezing too aggressively if un-freezing is
both adding a significant expense (i.e. the foreground work we're
doing often requires un-freezing pages) and ineffective (i.e. we don't
end up with frozen pages left over). If it has one of those problems,
it's still OK, but if it has both, we need to back off.
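A rough sketch of that two-part test (the function name and the
thresholds are invented purely for illustration):

```python
def should_back_off(young_unfreezes, page_modifications,
                    pages_frozen, pages_unfrozen_soon,
                    expense_threshold=0.05, waste_threshold=0.5):
    # Expense: how often does foreground work have to un-freeze a
    # recently frozen page?
    expensive = young_unfreezes / page_modifications > expense_threshold
    # Ineffectiveness: what fraction of the freezing was undone soon after?
    ineffective = pages_unfrozen_soon / pages_frozen > waste_threshold
    # Freezing is too aggressive only if BOTH hold; either one alone
    # is tolerable.
    return expensive and ineffective
```

On the million-page example above (9 young un-freezes across a million
modifications, 9 of 10 frozen pages un-frozen), this returns False: the
freezing was mostly wasted effort, but it was too cheap to matter.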

Maybe that's still not quite right, but it's the best I've got this morning.

> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,
>
>   if insert LSN - RedoRecPtr < insert LSN - page LSN
>   page is older than the most recent checkpoint start, so freeze it
>   regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?
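One small observation on the quoted condition before getting to those
questions: the insert LSN cancels out of both sides, so it reduces to a
plain comparison of the page LSN against the redo pointer. A sketch
treating LSNs as plain integers:

```python
def frozen_before_checkpoint_start(page_lsn, redo_rec_ptr, insert_lsn):
    # The quoted test: insert LSN - RedoRecPtr < insert LSN - page LSN ...
    quoted = (insert_lsn - redo_rec_ptr) < (insert_lsn - page_lsn)
    # ... is equivalent to: the page was last modified before the redo
    # pointer of the most recent checkpoint.
    simplified = page_lsn < redo_rec_ptr
    assert quoted == simplified
    return simplified
```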

My guess is that, when we're being aggressive, we should be aiming to
freeze all pages that can be completely frozen with no additional
criterion. Any additional criterion that we add here is likely to mess
up the insert-only table case, and I'm not really sure what it saves
us from. On the other hand, when we're being non-aggressive, we might
still sometimes want to freeze in situations where we historically
haven't. Consider the following three examples:

1. pgbench_accounts table, standard run
2. pgbench_accounts table, highly skewed run
3. A slowly growing table with a hot tail: records are inserted, updated
heavily for a while, and thereafter updated only very occasionally.

Aggressive freezing could misfire in all of these cases, because all
of them have some portion of the table where rows are being updated
very frequently. But in the second and third cases, there's also a
cold portion of the table where aggressive freezing is still OK.
Unless we could figure out how to get stats with page-level
granularity, which sounds infeasible, I think cases (2) and (3) will
be hard to get completely right. However, if we opportunistically
froze any page that hadn't been modified in the last checkpoint cycle,
or any page that hadn't been modified in the last 2 checkpoint cycles,
or any page we had to read into shared_buffers (following an earlier
suggestion from Andres), or any page that hadn't been modified in the
last 5 minutes, it seems like we could get a bit of extra freezing
that we're not getting today without too many downsides. We probably
need whatever criterion we use here to be pretty lenient to avoid
freezing the hot pages, which probably means we'll sometimes fail to
freeze pages that are actually cold but we don't quite realize it. But
that could still be better than today.
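A sketch of what such a lenient opportunistic test might look like
(every name and criterion here is illustrative, and timestamps stand in
for LSNs):

```python
from dataclasses import dataclass

@dataclass
class Page:
    last_modified: float  # timestamp; stands in for the page LSN

CHECKPOINT_CYCLES = 2   # "not modified in the last 2 checkpoint cycles"
FIVE_MINUTES = 5 * 60   # seconds

def opportunistically_freeze(page, now, checkpoint_starts, was_just_read_in):
    # checkpoint_starts: start times of recent checkpoints, oldest first.
    # Freeze any page we had to read into shared_buffers anyway
    # (per Andres's earlier suggestion) ...
    if was_just_read_in:
        return True
    # ... or any page untouched for the last couple of checkpoint cycles ...
    if (len(checkpoint_starts) >= CHECKPOINT_CYCLES
            and page.last_modified < checkpoint_starts[-CHECKPOINT_CYCLES]):
        return True
    # ... or any page that simply hasn't been modified in a while.
    return now - page.last_modified > FIVE_MINUTES
```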

--
Robert Haas
EDB: http://www.enterprisedb.com


