Re: Eager page freeze criteria clarification - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Eager page freeze criteria clarification
Date
Msg-id 20230928013500.ujersr7twi5k7oxn@alap3.anarazel.de
Whole thread Raw
In response to Re: Eager page freeze criteria clarification  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Eager page freeze criteria clarification
List pgsql-hackers
On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote:
> On Wed, Sep 27, 2023 at 3:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > One way to deal with that would be to not track the average age in
> > > LSN-difference-bytes, but convert the value to some age metric at that
> > > time. If we e.g. were to convert the byte-age into an approximate age in
> > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> > > checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a mean
> > > of that age would probably be fine.
> >
> > Yes. I think it's possible that we could even get by with just two
> > buckets. Say current checkpoint and not. Or current-or-previous
> > checkpoint and not. And just look at what percentage of accesses fall
> > into this first bucket -- it should be small or we're doing it wrong.
> > It seems like the only thing we actually need to avoid is freezing the
> > same ages over and over again in a tight loop.
>
> At the risk of seeming too execution-focused, I want to try and get more
> specific.

I think that's a good intuition :)

> Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

> This has the downside of counting most unfreezings directly after a
> checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
> longer_frozen_duration at certain times in the checkpoint cycle.

Yea - I don't think just using before/after checkpoint is a good measure. As
you say, it'd be quite jumpy around checkpoints - even though the freezing
behaviour hasn't materially changed. I think using the *distance* between
checkpoints would be a more reliable measure, i.e. if (insert_lsn - page_lsn)
< recent_average_lsn_diff_between_checkpoints, then it's recently modified,
otherwise not.

One problem with using checkpoints "distances" to control things is
forced/immediate checkpoints. The fact that a base backup was started (and
thus a checkpoint completed much earlier than it would have otherwise)
shouldn't make our system assume that the overall behaviour is quite different
going forward.


> Now, I'm trying to imagine how this would interact in a meaningful way
> with opportunistic freezing behavior during vacuum.
>
> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,

Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low
value.


>   if insert LSN - RedoRecPtr < insert LSN - page LSN
>   page is older than the most recent checkpoint start, so freeze it
>   regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

I'm thinking something very roughly along these lines could make sense:

page_lsn_age = insert_lsn - page_lsn;

if (dirty && !fpi)
{
   /*
    * If we can freeze without an FPI, be quite agressive about
    * opportunistically freezing. We just need to prevent freezing
    * when the table is constantly being rewritten. It's ok to make mistakes
    * initially - the rate of unfreezes will quickly stop us from making
    * mistakes as often.
    */
#define NO_FPI_FREEZE_FACTOR 10.0
   if (page_lsn_age >
       average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * NO_FPI_FREEZE_FACTOR)
      freeze = true;
}
else
{
   /*
    * Freezing would emit an FPI and/or dirty the page, making freezing quite
    * a bit more costly. Be more hesitant about freezing recently modified
    * data, unless it's very rare that we unfreeze recently modified data.
    * For insert-only/mostly tables, unfreezes should be rare, so we'll still
    * freeze most of the time.
    */
#define FPI_FREEZE_FACTOR 1
   if (page_lsn_age >
       average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * FPI_FREEZE_FACTOR)
       freeze = true;
}

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)
Next
From: Andres Freund
Date:
Subject: Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()