Re: New strategies for freezing, advancing relfrozenxid early - Mailing list pgsql-hackers

From Andres Freund
Subject Re: New strategies for freezing, advancing relfrozenxid early
Date
Msg-id 20230127183622.pktggu7iexd7jzyh@awork3.anarazel.de
Whole thread Raw
In response to Re: New strategies for freezing, advancing relfrozenxid early  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2023-01-27 12:53:58 -0500, Robert Haas wrote:
> On Fri, Jan 27, 2023 at 12:58 AM Andres Freund <andres@anarazel.de> wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >      FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% of
> > the LSN distance to the last VACUUM.
>
> I think this might not be quite the right idea for a couple of reasons.

It's definitely not perfect.  If we had an approximate LSN->time map as
general infrastructure, we could do a lot better. I think it'd be reasonably
easy to maintain that in the autovacuum launcher, for example.


One thing worth calling out here, because it's not obvious from the code
quoted above in isolation, is that what I was trying to refine here was the
decision when to perform opportunistic freezing *of already dirty pages that
do not require an FPI*.

So all that we need to prevent here is freezing very hotly updated data, where
the WAL overhead of the freeze records would be noticable, because we
constantly VACUUM, due to the high turnover.


> First, suppose that the table is being processed just by autovacuum
> (no manual VACUUM operations) and that the rate of WAL generation is
> pretty even, so that LSN age is a good proxy for time. If autovacuum
> processes the table once per hour, this will freeze if it hasn't been
> updated in the last six minutes. That sounds good. But if autovacuum
> processes the table once per day, then this will freeze if it hasn't
> been updated in 2.4 hours. That might be OK, but it sounds a little on
> the long side.

You're right. I was thinking of the "lsn_since_last_vacuum" because I was
posulating it being useful elsewhere in the thread (but for eager strategy
logic) - but here that's really not very relevant.

Given that we're dealing with already dirty pages not requiring an FPI, I
think a much better "reference LSN" would be the LSN of the last checkpoint
(LSN of the last checkpoint record, not the current REDO pointer).


> Second, and more seriously, I think this would, in some circumstances,
> lead to tremendously unstable behavior. Suppose somebody does a bunch
> of work on a table and then they're like "oh, we should clean up,
> VACUUM" and it completes quickly because it's been a while since the
> last vacuum and so it doesn't freeze much. Then, for whatever reason,
> they decide to run it one more time, and it goes bananas and starts
> freezing all kinds of stuff because the LSN distance since the last
> vacuum is basically zero. Or equally, you run a manual VACUUM, and you
> get completely different behavior depending on how long it's been
> since the last autovacuum ran.

I don't think this quite applies to the scenario at hand, because it's
restricted to already dirty pages. And the max increased overhead is also
small due to that - so occasionally getting it wrong is that impactful.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: improving user.c error messages
Next
From: Peter Geoghegan
Date:
Subject: Re: New strategies for freezing, advancing relfrozenxid early