Re: Add LSN <-> time conversion functionality

From: Robert Haas
Subject: Re: Add LSN <-> time conversion functionality
Msg-id: CA+TgmoZ-huAspWFJ8xP7u7bn8BDVJQ5sNjKGwXug_G6pYOHj0Q@mail.gmail.com
In response to: Re: Add LSN <-> time conversion functionality (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:
> How could we do this? We have 1s precision, so we start with buckets for
> each second. And we want to allow merging stuff nicely. The smallest
> merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
> 1s -> 10s -> 100s -> 1000s instead.
>
> So we start with 100x one-second buckets
>
> [A_0, A_1, ..., A_99]  -> 100 x 1s buckets
> [B_0, B_1, ..., B_99]  -> 100 x 10s buckets
> [C_0, C_1, ..., C_99]  -> 100 x 100s buckets
> [D_0, D_1, ..., D_99]  -> 100 x 1000s buckets
>
> We start by adding data into A_k buckets. After filling all 100 of them,
> we grab the oldest 10 buckets, and combine/move them into B_k. And so
> on, until B gets full too. Then we grab the 10 oldest B_k entries,
> and move them into C. and so on. For D the oldest entries would get
> discarded, or we could add another layer with each bucket representing
> 10k seconds.

Yeah, this kind of algorithm makes sense to me, although as you say
later, I don't think we need this amount of precision. I also think
you're right to point out that this provides certain guaranteed
behavior.
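
As a concrete sketch of that cascade (in Python, with hypothetical names; the patch itself may structure this quite differently), the scheme could look like:

```python
# Sketch of the cascading bucket scheme described above. Names like
# TimeBuckets and lsn_before are hypothetical, not from the actual patch.
from collections import deque

LEVELS = 4        # A (1s), B (10s), C (100s), D (1000s)
PER_LEVEL = 100   # 100 buckets per level
MERGE = 10        # 10 finer buckets combine into one coarser bucket

class TimeBuckets:
    def __init__(self):
        # each level holds (start_time, lsn_at_start) pairs, oldest first
        self.levels = [deque() for _ in range(LEVELS)]

    def add_sample(self, t, lsn):
        """Record a 1-second sample; cascade merges upward as levels fill."""
        self.levels[0].append((t, lsn))
        for i in range(LEVELS - 1):
            if len(self.levels[i]) > PER_LEVEL:
                # grab the MERGE oldest buckets; the oldest one's start
                # time and LSN bound the merged range, so keep just that
                merged = [self.levels[i].popleft() for _ in range(MERGE)]
                self.levels[i + 1].append(merged[0])
        # the coarsest level simply discards its oldest entries
        if len(self.levels[-1]) > PER_LEVEL:
            self.levels[-1].popleft()

    def lsn_before(self, t):
        """Newest recorded LSN at or before time t (None if t predates history)."""
        best = None
        for level in reversed(self.levels):   # coarsest/oldest level first
            for start, lsn in level:          # within a level, oldest first
                if start > t:
                    return best
                best = lsn
        return best
```

With four levels of 100 buckets, total coverage once full is 100 * (1 + 10 + 100 + 1000) = 111100 s, i.e. the ~30h figure above; recent data stays at 1s resolution while older data degrades gracefully to 10s, 100s, and 1000s granularity.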

> A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
> (or want) to keep a longer history?

I think there is a difference of opinion about this between Melanie
and me. I feel like we should be designing something that does the
exact job we need done for the freezing stuff, and if anyone else can
use it, that's a bonus. For that, I feel that 300h is more than
plenty. The goal of the freezing stuff, roughly speaking, is to answer
the question "will this be unfrozen real soon?". "Real soon" could
arguably mean a minute or an hour, but I don't think it makes sense
for it to be a week. If we're freezing data now that has a good chance
of being unfrozen again within 7 days, we should just freeze it
anyway. The cost of freezing isn't really all that high. If we keep
freezing pages that are going to be unfrozen again within seconds or
minutes, we pay those freezing costs enough times that they become
material, but I have difficulty imagining that it ever matters if we
re-freeze the same page every week. It's OK to be wrong as long as we
aren't wrong too often, and I think that being wrong once per page per
week isn't too often.

But I think Melanie was hoping to create something more general, which
on one level is understandable, but on the other hand it's unclear
what the goals are exactly. If we limit our scope to specifically
VACUUM, we can make reasonable guesses about how much precision we
need and for how long. But a hypothetical other client of this
infrastructure could need anything at all, which makes it very unclear
what the best design is, IMHO.

> Isn't this a sign this does not quite fit into pgstats? Even if this
> happens to deal with unsafe restarts, replica promotions and so on, what
> if the user just does pg_stat_reset? That already causes trouble because
> we simply forget deleted/updated/inserted tuples. If we also forget data
> used for freezing heuristics, that does not seem great ...

+1.

> Wouldn't it be better to write this into WAL as part of a checkpoint (or
> something like that?), and make bgwriter not only add LSN/timestamp
> pairs into the stream, but also write them into WAL? It's already waking
> up; on idle systems ~32B written to WAL does not matter, and on busy
> systems it's just noise.

I am not really sure of the best place to put this data. I agree that
pgstat doesn't feel like quite the right place. But I'm not quite sure
that putting it into every checkpoint is the right idea either.

--
Robert Haas
EDB: http://www.enterprisedb.com


