Re: Add LSN <-> time conversion functionality - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Add LSN <-> time conversion functionality
Date
Msg-id CAAKRu_bw7Pgw8Mi9LJrBkFvPPHgvVjPphrT8ugbzs-2V0f+1Rw@mail.gmail.com
Whole thread Raw
In response to Re: Add LSN <-> time conversion functionality  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Add LSN <-> time conversion functionality
Re: Add LSN <-> time conversion functionality
List pgsql-hackers
Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> When I first read this, I immediately started wondering if this might
> use the commit timestamp stuff we already have. Because for each commit
> we already store the LSN and commit timestamp, right? But I'm not sure
> that would be a good match - the commit_ts serves a very special purpose
> of mapping XID => (LSN, timestamp), I don't see how to make it work for
> (LSN=>timestmap) and (timestamp=>LSN) very easily.

I took a look at the code in commit_ts.c, and I really don't see a way
of reusing any of this commit<->timestamp infrastructure for
timestamp<->LSN mappings.

> As for the inner workings of the patch, my understanding is this:
>
> - "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
> points, but those points are really "buckets" that grow larger and
> larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values
but each contains a single LSNTime value which is the minimum of all
the LSNTimes we "merged" into that single array element. In order to
represent a range of time, you need to use two array elements. The
linear interpolation from time <-> LSN is all done with two elements.

> - AFAIK each entry represent an interval of time, and the next (older)
> interval is twice as long, right? So the first interval is 1 second,
> then 2 seconds, 4 seconds, 8 seconds, ...
>
> - But I don't understand how the LSNTimeline entries are "aging" and get
> less accurate, while the "current" bucket is short. lsntime_insert()
> seems to simply move to the next entry, but doesn't that mean we insert
> the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members
than later ones and because elements are merged into the next element
when space runs out, later array elements will contain older data and
more of it, so those "ranges" will be larger. But, after thinking
about it and also reading your feedback, I realized my algorithm was
silly because it starts merging logical members before it has even
used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array. That means that each LSNTime
insertion will always result in a single merge. We want the timeline
to be inclusive of the oldest data, so merging means taking the
smaller value of two LSNTime values. I had to pick a rule for choosing
which elements to merge. So, I choose the merge target as the oldest
element whose logical membership is < 2x its predecessor. I merge the
merge target's predecessor into the merge target. Then I move all of
the intervening elements down 1. Then I insert the new LSNTime at
index 0.

> - The comments never really spell what amount of time the entries cover
> / how granular it is. My understanding is it's simply measured in number
> of entries added, which is assumed to be constant and drive by
> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>
> Is there some case where bgwriter would just loop without sleeping,
> filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely
correlated with there being new LSNs. So, actually it seems like it
might work well to rely on only filling the timeline when there are
things for bgwriter to do.

> - The LSNTimeline comment claims an array of size 64 is large enough to
> not need to care about filling it, but maybe it should briefly explain
> why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to
calculate that number, but it should be explained in the comments once
I do.

For example, if we made an LSNTimeline with volume 4, once every
element had one LSNTime and we needed to start merging, the following
is how many logical members each element would have after each of four
merges:
1111
1112
1122
1114
1124
So, if we store the number of members as an unsigned 64-bit int and we
have an LSNTimeline with volume 64, what is the maximum number of
members can we store if we hold all of the invariants described in my
algorithm above (we only merge when required, every element holds < 2x
the number of logical members as its predecessor, we do exactly one
merge every insertion [when required], membership must monotonically
increase [choose the oldest element meeting the criteria when deciding
what to merge])?

> - I don't quite understand why 0005 adds the functions to pageinspect.
> This has nothing to do with pages, right?

You're right. I just couldn't think of a good place to put the
functions. In version 3, I just put the SQL functions in pgstat_wal.c
and made them generally available (i.e. not in a contrib module). I
haven't added docs back yet. But perhaps a section near the docs
describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should
put the SQL function source code in pgstatfuncs.c -- I kind of prefer
it in pgstat_wal.c but there are no other SQL functions there.

> - Not sure why we need 0001. Just so that the "estimate" functions in
> 0002 have a convenient "start" point? Surely we could look at the
> current LSNTimeline data and use the oldest value, or (if there's no
> data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

> - I wonder what happens if we lose the data - we know that if people
> reset statistics for whatever reason (or just lose them because of a
> crash, or because they're on a replica), bad things happen to
> autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

> - What about a SRF function that outputs the whole LSNTimeline? Would be
> useful for debugging / development, I think. (Just a suggestion).

Good idea! I've added this. Though, maybe there was a simpler way to
implement than I did.

Just a note, all of my comments could use a lot of work, but I want to
get consensus on the algorithm before I make sure and write about it
in a perfect way.

- Melanie

[1] https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-COMMIT-TIMESTAMP

Attachment

pgsql-hackers by date:

Previous
From: Richard Guo
Date:
Subject: Re: POC: GROUP BY optimization
Next
From: Michael Paquier
Date:
Subject: Re: Injection points: some tools to wait and wake