Re: Add LSN <-> time conversion functionality - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Add LSN <-> time conversion functionality
Date
Msg-id 0747454d-5204-4363-9098-bcce5236cf1d@enterprisedb.com
Whole thread Raw
In response to Re: Add LSN <-> time conversion functionality  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Add LSN <-> time conversion functionality
List pgsql-hackers

On 2/22/24 03:45, Melanie Plageman wrote:
> Thanks so much for reviewing!
> 
> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> When I first read this, I immediately started wondering if this might
>> use the commit timestamp stuff we already have. Because for each commit
>> we already store the LSN and commit timestamp, right? But I'm not sure
>> that would be a good match - the commit_ts serves a very special purpose
>> of mapping XID => (LSN, timestamp), I don't see how to make it work for
>> (LSN=>timestmap) and (timestamp=>LSN) very easily.
> 
> I took a look at the code in commit_ts.c, and I really don't see a way
> of reusing any of this commit<->timestamp infrastructure for
> timestamp<->LSN mappings.
> 
>> As for the inner workings of the patch, my understanding is this:
>>
>> - "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
>> points, but those points are really "buckets" that grow larger and
>> larger for older periods of time.
> 
> Yes, they are buckets in the sense that they represent multiple values
> but each contains a single LSNTime value which is the minimum of all
> the LSNTimes we "merged" into that single array element. In order to
> represent a range of time, you need to use two array elements. The
> linear interpolation from time <-> LSN is all done with two elements.
> 
>> - AFAIK each entry represent an interval of time, and the next (older)
>> interval is twice as long, right? So the first interval is 1 second,
>> then 2 seconds, 4 seconds, 8 seconds, ...
>>
>> - But I don't understand how the LSNTimeline entries are "aging" and get
>> less accurate, while the "current" bucket is short. lsntime_insert()
>> seems to simply move to the next entry, but doesn't that mean we insert
>> the entries into larger and larger buckets?
> 
> Because the earlier array elements can represent fewer logical members
> than later ones and because elements are merged into the next element
> when space runs out, later array elements will contain older data and
> more of it, so those "ranges" will be larger. But, after thinking
> about it and also reading your feedback, I realized my algorithm was
> silly because it starts merging logical members before it has even
> used the whole array.
> 
> The attached v3 has a new algorithm. Now, LSNTimes are added from the
> end of the array forward until all array elements have at least one
> logical member (array length == volume). Once array length == volume,
> new LSNTimes will result in merging logical members in existing
> elements. We want to merge older members because those can be less
> precise. So, the number of logical members per array element will
> always monotonically increase starting from the beginning of the array
> (which contains the most recent data) and going to the end. We want to
> use all the available space in the array. That means that each LSNTime
> insertion will always result in a single merge. We want the timeline
> to be inclusive of the oldest data, so merging means taking the
> smaller value of two LSNTime values. I had to pick a rule for choosing
> which elements to merge. So, I choose the merge target as the oldest
> element whose logical membership is < 2x its predecessor. I merge the
> merge target's predecessor into the merge target. Then I move all of
> the intervening elements down 1. Then I insert the new LSNTime at
> index 0.
> 

I can't help but think about t-digest [1], which also merges data into
variable-sized buckets (called centroids, which is a pair of values,
just like LSNTime). But the merging is driven by something called "scale
function" which I found like a pretty nice approach to this, and it
yields some guarantees regarding accuracy. I wonder if we could do
something similar here ...

The t-digest is a way to approximate quantiles, and the default scale
function is optimized for best accuracy on the extremes (close to 0.0
and 1.0), but it's possible to use scale functions that optimize only
for accuracy close to 1.0.

This may be misguided, but I see similarity between quantiles and what
LSNTimeline does - timestamps are ordered, and quantiles close to 0.0
are "old timestamps" while quantiles close to 1.0 are "now".

And t-digest also defines a pretty efficient algorithm to merge data in
a way that gradually combines older "buckets" into larger ones.

>> - The comments never really spell what amount of time the entries cover
>> / how granular it is. My understanding is it's simply measured in number
>> of entries added, which is assumed to be constant and drive by
>> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
>> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>>
>> Is there some case where bgwriter would just loop without sleeping,
>> filling the timeline much faster? (I can't think of any, but ...)
> 
> bgwriter will wake up when there are buffers to flush, which is likely
> correlated with there being new LSNs. So, actually it seems like it
> might work well to rely on only filling the timeline when there are
> things for bgwriter to do.
> 
>> - The LSNTimeline comment claims an array of size 64 is large enough to
>> not need to care about filling it, but maybe it should briefly explain
>> why we can never fill it (I guess 2^64 is just too many).
> 
> The new structure fits a different number of members. I have yet to
> calculate that number, but it should be explained in the comments once
> I do.
> 
> For example, if we made an LSNTimeline with volume 4, once every
> element had one LSNTime and we needed to start merging, the following
> is how many logical members each element would have after each of four
> merges:
> 1111
> 1112
> 1122
> 1114
> 1124
> So, if we store the number of members as an unsigned 64-bit int and we
> have an LSNTimeline with volume 64, what is the maximum number of
> members can we store if we hold all of the invariants described in my
> algorithm above (we only merge when required, every element holds < 2x
> the number of logical members as its predecessor, we do exactly one
> merge every insertion [when required], membership must monotonically
> increase [choose the oldest element meeting the criteria when deciding
> what to merge])?
> 

I guess that should be enough for (2^64-1) logical members, because it's
a sequence 1, 2, 4, 8, ..., 2^63. Seems enough.

But now that I think about it, does it make sense to do the merging
based on the number of logical members? Shouldn't this really be driven
by the "length" of the time interval the member covers?

>> - I don't quite understand why 0005 adds the functions to pageinspect.
>> This has nothing to do with pages, right?
> 
> You're right. I just couldn't think of a good place to put the
> functions. In version 3, I just put the SQL functions in pgstat_wal.c
> and made them generally available (i.e. not in a contrib module). I
> haven't added docs back yet. But perhaps a section near the docs
> describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should
> put the SQL function source code in pgstatfuncs.c -- I kind of prefer
> it in pgstat_wal.c but there are no other SQL functions there.
> 

OK, pgstat_wal seems like a good place.

>> - Not sure why we need 0001. Just so that the "estimate" functions in
>> 0002 have a convenient "start" point? Surely we could look at the
>> current LSNTimeline data and use the oldest value, or (if there's no
>> data) use the current timestamp/LSN?
> 
> When there are 0 or 1 entries in the timeline you'll get an answer
> that could be very off if you just return the current timestamp or
> LSN. I guess that is okay?
> 
>> - I wonder what happens if we lose the data - we know that if people
>> reset statistics for whatever reason (or just lose them because of a
>> crash, or because they're on a replica), bad things happen to
>> autovacuum. What's the (expected) impact on pruning?
> 
> This is an important question. Because stats aren't crashsafe, we
> could return very inaccurate numbers for some time/LSN values if we
> crash. I don't actually know what we could do about that. When I use
> the LSNTimeline for the freeze heuristic it is less of an issue
> because the freeze heuristic has a fallback strategy when there aren't
> enough stats to do its calculations. But other users of this
> LSNTimeline will simply get highly inaccurate results (I think?). Is
> there anything we could do about this? It seems bad.
> 
> Andres had brought up something at some point about, what if the
> database is simply turned off for awhile and then turned back on. Even
> if you cleanly shut down, will there be "gaps" in the timeline? I
> think that could be okay, but it is something to think about.
> 
>> - What about a SRF function that outputs the whole LSNTimeline? Would be
>> useful for debugging / development, I think. (Just a suggestion).
> 
> Good idea! I've added this. Though, maybe there was a simpler way to
> implement than I did.
> 

Thanks. I'll take a look.

> Just a note, all of my comments could use a lot of work, but I want to
> get consensus on the algorithm before I make sure and write about it
> in a perfect way.
> 

Makes sense, as long as the comments are sufficiently clear. It's hard
to reach consensus on something not explained clearly enough.


regards


[1]
https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: "Amonson, Paul D"
Date:
Subject: RE: Popcount optimization using AVX512
Next
From: Nathan Bossart
Date:
Subject: Re: Popcount optimization using AVX512