Re: Add LSN <-> time conversion functionality - Mailing list pgsql-hackers

From: Melanie Plageman
Subject: Re: Add LSN <-> time conversion functionality
Msg-id: CAAKRu_bw7Pgw8Mi9LJrBkFvPPHgvVjPphrT8ugbzs-2V0f+1Rw@mail.gmail.com
In response to: Re: Add LSN <-> time conversion functionality (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers

Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> When I first read this, I immediately started wondering if this might
> use the commit timestamp stuff we already have. Because for each
> commit we already store the LSN and commit timestamp, right? But I'm
> not sure that would be a good match - the commit_ts serves a very
> special purpose of mapping XID => (LSN, timestamp), I don't see how
> to make it work for (LSN=>timestamp) and (timestamp=>LSN) very
> easily.

I took a look at the code in commit_ts.c, and I really don't see a way
of reusing any of this commit<->timestamp infrastructure for
timestamp<->LSN mappings.

> As for the inner workings of the patch, my understanding is this:
>
> - "LSNTimeline" consists of "LSNTime" entries representing (LSN, ts)
> points, but those points are really "buckets" that grow larger and
> larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values,
but each contains a single LSNTime value, which is the minimum of all
the LSNTimes we "merged" into that single array element. In order to
represent a range of time, you need to use two array elements. The
linear interpolation from time <-> LSN is all done with two elements.
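
Since a range is represented by two adjacent array elements, the
conversion is just linear interpolation between their two LSNTime
points. A minimal sketch of that idea follows; the struct layout and
the helper name (estimate_lsn_at_time) are illustrative assumptions,
not taken from the patch:

    #include <stdint.h>

    typedef struct LSNTime
    {
        uint64_t    time;   /* e.g. microseconds since epoch */
        uint64_t    lsn;
    } LSNTime;

    /*
     * Estimate the LSN at a given time by interpolating between two
     * LSNTime points, assuming WAL advanced at a constant rate between
     * them and that newer.lsn >= older.lsn (LSNs are monotonic).
     */
    static uint64_t
    estimate_lsn_at_time(LSNTime older, LSNTime newer, uint64_t time)
    {
        double      frac;

        /* Clamp to the known endpoints rather than extrapolating. */
        if (time <= older.time)
            return older.lsn;
        if (time >= newer.time)
            return newer.lsn;

        frac = (double) (time - older.time) /
               (double) (newer.time - older.time);
        return older.lsn + (uint64_t) ((double) (newer.lsn - older.lsn) * frac);
    }

The inverse direction (estimating a timestamp for a target LSN) is the
same arithmetic with the time and lsn fields swapped.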
> - AFAIK each entry represents an interval of time, and the next
> (older) interval is twice as long, right? So the first interval is 1
> second, then 2 seconds, 4 seconds, 8 seconds, ...
>
> - But I don't understand how the LSNTimeline entries are "aging" and
> get less accurate, while the "current" bucket is short.
> lsntime_insert() seems to simply move to the next entry, but doesn't
> that mean we insert the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members
than later ones, and because elements are merged into the next element
when space runs out, later array elements will contain older data and
more of it, so those "ranges" will be larger.

But, after thinking about it and also reading your feedback, I realized
my algorithm was silly because it starts merging logical members before
it has even used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array, which means that each LSNTime
insertion will always result in a single merge. We want the timeline to
be inclusive of the oldest data, so merging means taking the smaller of
two LSNTime values.

I had to pick a rule for choosing which elements to merge. So, I chose
the merge target as the oldest element whose logical membership is
< 2x its predecessor's. I merge the merge target's predecessor into the
merge target, move all of the intervening elements down by one, and
then insert the new LSNTime at index 0.
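
To make the merge rule concrete, here is a minimal sketch of that
insertion algorithm. The names (LSNTimeline, lsntime_insert,
LSNTIMELINE_VOLUME) and the per-element members counter are
illustrative assumptions, and the fallback when no element satisfies
the < 2x rule is my own guess, since the rule above doesn't spell out
that case:

    #include <stdint.h>

    #define LSNTIMELINE_VOLUME 64

    typedef struct LSNTime
    {
        uint64_t    time;       /* e.g. microseconds since epoch */
        uint64_t    lsn;
        uint64_t    members;    /* logical members merged into this element */
    } LSNTime;

    typedef struct LSNTimeline
    {
        int         length;     /* elements in use */
        LSNTime     data[LSNTIMELINE_VOLUME];   /* index 0 is newest */
    } LSNTimeline;

    static void
    lsntime_insert(LSNTimeline *timeline, uint64_t time, uint64_t lsn)
    {
        LSNTime     entry = {.time = time, .lsn = lsn, .members = 1};
        int         target;

        /* Fill from the end of the array forward until every element is used. */
        if (timeline->length < LSNTIMELINE_VOLUME)
        {
            timeline->data[LSNTIMELINE_VOLUME - 1 - timeline->length] = entry;
            timeline->length++;
            return;
        }

        /*
         * Merge target: the oldest (highest-index) element whose membership
         * is < 2x its newer neighbor's. Fall back to the oldest pair if
         * nothing qualifies -- an assumption, since the rule above doesn't
         * cover that case.
         */
        target = LSNTIMELINE_VOLUME - 1;
        for (int i = LSNTIMELINE_VOLUME - 1; i > 0; i--)
        {
            if (timeline->data[i].members < 2 * timeline->data[i - 1].members)
            {
                target = i;
                break;
            }
        }

        /*
         * The target is older than its predecessor, so it already holds the
         * minimum LSNTime of the pair; just absorb the predecessor's members.
         */
        timeline->data[target].members += timeline->data[target - 1].members;

        /* Shift the intervening elements down one to open a slot at index 0. */
        for (int i = target - 1; i > 0; i--)
            timeline->data[i] = timeline->data[i - 1];

        timeline->data[0] = entry;
    }

With a volume of 4, this sketch reproduces the membership sequence in
the example further down (1111, 1112, 1122, 1114, 1124).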
> - The comments never really spell out what amount of time the entries
> cover / how granular it is. My understanding is it's simply measured
> in number of entries added, which is assumed to be constant and
> driven by bgwriter_delay, right? Which is 200ms by default. Which
> seems fine, but isn't the hibernation (HIBERNATE_FACTOR) going to
> mess with it?
>
> Is there some case where bgwriter would just loop without sleeping,
> filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely
correlated with there being new LSNs. So, actually, it seems like it
might work well to rely on filling the timeline only when there are
things for bgwriter to do.

> - The LSNTimeline comment claims an array of size 64 is large enough
> to not need to care about filling it, but maybe it should briefly
> explain why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to
calculate that number, but it should be explained in the comments once
I do.

For example, if we made an LSNTimeline with volume 4, once every
element had one LSNTime and we needed to start merging, this is how
many logical members each element would have after each of four
merges:

1111
1112
1122
1114
1124

So, if we store the number of members as an unsigned 64-bit int and we
have an LSNTimeline with volume 64, what is the maximum number of
members we can store if we hold all of the invariants described in my
algorithm above (we only merge when required, every element holds < 2x
the number of logical members as its predecessor, we do exactly one
merge every insertion [when required], and membership must
monotonically increase [choose the oldest element meeting the criteria
when deciding what to merge])?

> - I don't quite understand why 0005 adds the functions to
> pageinspect. This has nothing to do with pages, right?

You're right. I just couldn't think of a good place to put the
functions. In version 3, I put the SQL functions in pgstat_wal.c and
made them generally available (i.e. not in a contrib module). I
haven't added docs back yet. But perhaps a section near the docs
describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should
put the SQL function source code in pgstatfuncs.c -- I kind of prefer
it in pgstat_wal.c, but there are no other SQL functions there.

> - Not sure why we need 0001. Just so that the "estimate" functions in
> 0002 have a convenient "start" point? Surely we could look at the
> current LSNTimeline data and use the oldest value, or (if there's no
> data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline, you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

> - I wonder what happens if we lose the data - we know that if people
> reset statistics for whatever reason (or just lose them because of a
> crash, or because they're on a replica), bad things happen to
> autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crash-safe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic, it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

Andres also brought up at some point: what if the database is simply
turned off for a while and then turned back on? Even if you cleanly
shut down, will there be "gaps" in the timeline? I think that could be
okay, but it is something to think about.

> - What about a SRF function that outputs the whole LSNTimeline? Would
> be useful for debugging / development, I think. (Just a suggestion).

Good idea! I've added this. Though, maybe there was a simpler way to
implement it than the one I chose.

Just a note: all of my comments could use a lot of work, but I want to
get consensus on the algorithm before writing them up properly.

- Melanie

[1] https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-COMMIT-TIMESTAMP