Re: Add LSN <-> time conversion functionality - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Add LSN <-> time conversion functionality
Date:
Msg-id: 9eb3132e-eef6-48b4-8244-dad8b29c1953@vondra.me
In response to: Re: Add LSN <-> time conversion functionality (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Add LSN <-> time conversion functionality
List: pgsql-hackers
On 8/9/24 03:02, Melanie Plageman wrote:
> On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 8/7/24 21:39, Melanie Plageman wrote:
>>> On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>>>
>>>> As I mentioned to you off-list, I feel like this needs some sort of
>>>> recency bias. Certainly vacuum, and really almost any conceivable user
>>>> of this facility, is going to care more about accurate answers for new
>>>> data than for old data. If there's no recency bias, then I think that
>>>> eventually answers for more recent LSNs will start to become less
>>>> accurate, since they've got to share the data structure with more and
>>>> more time from long ago. I don't think you've done anything about this
>>>> in this version of the patch, but I might be wrong.
>>>
>>> That makes sense. This version of the patch set doesn't have a recency
>>> bias implementation. I plan to work on it but will need to do the
>>> testing like you mentioned.
>>>
>>
>> I agree that it's likely we probably want more accurate results for
>> recent data, so some recency bias makes sense - for example for the
>> eager vacuuming that's definitely true.
>>
>> But this was initially presented as a somewhat universal LSN/timestamp
>> mapping, and in that case it might make sense to minimize the average
>> error - which I think is what lsntime_to_drop() currently does, by
>> calculating the "area" etc.
>>
>> Maybe it'd be good to approach this from the opposite direction, say
>> what "accuracy guarantees" we want to provide, and then design the
>> structure / algorithm to ensure that. Otherwise we may end up with an
>> infinite discussion about algorithms with unclear idea which one is the
>> best choice.
>>
>> And I'm sure "users" of the LSN/Timestamp mapping may get confused about
>> what to expect, without reasonably clear guarantees.
>>
>> For example, it seems to me a "good" accuracy guarantee would be:
>>
>>   Given a LSN, the age of the returned timestamp is less than 10% off
>>   the actual timestamp. The timestamp precision is in seconds.
>>
>> This means that if LSN was written 100 seconds ago, it would be OK to
>> get an answer in the 90-110 seconds range. For LSN from 1h ago, the
>> acceptable range would be 3600s +/- 360s. And so on. The 10% is just
>> arbitrary, maybe it should be lower - doesn't matter much.
>
> I changed this patch a bit to only provide ranges with an upper and
> lower bound from the SQL callable functions. While the size of the
> range provided could be part of our "accuracy guarantee", I'm not sure
> if we have to provide that.
>

I wouldn't object to providing the timestamp range, along with the
estimate. That seems potentially quite useful for other use cases - it
provides a very clear guarantee.

The thing that concerns me a bit is that maybe it's an implementation
detail. I mean, we might choose to rework the structure in a way that
does not track the ranges like this ... Doesn't seem likely, though.

>> How could we do this? We have 1s precision, so we start with buckets for
>> each second. And we want to allow merging stuff nicely. The smallest
>> merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
>> 1s -> 10s -> 100s -> 1000s instead.
>>
>> So we start with 100x one-second buckets
>>
>> [A_0, A_1, ..., A_99] -> 100 x 1s buckets
>> [B_0, B_1, ..., B_99] -> 100 x 10s buckets
>> [C_0, C_1, ..., C_99] -> 100 x 100s buckets
>> [D_0, D_1, ..., D_99] -> 100 x 1000s buckets
>>
>> We start by adding data into A_k buckets. After filling all 100 of them,
>> we grab the oldest 10 buckets, and combine/move them into B_k. And so
>> on, until B gets full too. Then we grab the 10 oldest B_k entries, and
>> move them into C. And so on. For D the oldest entries would get
>> discarded, or we could add another layer with each bucket representing
>> 10k seconds.
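[Editor's note: the cascading bucket scheme quoted above can be sketched roughly as follows. This is an illustration of the idea only, not code from the patch; the class and all names are invented, and each bucket is simplified to its (start_time, start_lsn) pair.]

```python
# Sketch of the cascading 1s -> 10s -> 100s -> 1000s bucket scheme.
# Illustrative only; not from the patch. A bucket is (start_time, start_lsn).
from collections import deque

class LsnTimeBuckets:
    LEVELS = 4        # 1s, 10s, 100s, 1000s bucket widths
    PER_LEVEL = 100   # 100 buckets per level
    MERGE = 10        # combine 10 old buckets into 1 coarser bucket

    def __init__(self):
        # each level holds (start_time, start_lsn) entries, newest on the right
        self.levels = [deque() for _ in range(self.LEVELS)]

    def add(self, t, lsn):
        """Record the LSN observed at (integer) second t."""
        self.levels[0].append((t, lsn))
        self._cascade()

    def _cascade(self):
        for i in range(self.LEVELS):
            level = self.levels[i]
            if len(level) <= self.PER_LEVEL:
                break
            # grab the 10 oldest buckets; the oldest one's (time, lsn)
            # becomes the start of the merged, coarser bucket
            oldest = [level.popleft() for _ in range(self.MERGE)]
            if i + 1 < self.LEVELS:
                self.levels[i + 1].append(oldest[0])
            # on the last level, the oldest entries are simply discarded
```

After 150 one-second inserts, the finest level holds the newest 100 one-second buckets and the 50 oldest seconds have been folded into five 10-second buckets, matching the "grab the oldest 10 and move them up" description.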
>
> I originally had an algorithm that stored old values somewhat like
> this (each element stored 2x logical members of the preceding
> element). When I was testing algorithms, I abandoned this method
> because it was less accurate than the method which calculates the
> interpolation error "area". But, this would be expected -- it would be
> less accurate for older values.
>
> I'm currently considering an algorithm that uses a combination of the
> interpolation error and the age of the point. I'm thinking of adding
> to or dividing the error of each point by "now - that point's time (or
> lsn)". This would lead me to be more likely to drop points that are
> older.
>
> This is a bit different than "combining" buckets, but it seems like it
> might allow us to drop unneeded recent points when they are very
> regular.
>

TBH I'm a bit lost in how the various patch versions merge the data.
Maybe there is a perfect algorithm, keeping a perfectly accurate
approximation in the smallest space, but does that really matter? If we
needed to keep many instances / very long history, maybe it would
matter. But we need one instance, and we seem to agree it's enough to
have a couple days of history at most. And even the super wasteful
struct I described above would only need ~8kB for that.

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

>> Isn't this a sign this does not quite fit into pgstats? Even if this
>> happens to deal with unsafe restarts, replica promotions and so on, what
>> if the user just does pg_stat_reset? That already causes trouble because
>> we simply forget deleted/updated/inserted tuples. If we also forget data
>> used for freezing heuristics, that does not seem great ...
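[Editor's note: the recency-weighted dropping idea above could be sketched like this. Hypothetical illustration only, using the "divide by age" variant of the two options Melanie mentions; all function names are invented, and the "area" error is simplified to the triangle formed by a point and its two neighbors.]

```python
# Hypothetical sketch of recency-biased point dropping: when the mapping
# is full, drop the interior point whose interpolation-error "area",
# divided by the point's age, is smallest -- so older points with the
# same error are more likely to be dropped than recent ones.

def interpolation_error(prev, pt, nxt):
    """Area-style error: how far pt is from the line prev -> nxt."""
    (t0, l0), (t1, l1), (t2, l2) = prev, pt, nxt
    if t2 == t0:
        return 0.0
    # LSN predicted for t1 by linear interpolation between the neighbors
    predicted = l0 + (l2 - l0) * (t1 - t0) / (t2 - t0)
    return abs(l1 - predicted) * (t2 - t0) / 2.0

def drop_one(points, now):
    """points: list of (time, lsn), oldest first; drops one interior point."""
    def score(i):
        err = interpolation_error(points[i - 1], points[i], points[i + 1])
        age = max(now - points[i][0], 1)
        return err / age          # recency bias: older -> smaller score
    victim = min(range(1, len(points) - 1), key=score)
    del points[victim]
    return points
```

With this weighting, two points that deviate from their neighbors by the same raw "area" no longer tie: the older one gets the smaller score and is evicted first, which is exactly the bias Robert asked for.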
>>
>> Wouldn't it be better to write this into WAL as part of a checkpoint (or
>> something like that?), and make bgwriter to not only add LSN/timestamp
>> into the stream, but also write it into WAL. It's already waking up, on
>> idle systems ~32B written to WAL does not matter, and on busy system
>> it's just noise.
>
> I was imagining adding a new type of WAL record that contains just the
> LSN and time and writing it out in bgwriter. Is that not what you are
> thinking?
>

Sure, I was thinking we would do two things:

1) bgwriter writes the (LSN, timestamp) into WAL, and also updates the
in-memory struct

2) during checkpoint we flush the in-memory struct to disk, so that we
have it after restart / crash

I haven't thought about this very much, but I think this would address
both the crash/recovery/restart on the primary, and on replicas.

regards

--
Tomas Vondra