Re: Add LSN <-> time conversion functionality - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Add LSN <-> time conversion functionality
Date:
Msg-id: 9eb3132e-eef6-48b4-8244-dad8b29c1953@vondra.me
In response to: Re: Add LSN <-> time conversion functionality (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Add LSN <-> time conversion functionality
List: pgsql-hackers
On 8/9/24 03:02, Melanie Plageman wrote:
> On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 8/7/24 21:39, Melanie Plageman wrote:
>>> On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>>>
>>>> As I mentioned to you off-list, I feel like this needs some sort of
>>>> recency bias. Certainly vacuum, and really almost any conceivable user
>>>> of this facility, is going to care more about accurate answers for new
>>>> data than for old data. If there's no recency bias, then I think that
>>>> eventually answers for more recent LSNs will start to become less
>>>> accurate, since they've got to share the data structure with more and
>>>> more time from long ago. I don't think you've done anything about this
>>>> in this version of the patch, but I might be wrong.
>>>
>>> That makes sense. This version of the patch set doesn't have a recency
>>> bias implementation. I plan to work on it but will need to do the
>>> testing like you mentioned.
>>>
>>
>> I agree that it's likely we probably want more accurate results for
>> recent data, so some recency bias makes sense - for example for the
>> eager vacuuming that's definitely true.
>>
>> But this was initially presented as a somewhat universal LSN/timestamp
>> mapping, and in that case it might make sense to minimize the average
>> error - which I think is what lsntime_to_drop() currently does, by
>> calculating the "area" etc.
>>
>> Maybe it'd be good to approach this from the opposite direction, say
>> what "accuracy guarantees" we want to provide, and then design the
>> structure / algorithm to ensure that. Otherwise we may end up with an
>> infinite discussion about algorithms with unclear idea which one is the
>> best choice.
>>
>> And I'm sure "users" of the LSN/Timestamp mapping may get confused about
>> what to expect, without reasonably clear guarantees.
>>
>> For example, it seems to me a "good" accuracy guarantee would be:
>>
>>   Given a LSN, the age of the returned timestamp is less than 10% off
>>   the actual timestamp. The timestamp precision is in seconds.
>>
>> This means that if LSN was written 100 seconds ago, it would be OK to
>> get an answer in the 90-110 seconds range. For LSN from 1h ago, the
>> acceptable range would be 3600s +/- 360s. And so on. The 10% is just
>> arbitrary, maybe it should be lower - doesn't matter much.
>
> I changed this patch a bit to only provide ranges with an upper and
> lower bound from the SQL callable functions. While the size of the
> range provided could be part of our "accuracy guarantee", I'm not sure
> if we have to provide that.
>

I wouldn't object to providing the timestamp range, along with the
estimate. That seems potentially quite useful for other use cases - it
provides a very clear guarantee.

The thing that concerns me a bit is that maybe it's an implementation
detail. I mean, we might choose to rework the structure in a way that
does not track the ranges like this ... Doesn't seem likely, though.

>> How could we do this? We have 1s precision, so we start with buckets for
>> each second. And we want to allow merging stuff nicely. The smallest
>> merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
>> 1s -> 10s -> 100s -> 1000s instead.
>>
>> So we start with 100x one-second buckets
>>
>> [A_0, A_1, ..., A_99] -> 100 x 1s buckets
>> [B_0, B_1, ..., B_99] -> 100 x 10s buckets
>> [C_0, C_1, ..., C_99] -> 100 x 100s buckets
>> [D_0, D_1, ..., D_99] -> 100 x 1000s buckets
>>
>> We start by adding data into A_k buckets. After filling all 100 of them,
>> we grab the oldest 10 buckets, and combine/move them into B_k. And so
>> on, until B gets full too. Then we grab the 10 oldest B_k entries, and
>> move them into C. And so on. For D the oldest entries would get
>> discarded, or we could add another layer with each bucket representing
>> 10k seconds.
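[Editor's note: the cascading bucket scheme quoted above can be sketched roughly as follows. This is an illustration of the idea only, not code from the patch; the class and all names are invented, and each bucket is simplified to its (start_time, start_lsn) pair.]

```python
# Sketch of the cascading 1s -> 10s -> 100s -> 1000s bucket scheme.
# Illustrative only; not from the patch. A bucket is (start_time, start_lsn).
from collections import deque

class LsnTimeBuckets:
    LEVELS = 4        # 1s, 10s, 100s, 1000s bucket widths
    PER_LEVEL = 100   # 100 buckets per level
    MERGE = 10        # combine 10 old buckets into 1 coarser bucket

    def __init__(self):
        # each level holds (start_time, start_lsn) entries, newest on the right
        self.levels = [deque() for _ in range(self.LEVELS)]

    def add(self, t, lsn):
        """Record the LSN observed at (integer) second t."""
        self.levels[0].append((t, lsn))
        self._cascade()

    def _cascade(self):
        for i in range(self.LEVELS):
            level = self.levels[i]
            if len(level) <= self.PER_LEVEL:
                break
            # grab the 10 oldest buckets; the oldest one's (time, lsn)
            # becomes the start of the merged, coarser bucket
            oldest = [level.popleft() for _ in range(self.MERGE)]
            if i + 1 < self.LEVELS:
                self.levels[i + 1].append(oldest[0])
            # on the last level, the oldest entries are simply discarded
```

After 150 one-second inserts, the finest level holds the newest 100 one-second buckets and the 50 oldest seconds have been folded into five 10-second buckets, matching the "grab the oldest 10 and move them up" description.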
>
> I originally had an algorithm that stored old values somewhat like
> this (each element stored 2x logical members of the preceding
> element). When I was testing algorithms, I abandoned this method
> because it was less accurate than the method which calculates the
> interpolation error "area". But, this would be expected -- it would be
> less accurate for older values.
>
> I'm currently considering an algorithm that uses a combination of the
> interpolation error and the age of the point. I'm thinking of adding
> to or dividing the error of each point by "now - that point's time (or
> lsn)". This would lead me to be more likely to drop points that are
> older.
>
> This is a bit different than "combining" buckets, but it seems like it
> might allow us to drop unneeded recent points when they are very
> regular.
>

TBH I'm a bit lost in how the various patch versions merge the data.
Maybe there is a perfect algorithm, keeping a perfectly accurate
approximation in the smallest space, but does that really matter? If we
needed to keep many instances / very long history, maybe it would
matter. But we need one instance, and we seem to agree it's enough to
have a couple days of history at most. And even the super wasteful
struct I described above would only need ~8kB for that.

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

>> Isn't this a sign this does not quite fit into pgstats? Even if this
>> happens to deal with unsafe restarts, replica promotions and so on, what
>> if the user just does pg_stat_reset? That already causes trouble because
>> we simply forget deleted/updated/inserted tuples. If we also forget data
>> used for freezing heuristics, that does not seem great ...
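[Editor's note: the recency-weighted dropping idea above could be sketched like this. Hypothetical illustration only, using the "divide by age" variant of the two options Melanie mentions; all function names are invented, and the "area" error is simplified to the triangle formed by a point and its two neighbors.]

```python
# Hypothetical sketch of recency-biased point dropping: when the mapping
# is full, drop the interior point whose interpolation-error "area",
# divided by the point's age, is smallest -- so older points with the
# same error are more likely to be dropped than recent ones.

def interpolation_error(prev, pt, nxt):
    """Area-style error: how far pt is from the line prev -> nxt."""
    (t0, l0), (t1, l1), (t2, l2) = prev, pt, nxt
    if t2 == t0:
        return 0.0
    # LSN predicted for t1 by linear interpolation between the neighbors
    predicted = l0 + (l2 - l0) * (t1 - t0) / (t2 - t0)
    return abs(l1 - predicted) * (t2 - t0) / 2.0

def drop_one(points, now):
    """points: list of (time, lsn), oldest first; drops one interior point."""
    def score(i):
        err = interpolation_error(points[i - 1], points[i], points[i + 1])
        age = max(now - points[i][0], 1)
        return err / age          # recency bias: older -> smaller score
    victim = min(range(1, len(points) - 1), key=score)
    del points[victim]
    return points
```

With this weighting, two points that deviate from their neighbors by the same raw "area" no longer tie: the older one gets the smaller score and is evicted first, which is exactly the bias Robert asked for.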
>>
>> Wouldn't it be better to write this into WAL as part of a checkpoint (or
>> something like that?), and make bgwriter to not only add LSN/timestamp
>> into the stream, but also write it into WAL. It's already waking up, on
>> idle systems ~32B written to WAL does not matter, and on busy system
>> it's just noise.
>
> I was imagining adding a new type of WAL record that contains just the
> LSN and time and writing it out in bgwriter. Is that not what you are
> thinking?
>

Sure, I was thinking we would do two things:

1) bgwriter writes the (LSN, timestamp) into WAL, and also updates the
in-memory struct

2) during checkpoint we flush the in-memory struct to disk, so that we
have it after restart / crash

I haven't thought about this very much, but I think this would address
both the crash/recovery/restart on the primary, and on replicas.

regards

--
Tomas Vondra