Re: Enhancing Memory Context Statistics Reporting - Mailing list pgsql-hackers

From Rahila Syed
Subject Re: Enhancing Memory Context Statistics Reporting
Msg-id CAH2L28vAxcePsqV1AbjYeU4QAojyXS5M39d5MVGCSgAcoNybkQ@mail.gmail.com
In response to Re: Enhancing Memory Context Statistics Reporting  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

On Sat, Jan 25, 2025 at 3:50 AM Tomas Vondra <tomas@vondra.me> wrote:


On 1/24/25 14:47, Rahila Syed wrote:
>
> Hi,
>
>
>     Just an idea; as another option, how about blocking new requests to
>     the target process (e.g., causing them to fail with an error or
>     returning NULL with a warning) if a previous request is still pending?
>     Users can simply retry the request if it fails. IMO failing quickly
>     seems preferable to getting stuck for a while in cases with concurrent
>     requests.
>
> Thank you for the suggestion. I agree that it is better to fail
> early and avoid waiting for a timeout in such cases. I will add a
> "pending request" tracker for this in shared memory. This approach
> will help prevent sending a concurrent request if a request for the
> same backend is still being processed.
>

AFAIK these failures should be extremely rare - we're only talking about
that because the workload I used for testing is highly concurrent, i.e.
it requests memory context info extremely often. I doubt anyone sane is
going to do that in practice ...
Yes, that makes sense. 
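As an aside, the "pending request" tracker I mentioned above could be as
simple as one flag per backend in shared memory, so a requester either claims
the slot or fails fast. A rough sketch follows; the struct and function names
are purely illustrative and not taken from the patch:

#include "postgres.h"
#include "port/atomics.h"

typedef struct MemCtxRequestSlot
{
    pg_atomic_flag  request_pending;    /* is a request already in flight? */
} MemCtxRequestSlot;

/*
 * Hypothetical shared array, one entry per backend, each flag initialized
 * at shared memory creation with pg_atomic_init_flag().
 */
static MemCtxRequestSlot *MemCtxRequestSlots;

/*
 * Try to claim the target backend's slot.  Returns false if another request
 * is still pending, so the caller can error out (or warn) immediately
 * instead of blocking until a timeout.
 */
static bool
mcxt_try_start_request(int target_proc_number)
{
    return pg_atomic_test_set_flag(
        &MemCtxRequestSlots[target_proc_number].request_pending);
}

/* Called once the target backend has published its statistics. */
static void
mcxt_finish_request(int target_proc_number)
{
    pg_atomic_clear_flag(
        &MemCtxRequestSlots[target_proc_number].request_pending);
}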
 
> IMO, one downside of throwing an error in such cases is that the
> users might wonder whether they need to take corrective action, even
> though the issue is actually going to resolve itself and they just
> need to retry. Therefore, issuing a warning or displaying previously
> updated statistics might be a better alternative to throwing an
> error.
>

Wouldn't this be mostly mitigated by adding proper detail/hint to the
error message? Sure, the user can always ignore that (especially when
calling this from a script), but well ... we can only do so much.
 
 OK.
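For what it's worth, if we go with an error, something along these lines
should make the "just retry" expectation explicit. The error code and wording
are only a suggestion, and target_pid stands in for the PID passed to the
function; this is not what the patch currently does:

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_IN_USE),
             errmsg("memory context statistics for PID %d are already being collected",
                    (int) target_pid),
             errdetail("Another request for this backend is still being processed."),
             errhint("This condition is transient; retry the function call.")));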

All this makes me think about how we shared pgstat data before the shmem
approach was introduced in PG15. Until then the process signaled pgstat
collector, and the collector wrote the statistics into a file, with a
timestamp. And the process used the timestamp to decide if it's fresh
enough ... Wouldn't the same approach work here?

I imagined it would work something like this:

requesting backend:
-------------------
* set request_ts to current timestamp
* signal the target process, to generate memory context info
* wait until the DSA gets filled with stats_ts > request_ts
* return the data, don't erase anything

target backend
--------------
* clear the signal
* generate the statistics
* set stats_ts to current timestamp
* wake all the backends waiting for the stats (through CV)

I see v11 does almost this, except that it accepts somewhat stale data.
That's correct.
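For clarity, the shared state and the publishing side of that handshake could
look roughly like this. The struct and function names are only illustrative
and are not v11 code; locking around stats_ts is elided for brevity:

#include "postgres.h"
#include "storage/condition_variable.h"
#include "utils/timestamp.h"

typedef struct MemCtxSharedStats
{
    ConditionVariable   stats_cv;   /* broadcast when new stats are published */
    TimestampTz         stats_ts;   /* when the statistics were published */
    /* the statistics themselves live in DSA, referenced from here */
} MemCtxSharedStats;

/*
 * Target backend, after handling the signal and filling the DSA with its
 * memory context statistics.
 */
static void
mcxt_publish_stats(MemCtxSharedStats *shared)
{
    shared->stats_ts = GetCurrentTimestamp();
    ConditionVariableBroadcast(&shared->stats_cv);  /* wake all waiters */
}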
 
But why would that be necessary? I don't think it's needed, and I don't
think we should accept data from before the process sends the signal.

This is done in an attempt to prevent concurrent requests from timing out.
In such cases, the data produced in response to another request is likely to
already be in the dynamic shared memory. Hence, instead of waiting for the
latest data and risking a timeout, the function displays any available
statistics that are newer than a defined threshold. Additionally, since we
can't distinguish between sequential and concurrent requests, we accept
somewhat stale data for all requests.

I realize this approach has some issues, mainly regarding how to determine
an appropriate threshold, i.e. a limit on how stale the data may be.

Therefore, I agree that it makes sense to display only the data that is
published after the request is made. If such data can't be published in time
due to concurrent requests or other delays, the function should detect this
and return as soon as possible.
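Concretely, on the requesting side (continuing the illustrative
MemCtxSharedStats sketch above) the wait could be bounded with
ConditionVariableTimedSleep(), so the function reports the problem and
returns instead of blocking indefinitely. The timeout value, wait event, and
function name below are placeholders, not v11 code:

#include "utils/wait_event.h"

#define MEMCTX_STATS_TIMEOUT_MS 5000    /* give up after 5 seconds */

static bool
mcxt_wait_for_fresh_stats(MemCtxSharedStats *shared, TimestampTz request_ts)
{
    ConditionVariablePrepareToSleep(&shared->stats_cv);

    /* Accept only data published after we signaled the target process. */
    while (shared->stats_ts <= request_ts)
    {
        /* ConditionVariableTimedSleep() returns true if the sleep timed out */
        if (ConditionVariableTimedSleep(&shared->stats_cv,
                                        MEMCTX_STATS_TIMEOUT_MS,
                                        WAIT_EVENT_MQ_INTERNAL))
        {
            ConditionVariableCancelSleep();
            ereport(WARNING,
                    (errmsg("timed out waiting for memory context statistics"),
                     errhint("The target backend may be busy; retry the call.")));
            return false;       /* caller returns NULL right away */
        }
    }

    ConditionVariableCancelSleep();
    return true;                /* fresh statistics are available */
}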


Thank you,
Rahila Syed
