Re: Enhancing Memory Context Statistics Reporting - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Enhancing Memory Context Statistics Reporting
Date
Msg-id 637b3ae3-b9eb-4d83-a942-5f7483911ea4@vondra.me
In response to Re: Enhancing Memory Context Statistics Reporting  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On 11/29/24 00:23, Rahila Syed wrote:
> Hi Tomas,
> 
> Thank you for the review.
> 
> 
> 
>     1) I read through the thread, and in general I agree with the reasoning
>     for removing the file part - it seems perfectly fine to just dump as
>     much as we can fit into a buffer, and then summarize the rest. But do we
>     need to invent a "new" limit here? The other places logging memory
>     contexts do something like this:
> 
>        MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
> 
>     Which means we only print the 100 memory contexts at the top, and that's
>     it. Wouldn't that give us a reasonable memory limit too?
> 
> I think this prints more than 100 memory contexts, since 100 denotes
> the max_level, and contexts at each level could have up to 100
> children. This limit seems much higher than what I am currently
> storing in the DSA, which is approx. 7000 contexts. I will verify
> this again.
>  

Yeah, you may be right. I don't remember what exactly that limit does.

> 
>     2) I see the function got renamed to pg_get_process_memory_contexts(),
>     but the comment still says pg_get_remote_backend_memory_contexts().
> 
> Fixed 
> 
> 
>     3) I don't see any SGML docs for this new function. I was a bit unsure
>     what the "summary" argument is meant to do. The comment does not explain
>     that either.
> 
> Added docs. 
> The intention behind adding a summary argument is to report the
> statistics of contexts at levels 0 and 1, i.e., TopMemoryContext and
> its immediate children.
> 

OK

>     4) I wonder if the function needs to return PID. I mean, the caller
>     knows which PID it is for, so it seems rather unnecessary.
> 
> Perhaps it can be used to ascertain that the information indeed belongs to 
> the requested pid.
> 

I find that a bit ... suspicious. By this logic we'd include the input
parameters in every result, but we don't. So why is this case different?

>     5) In the "summary" mode, it might be useful to include info about how
>     many child contexts were aggregated. It's useful to know whether there
>     was 1 child or 10000 children. In the regular (non-summary) mode it'd
>     always be "1", probably, but maybe it'd interact with the limit in (1).
>     Not sure.
> 
> Sure,  I will add this in the next iteration. 
> 

OK

> 
>     6) I feel a bit uneasy about the whole locking / communication scheme.
>     In particular, I'm worried about lockups, deadlocks, etc. So I decided
>     to do a trivial stress-test - just run the new function through pgbench
>     with many clients.
> 
>     The memstats.sql script does just this:
> 
>       SELECT * FROM pg_get_process_memory_contexts(
>         (SELECT pid FROM pg_stat_activity
>           WHERE pid != pg_backend_pid()
>           ORDER BY random() LIMIT 1)
>         , false);
> 
>     where the inner query just picks a PID for some other backend, and asks
>     for memory context stats for that.
> 
>     And just run it like this on a scale 1 pgbench database:
> 
>       pgbench -n -f memstats.sql -c 10 test
> 
>     And it gets stuck *immediately*. I've seen it to wait for other client
>     backends and auxiliary processes like autovacuum launcher.
> 
>     This is an absolutely idle system; there's no reason why a process
>     would not respond almost immediately.
> 
>  
> In my reproduction, this issue occurred because the process was
> terminated while the requesting backend was waiting on the condition
> variable to be signaled by it. I don't see any solution other than
> having the waiting client backend time out using
> ConditionVariableTimedSleep.
> 
> In the patch, since the timeout was set to a high value, pgbench ended
> up stuck waiting for the timeout to occur. The failure happens less
> frequently after I added an additional check for the process's
> existence, but it cannot be entirely avoided. This is because a
> process can terminate after we check for its existence but before it
> signals the client. In such cases, the client will not receive any
> signal.
> 

Hmmm, I see. I guess there's no way to know if a process responds to us,
but I guess it should be possible to wake up regularly and check if the
process still exists? Wouldn't that solve the case you mentioned?

>     I wonder if e.g. autovacuum launcher may
>     not be handling these requests, or what if client backends can wait in a
>     cycle.
> 
>  
> Did not see a cyclic wait in client backends due to the pgbench stress test.
>  

Not sure, but if I modify the query to only request memory contexts from
non-client processes, i.e.

  SELECT * FROM pg_get_process_memory_contexts(
    (SELECT pid FROM pg_stat_activity
      WHERE pid != pg_backend_pid()
        AND backend_type != 'client backend'
      ORDER BY random() LIMIT 1)
    , false);

then it gets stuck and reports this:

  pgbench -n -f select.sql -c 4 -T 10 test
  pgbench (18devel)
  WARNING:  Wait for 105029 process to publish stats timed out, ...

But process 105029 still very much exists, and it's the checkpointer:

  $ ps ax | grep 105029
  105029 ?        Ss     0:00 postgres: checkpointer

OTOH if I modify the script to only look at client backends, and wait
until the processes get "stuck" (i.e. waiting on the condition variable,
consuming 0% CPU), I get this:

$ pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING:  Wait for 107146 process to publish stats timed out, try again
WARNING:  Wait for 107144 process to publish stats timed out, try again
WARNING:  Wait for 107147 process to publish stats timed out, try again
transaction type: select.sql
...

but when it gets 'stuck', most of the processes are still very much
running (but waiting for contexts from some other process). In the above
example I see this:

 107144 ?        Ss     0:02 postgres: user test [local] SELECT
 107145 ?        Ss     0:01 postgres: user test [local] SELECT
 107147 ?        Ss     0:02 postgres: user test [local] SELECT

So yes, 107146 seems to be gone. But why would that block getting info
from 107144 and 107147?

Maybe that's acceptable, but couldn't this be an issue with short-lived
connections, making it hard to implement the kind of automated
collection of stats that you envision? If it hits these timeouts often,
it'll be hard to collect the info reliably. No?

> 
>       > I opted for DSAs over DSMs to enable memory reuse by freeing
>       > segments for subsequent statistics copies of the same backend,
>       > without needing to recreate DSMs for each request.
> 
>     I feel like this might be a premature optimization - I don't have a
>     clear idea how expensive it is to create DSM per request, but my
>     intuition is that it's cheaper than processing the contexts and
>     generating the info.
> 
>     I'd just remove that, unless someone demonstrates it really matters. I
>     don't really worry about how expensive it is to process a request
>     (within reason, of course) - it will happen only very rarely. It's more
>     important to make sure there's no overhead when no one asks the backend
>     for memory context info, and simplicity.
> 
>     Also, how expensive is it to just keep the DSA "just in case"?
>     Imagine someone asks for the memory context info once - isn't it a
>     waste to still keep the DSA? I don't recall how many resources that
>     could be.
> 
>     I don't have a clear opinion on that, I'm more asking for opinions.
> 
>   
> Imagining a tool that periodically queries the backends for statistics, 
> it would be beneficial to avoid recreating the DSAs for each call.

I think it would be nice if you backed this with some numbers. I mean,
how expensive is it to create/destroy the DSA? How does it compare to
the other stuff this function needs to do?

> Currently, DSAs of size 1MB per process (i.e., a maximum of
> 1MB * (MaxBackends + auxiliary processes)) would be created and pinned
> for subsequent reporting. This size does not seem excessively high,
> even for approx. 100 backends and auxiliary processes.
> 

That seems like a pretty substantial amount of memory reserved for each
connection. IMHO the benefits would have to be pretty significant to
justify this, especially considering it's kept "forever", even if you
run the function only once per day.

> 
>     8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
>     is necessary, I think it should not be hard-coded.
> 
> Not sure which is the ideal value. I changed it to 15 secs and added a
> #define for now. Something that gives the process enough time to
> respond but does not hold up the client for too long would be ideal.
> However, 15 secs seems to be not enough for the GitHub CI tests, which
> fail with a timeout error under this setting.
> 
> PFA an updated patch with the above changes.

Why not make this a parameter of the function? With some sensible
default, but easy to override.


regards

-- 
Tomas Vondra



