Thread: Re: Enhancing Memory Context Statistics Reporting

Re: Enhancing Memory Context Statistics Reporting

From
Alvaro Herrera
Date:
On 2024-Nov-14, Michael Paquier wrote:

> Already mentioned previously at [1] and echoing some surrounding
> arguments, but I'd suggest keeping it simple and just removing entirely
> the part of the patch where the stats information gets spilled to
> disk.  With information for 6000-ish contexts available within the hard
> limit, there should be plenty enough to know what's going on anyway.

Functionally-wise I don't necessarily agree with _removing_ the spill
code, considering that production systems with thousands of tables would
easily reach that number of contexts (each index gets its own index info
context, each regexp gets its own memcxt); and I don't think silently
omitting a fraction of people's memory situation (or erroring out if the
case is hit) is going to make us any friends.

That said, it worries me that we choose a shared memory size so large
that it becomes impractical to hit the spill-to-disk code in regression
testing.  Maybe we can choose a much smaller limit size when
USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
That way, we know the code is being hit and tested, without imposing a
huge memory consumption on test machines.
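
Roughly what I'm thinking of, as a sketch (the constant names here are
invented, not anything the patch actually uses):

```
/*
 * Sketch only: use a much smaller per-process stats buffer in
 * assert-enabled builds, so the "too many contexts" path is exercised
 * by the regression tests.  MEMCTX_STATS_SHMEM_SIZE is an invented name.
 */
#ifdef USE_ASSERT_CHECKING
#define MEMCTX_STATS_SHMEM_SIZE		(8 * 1024)		/* small, easy to overflow */
#else
#define MEMCTX_STATS_SHMEM_SIZE		(1024 * 1024)	/* the "normal" limit */
#endif
```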

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)



Re: Enhancing Memory Context Statistics Reporting

From
Ashutosh Bapat
Date:
On Wed, Nov 20, 2024 at 2:39 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Hi,
>
>> To achieve both completeness and avoid writing to a file, I can consider
>> displaying the numbers for the remaining contexts as a cumulative total
>> at the end of the output.
>>
>> Something like follows:
>> ```
>> postgres=# select * from pg_get_process_memory_contexts('237244', false);
>>              name              | ident |   type   | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
>> -------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
>>  TopMemoryContext              |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
>>  search_path processing cache  |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244
>> Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11,111 (free_bytes)
>>
>> ```
>
>
> Please find attached an updated patch with this change. The file previously used to
> store spilled statistics has been removed. Instead, a cumulative total of the
> remaining/spilled context statistics is now stored in the DSM segment, which is
> displayed as follows.
>
> postgres=# select * from pg_get_process_memory_contexts('352966', false);
>              name             | ident |   type   |  path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
> ------------------------------+-------+----------+--------+-------------+---------------+------------+-------------+------------+--------
>  TopMemoryContext             |       | AllocSet | {0}    |       97696 |             5 |      14288 |          11 |      83408 | 352966
> .
> .
> .
>  MdSmgr                       |       | AllocSet | {0,18} |        8192 |             1 |       7424 |           0 |        768 | 352966
>  Remaining Totals             |       |          |        |     1756016 |           188 |     658584 |         132 |    1097432 | 352966
> (7129 rows)
> -----
>
> I believe this serves as a good compromise between completeness
> and avoiding the overhead of file handling. However, I am open to
> reintroducing file handling if displaying the complete statistics of the
> remaining contexts proves to be more important.
>
> All the known bugs in the patch have been fixed.
>
> In summary, one DSA  per PostgreSQL process is used to share
> the statistics of that process. A DSA is created by the first client
> backend that requests memory context statistics, and it is pinned
> for all future requests to that process.
> A handle to this DSA is shared between the client and the publishing
> process using fixed shared memory. The fixed shared memory consists
> of an array of size MaxBackends + auxiliary processes, indexed
> by procno. Each element in this array is less than 100 bytes in size.
>
> A PostgreSQL process uses a condition variable to signal a waiting client
> backend once it has finished publishing the statistics. If, for some reason,
> the signal is not sent, the waiting client backend will time out.
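
Before asking a couple of questions, here is my mental model of the
per-process slot you describe; the field names below are my guesses, not
necessarily what the patch uses:

```
/* One slot per backend/auxiliary process, indexed by pgprocno (guessed layout). */
typedef struct MemCtxStatsSlot
{
	dsa_handle		handle;	/* DSA that holds the published statistics */
	dsa_pointer		stats;	/* location of the stats array inside the DSA */
	ConditionVariable cv;	/* publisher signals, requester waits on this */
	LWLock			lock;	/* protects the published data */
	pid_t			pid;	/* which process the data belongs to */
} MemCtxStatsSlot;

/* Fixed shared memory: MaxBackends + auxiliary-process entries. */
extern MemCtxStatsSlot *memCtxStatsSlots;
```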

How does the process know that the client backend has finished reading
stats and it can be refreshed? What happens, if the next request for
memory context stats comes before first requester has consumed the
statistics it requested?

Does the shared memory get deallocated when the backend which
allocated it exits?

>
> When statistics of the local backend are requested, this function returns the following
> WARNING and exits, since that case can be handled by an existing function which
> doesn't require a DSA.
>
> WARNING:  cannot return statistics for local backend
> HINT:  Use pg_get_backend_memory_contexts instead

How about using pg_get_backend_memory_contexts() for both - the local as
well as other backends? Let the PID argument default to NULL, which would
indicate the local backend, and otherwise identify some other backend?

--
Best Wishes,
Ashutosh Bapat



Re: Enhancing Memory Context Statistics Reporting

From
Ashutosh Bapat
Date:
On Fri, Nov 22, 2024 at 6:33 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Hi,
>
>> How does the process know that the client backend has finished reading
>> stats and it can be refreshed? What happens, if the next request for
>> memory context stats comes before first requester has consumed the
>> statistics it requested?
>>
> A process that's copying its statistics does not need to know that.
> Whenever it receives a signal to copy statistics, it goes ahead and
> copies the latest statistics to the DSA after acquiring an exclusive
> lwlock.
>
> A requestor takes a lock before it starts consuming the statistics.
> If the next request comes while the first requestor is consuming the
> statistics, the publishing process will wait on lwlock to be released
> by the consuming process before it can write the statistics.
> If the next request arrives before the first requester begins consuming
> the statistics, the publishing process will acquire the lock and overwrite
> the earlier statistics with the most recent ones.
> As a result, both the first and second requesters will consume the
> updated statistics.

IIUC, the publisher and the consumer processes both use the same
LWLock. The publisher acquires an exclusive lock. Does the consumer
acquire a SHARED lock?

The publisher process might be in a transaction, processing a query or
doing something else. If it has to wait for an LWLock, that may affect its
performance. This will become even more visible if the client backend
is trying to diagnose a slow-running query. Have we tried to measure
how long the publisher might have to wait for an LWLock while the
consumer is consuming statistics, or what the impact of this wait is?
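
Even something crude like the following around the publisher's lock
acquisition would tell us how long a busy backend can be stalled (the
lock variable name is invented, this is just an instrumentation sketch):

```
/* Crude instrumentation sketch; memctx_slot_lock is an invented name. */
instr_time	lock_start, lock_end;

INSTR_TIME_SET_CURRENT(lock_start);
LWLockAcquire(memctx_slot_lock, LW_EXCLUSIVE);
INSTR_TIME_SET_CURRENT(lock_end);
INSTR_TIME_SUBTRACT(lock_end, lock_start);

elog(DEBUG1, "memory context publish: waited %.3f ms for slot lock",
	 INSTR_TIME_GET_MILLISEC(lock_end));
```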

>> >
>> > When statistics of a local backend is requested, this function returns the following
>> > WARNING and exits, since this can be handled by an existing function which
>> > doesn't require a DSA.
>> >
>> > WARNING:  cannot return statistics for local backend
>> > HINT:  Use pg_get_backend_memory_contexts instead
>>
>> How about using pg_get_backend_memory_contexts() for both - local as
>> well as other backend? Let PID argument default to NULL which would
>> indicate local backend, otherwise some other backend?
>>
> I don't see much value in combining the two, especially since with
> pg_get_process_memory_contexts() we can query both a postgres
> backend and a background process; the name pg_get_backend_memory_contexts()
> would be inaccurate, and I am not sure whether a change to rename the
> existing function would be welcome.

Having two separate functions for the same functionality isn't a
friendly user interface.

I played a bit with pg_terminate_backend(), which is another function
dealing with backends, to understand a. what it does to its own
backend and b. which processes are considered backends.

1. pg_terminate_backend() allows terminating the backend from which
it is fired.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

2. It considers autovacuum launcher and logical replication launcher
as postgres backends but not checkpointer, background writer and
walwriter.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity where pid <> pg_backend_pid();
WARNING:  PID 644887 is not a PostgreSQL backend process
WARNING:  PID 644888 is not a PostgreSQL backend process
WARNING:  PID 644890 is not a PostgreSQL backend process
  pid   | application_name |         backend_type         | pg_terminate_backend
--------+------------------+------------------------------+----------------------
 645636 |                  | autovacuum launcher          | t
 645677 |                  | logical replication launcher | t
 644887 |                  | checkpointer                 | f
 644888 |                  | background writer            | f
 644890 |                  | walwriter                    | f
(5 rows)

In that sense you are correct that pg_get_backend_memory_contexts()
should not provide context information for the WAL writer process, for
example. But pg_get_process_memory_contexts() would be expected to
provide its own memory context information instead of redirecting to
another function through a WARNING. It could do that redirection
itself. That would also prevent the functions' output formats from going
out of sync.
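
Something like this, as a rough sketch (it assumes the two functions can
share an output tuple descriptor, and the name of the remote helper is
made up):

```
Datum
pg_get_process_memory_contexts(PG_FUNCTION_ARGS)
{
	int			pid = PG_GETARG_INT32(0);
	bool		get_summary = PG_GETARG_BOOL(1);

	if (pid == MyProcPid)
	{
		/* Local backend: just reuse the existing code path. */
		return pg_get_backend_memory_contexts(fcinfo);
	}

	/* Otherwise go through the new shared-memory mechanism. */
	return pg_get_remote_memory_contexts(pid, get_summary, fcinfo);
}
```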

--
Best Wishes,
Ashutosh Bapat



Re: Enhancing Memory Context Statistics Reporting

From
Tomas Vondra
Date:
Hi,

I took a quick look at the patch today. Overall, I think this would be
very useful; I've repeatedly needed to inspect why a backend uses so
much memory, and I ended up triggering MemoryContextStats() from gdb.
This would be more convenient / safer. So +1 to the patch intent.


A couple review comments:

1) I read through the thread, and in general I agree with the reasoning
for removing the file part - it seems perfectly fine to just dump as
much as we can fit into a buffer, and then summarize the rest. But do we
need to invent a "new" limit here? The other places logging memory
contexts do something like this:

   MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);

Which means we only print the 100 memory contexts at the top, and that's
it. Wouldn't that give us a reasonable memory limit too?


2) I see the function got renamed to pg_get_process_memory_contexts(),
but the comment still says pg_get_remote_backend_memory_contexts().


3) I don't see any SGML docs for this new function. I was a bit unsure
what the "summary" argument is meant to do. The comment does not explain
that either.


4) I wonder if the function needs to return PID. I mean, the caller
knows which PID it is for, so it seems rather unnecessary.


5) In the "summary" mode, it might be useful to include info about how
many child contexts were aggregated. It's useful to know whether there
was 1 child or 10000 children. In the regular (non-summary) mode it'd
always be "1", probably, but maybe it'd interact with the limit in (1).
Not sure.


6) I feel a bit uneasy about the whole locking / communication scheme.
In particular, I'm worried about lockups, deadlocks, etc. So I decided
to do a trivial stress-test - just run the new function through pgbench
with many clients.

The memstats.sql script does just this:

  SELECT * FROM pg_get_process_memory_contexts(
    (SELECT pid FROM pg_stat_activity
      WHERE pid != pg_backend_pid()
      ORDER BY random() LIMIT 1)
    , false);

where the inner query just picks a PID for some other backend, and asks
for memory context stats for that.

And just run it like this on a scale 1 pgbench database:

  pgbench -n -f memstats.sql -c 10 test

And it gets stuck *immediately*. I've seen it wait for other client
backends and auxiliary processes like the autovacuum launcher.

This is an absolutely idle system; there's no reason why a process would
not respond almost immediately. I wonder if e.g. the autovacuum launcher may
not be handling these requests, or whether client backends can wait in a
cycle. IIRC condition variables are not covered by the deadlock detector,
so that would be an issue. But maybe I remember wrong?


7) I've also seen this error:

  pgbench: error: client 6 script 0 aborted in command 0 query 0: \
  ERROR:  can't attach the same segment more than once

I haven't investigated it, but it seems like a problem handling errors,
where we fail to detach from a segment after a timeout. I may be wrong,
but it might be related to this:

  > I opted for DSAs over DSMs to enable memory reuse by freeing
  > segments for subsequent statistics copies of the same backend,
  > without needing to recreate DSMs for each request.

I feel like this might be a premature optimization - I don't have a
clear idea how expensive it is to create DSM per request, but my
intuition is that it's cheaper than processing the contexts and
generating the info.

I'd just remove that, unless someone demonstrates it really matters. I
don't really worry about how expensive it is to process a request
(within reason, of course) - it will happen only very rarely. It's more
important to make sure there's no overhead when no one asks the backend
for memory context info, and to keep things simple.

Also, how expensive is it to just keep the DSA "just in case"? Imagine
someone asks for the memory context info once - isn't it a waste to still
keep the DSA? I don't recall how many resources that could be.

I don't have a clear opinion on that, I'm more asking for opinions.


8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
is necessary, I think it should not be hard-coded.


regards

-- 
Tomas Vondra




Re: Enhancing Memory Context Statistics Reporting

From
Tomas Vondra
Date:
On 11/29/24 00:23, Rahila Syed wrote:
> Hi Tomas,
> 
> Thank you for the review.
> 
> 
> 
>     1) I read through the thread, and in general I agree with the reasoning
>     for removing the file part - it seems perfectly fine to just dump as
>     much as we can fit into a buffer, and then summarize the rest. But do we
>     need to invent a "new" limit here? The other places logging memory
>     contexts do something like this:
> 
>        MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
> 
>     Which means we only print the 100 memory contexts at the top, and that's
>     it. Wouldn't that give us a reasonable memory limit too?
> 
> I think this prints more than 100 memory contexts, since 100 denotes the
> max_level, and contexts at each level could have up to 100 children. This
> limit seems much higher than what I am currently storing in the DSA, which
> is approx. 7000 contexts.  I will verify this again.
>  

Yeah, you may be right. I don't remember what exactly that limit does.

> 
>     2) I see the function got renamed to pg_get_process_memory_contexts(),
>     bu the comment still says pg_get_remote_backend_memory_contexts().
> 
> Fixed 
> 
> 
>     3) I don't see any SGML docs for this new function. I was a bit unsure
>     what the "summary" argument is meant to do. The comment does not explain
>     that either.
> 
> Added docs.
> The intention behind adding a summary argument is to report statistics of
> contexts at levels 0 and 1, i.e. TopMemoryContext and its immediate children.
> 

OK

>     4) I wonder if the function needs to return PID. I mean, the caller
>     knows which PID it is for, so it seems rather unnecessary.
> 
> Perhaps it can be used to ascertain that the information indeed belongs to 
> the requested pid.
> 

I find that a bit ... suspicious. By this logic we'd include the input
parameters in every result, but we don't. So why is this case different?

>     5) In the "summary" mode, it might be useful to include info about how
>     many child contexts were aggregated. It's useful to know whether there
>     was 1 child or 10000 children. In the regular (non-summary) mode it'd
>     always be "1", probably, but maybe it'd interact with the limit in (1).
>     Not sure.
> 
> Sure,  I will add this in the next iteration. 
> 

OK

> 
>     6) I feel a bit uneasy about the whole locking / communication scheme.
>     In particular, I'm worried about lockups, deadlocks, etc. So I decided
>     to do a trivial stress-test - just run the new function through pgbench
>     with many clients.
> 
>     The memstats.sql script does just this:
> 
>       SELECT * FROM pg_get_process_memory_contexts(
>         (SELECT pid FROM pg_stat_activity
>           WHERE pid != pg_backend_pid()
>           ORDER BY random() LIMIT 1)
>         , false);
> 
>     where the inner query just picks a PID for some other backend, and asks
>     for memory context stats for that.
> 
>     And just run it like this on a scale 1 pgbench database:
> 
>       pgbench -n -f memstats.sql -c 10 test
> 
>     And it gets stuck *immediately*. I've seen it to wait for other client
>     backends and auxiliary processes like autovacuum launcher.
> 
>     This is absolutely idle system, there's no reason why a process would
>     not respond almost immediately.
> 
>  
> In my reproduction, this issue occurred because the process was terminated 
> while the requesting backend was waiting on the condition variable to be 
> signaled by it. I don’t see any solution other than having the waiting
> client 
> backend timeout using ConditionVariableTimedSleep.
> 
> In the patch, since the timeout was set to a high value, pgbench ended
> up stuck 
> waiting for the timeout to occur. The failure happens less frequently
> after I added an
> additional check for the process's existence, but it cannot be entirely 
> avoided. This is because a process can terminate after we check for its
> existence but 
> before it signals the client. In such cases, the client will not receive
> any signal.
> 

Hmmm, I see. I guess there's no way to know whether a process will respond
to us, but it should be possible to wake up regularly and check if the
process still exists? Wouldn't that solve the case you mentioned?
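
I'm thinking of a loop roughly like this (very much a sketch: the slot,
the readiness test and the wait event are made-up names, and a real check
would probably look at the PGPROC entry rather than use kill()):

```
/* Wait for the stats to be published, but poll the publisher's existence. */
for (;;)
{
	if (memctx_stats_ready(slot))		/* made-up "data published" test */
		break;

	if (ConditionVariableTimedSleep(&slot->cv, 5000 /* ms */,
									WAIT_EVENT_MEMCTX_PUBLISH))
	{
		/* Timed out on this slice - is the publisher still alive? */
		if (kill(pid, 0) != 0)
		{
			ereport(WARNING,
					(errmsg("process with PID %d is no longer running", pid)));
			break;
		}
	}
}
ConditionVariableCancelSleep();
```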

>     I wonder if e.g. autovacuum launcher may
>     not be handling these requests, or what if client backends can wait in a
>     cycle.
> 
>  
> Did not see a cyclic wait in client backends due to the pgbench stress test.
>  

Not sure, but if I modify the query to only request memory contexts from
non-client processes, i.e.

  SELECT * FROM pg_get_process_memory_contexts(
    (SELECT pid FROM pg_stat_activity
      WHERE pid != pg_backend_pid()
        AND backend_type != 'client backend'
      ORDER BY random() LIMIT 1)
    , false);

then it gets stuck and reports this:

  pgbench -n -f select.sql -c 4 -T 10 test
  pgbench (18devel)
  WARNING:  Wait for 105029 process to publish stats timed out, ...

But process 105029 still very much exists, and it's the checkpointer:

  $ ps ax | grep 105029
  105029 ?        Ss     0:00 postgres: checkpointer

OTOH if I modify the script to only look at client backends, and wait
until the processes get "stuck" (i.e. waiting on the condition variable,
consuming 0% CPU), I get this:

$ pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING:  Wait for 107146 process to publish stats timed out, try again
WARNING:  Wait for 107144 process to publish stats timed out, try again
WARNING:  Wait for 107147 process to publish stats timed out, try again
transaction type: select.sql
...

but when it gets 'stuck', most of the processes are still very much
running (but waiting for contexts from some other process). In the above
example I see this:

 107144 ?        Ss     0:02 postgres: user test [local] SELECT
 107145 ?        Ss     0:01 postgres: user test [local] SELECT
 107147 ?        Ss     0:02 postgres: user test [local] SELECT

So yes, 107146 seems to be gone. But why would that block getting info
from 107144 and 107147?

Maybe that's acceptable, but couldn't this be an issue with short-lived
connections, making it hard to implement the kind of automated
collection of stats that you envision? If it hits this kind of timeout
often, it'll make it hard to reliably collect info. No?

> 
>       > I opted for DSAs over DSMs to enable memory reuse by freeing
>       > segments for subsequent statistics copies of the same backend,
>       > without needing to recreate DSMs for each request.
> 
>     I feel like this might be a premature optimization - I don't have a
>     clear idea how expensive it is to create DSM per request, but my
>     intuition is that it's cheaper than processing the contexts and
>     generating the info.
> 
>     I'd just remove that, unless someone demonstrates it really matters. I
>     don't really worry about how expensive it is to process a request
>     (within reason, of course) - it will happen only very rarely. It's more
>     important to make sure there's no overhead when no one asks the backend
>     for memory context info, and simplicity.
> 
>     Also, how expensive it is to just keep the DSA "just in case"? Imagine
>     someone asks for the memory context info once - isn't it a was to still
>     keep the DSA? I don't recall how much resources could that be.
> 
>     I don't have a clear opinion on that, I'm more asking for opinions.
> 
>   
> Imagining a tool that periodically queries the backends for statistics, 
> it would be beneficial to avoid recreating the DSAs for each call.

I think it would be nice if you backed this with some numbers. I mean,
how expensive is it to create/destroy the DSA? How does it compare to
the other stuff this function needs to do?

> Currently,  DSAs of size 1MB per process 
> (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes)) 
> would be created and pinned for subsequent reporting. This size does 
> not seem excessively high, even for approx 100 backends and 
> auxiliary processes. 
> 

That seems like a pretty substantial amount of memory reserved for each
connection. IMHO the benefits would have to be pretty significant to
justify this, especially considering it's kept "forever", even if you
run the function only once per day.

> 
>     8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
>     is necessary, I think it should not be hard-coded.
> 
> Not sure which is the ideal value. Changed it to 15 secs and added a
> #define as of now.
> Something that gives enough time for the process to respond but
> does not hold up the client for too long would be ideal. 15 secs seems
> not to be enough for the github CI tests, which fail with a timeout error
> with this setting.
> 
> PFA an updated patch with the above changes.

Why not make this a parameter of the function? With some sensible
default, but easy to override.
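
For instance, something along these lines (the argument position, the
default constant and the wait event name are all just illustrative):

```
/* Read the wait timeout from an optional function argument. */
int		timeout_ms = PG_ARGISNULL(2) ? MEMCTX_DEFAULT_TIMEOUT_MS	/* invented default */
									 : PG_GETARG_INT32(2);

(void) ConditionVariableTimedSleep(&slot->cv, timeout_ms,
								   WAIT_EVENT_MEMCTX_PUBLISH);
```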


regards

-- 
Tomas Vondra




Re: Enhancing Memory Context Statistics Reporting

From
Tomas Vondra
Date:
On 12/3/24 20:09, Rahila Syed wrote:
> 
> Hi,
>  
> 
> 
> 
>     >     4) I wonder if the function needs to return PID. I mean, the
>     caller
>     >     knows which PID it is for, so it seems rather unnecessary.
>     >
>     > Perhaps it can be used to ascertain that the information indeed
>     belongs to 
>     > the requested pid.
>     >
> 
>     I find that a bit ... suspicious. By this logic we'd include the input
>     parameters in every result, but we don't. So why is this case different?
> 
>  
> This was added to address a review suggestion. I had left it in case
> anyone found it useful 
> for verification. 
> Previously, I included a check for scenarios where multiple processes
> could write to the same 
> shared memory. Now, each process has a separate shared memory space
> identified by 
> pgprocno, making it highly unlikely for the receiving process to see
> another process's memory 
> dump.
> Such a situation could theoretically occur if another process were
> mapped to the same 
> pgprocno, although I’m not sure how likely that is. That said, I’ve
> added a check in the receiver
> to ensure the PID written in the shared memory matches the PID for which
> the dump is 
> requested. 
> This guarantees that a user will never see the memory dump of another
> process.
> Given this, I’m fine with removing the pid column if it helps to make
> the output more readable.
> 

I'd just remove that. I agree it might have been useful with the single
chunk of shared memory, but I think with separate chunks it's not very
useful. And if we can end up with multiple processes getting the same
pgprocno I guess we have way bigger problems; this won't fix that.

>     >     5) In the "summary" mode, it might be useful to include info
>     about how
>     >     many child contexts were aggregated. It's useful to know
>     whether there
>     >     was 1 child or 10000 children. In the regular (non-summary)
>     mode it'd
>     >     always be "1", probably, but maybe it'd interact with the
>     limit in (1).
>     >     Not sure.
>     >
>     > Sure,  I will add this in the next iteration. 
>     >
> 
>     OK
> 
>  
> I have added this information as a column named "num_agg_contexts",
> which indicates 
> the number of contexts whose statistics have been aggregated/added for a
> particular output.
> 
> In summary mode, all the child contexts of a given level-1 context are
> aggregated, and 
> their statistics are presented as part of the parent context's
> statistics. In this case, 
> num_agg_contexts  provides the count of all child contexts under a given
> level-1 context.
> 
> In regular (non-summary) mode, this column shows a value of 1, meaning
> the statistics 
> correspond to a single context, with all context statistics displayed
> individually. In this mode
> an aggregate result is displayed if the number of contexts exceeds the
> DSA size limit. In this case num_agg_contexts will display the number of
> the remaining contexts.
> 

OK

>     >      
>     > In the patch, since the timeout was set to a high value, pgbench ended
>     > up stuck 
>     > waiting for the timeout to occur. The failure happens less frequently
>     > after I added an
>     > additional check for the process's existence, but it cannot be
>     entirely 
>     > avoided. This is because a process can terminate after we check
>     for its
>     > existence but 
>     > before it signals the client. In such cases, the client will not
>     receive
>     > any signal.
>     >
> 
>     Hmmm, I see. I guess there's no way to know if a process responds to us,
>     but I guess it should be possible to wake up regularly and check if the
>     process still exists? Wouldn't that solve the case you mentioned?
> 
> I have fixed it accordingly in the attached patch by waking up after
> every 5 seconds to check if the process exists and sleeping again if the
> wake-up condition is not satisfied.  The number of such tries is limited
> to 20, so the total wait time can be 100 seconds. I will make the retries
> configurable, in line with your suggestion to be able to override the
> default waiting time.
>  

Makes sense, although 100 seconds seems a bit odd; we usually pick
"natural" values like 60s, or multiples of that. But if it's
configurable, that's not a huge issue.

Could the process wake up earlier than the timeout, say if it gets an EINTR
signal? That'd break the "total timeout is 100 seconds", and it would be
better to check that explicitly. Not sure if this can happen, though.

One thing I'd maybe consider is starting with a short timeout, and
gradually increasing it up to e.g. 5 seconds (or maybe just 1 second
would be perfectly fine, IMHO). With the current coding it means we
either get the response right away, or wait 5+ seconds. That's a huge
jump. If we start e.g. with 10ms, and then gradually multiply it by
1.2, it means we only wait "0-20% extra" on average.

But perhaps this is very unlikely and not worth the complexity.
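
The loop I have in mind would be something like this (constants purely
illustrative, and the readiness test is a made-up name):

```
/* Start with a short sleep slice and back off geometrically, capped at 1s. */
long	sleep_ms = 10;

while (!memctx_stats_ready(slot))		/* made-up "data published" test */
{
	if (ConditionVariableTimedSleep(&slot->cv, sleep_ms,
									WAIT_EVENT_MEMCTX_PUBLISH))
		sleep_ms = Min((long) (sleep_ms * 1.2) + 1, 1000);
}
ConditionVariableCancelSleep();
```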

> 
>     >     I wonder if e.g. autovacuum launcher may
>     >     not be handling these requests, or what if client backends can
>     wait in a
>     >     cycle.
>     >
>     >  
>     > Did not see a cyclic wait in client backends due to the pgbench
>     stress test.
>     >  
> 
>     Not sure, but if I modify the query to only request memory contexts from
>     non-client processes, i.e.
> 
>       SELECT * FROM pg_get_process_memory_contexts(
>         (SELECT pid FROM pg_stat_activity
>           WHERE pid != pg_backend_pid()
>             AND backend_type != 'client backend'
>           ORDER BY random() LIMIT 1)
>         , false);
> 
>     then it gets stuck and reports this:
> 
>       pgbench -n -f select.sql -c 4 -T 10 test
>       pgbench (18devel)
>       WARNING:  Wait for 105029 process to publish stats timed out, ...
> 
>     But process 105029 still very much exists, and it's the checkpointer:
> 
> In the case of checkpointer, I also see some wait time after running the
> tests that you mentioned, but it eventually completes the request in my
> runs.
>  

OK, but why should it even wait that long? Surely the checkpointer
should be able to report memory contexts too?

> 
>       $ ps ax | grep 105029
>       105029 ?        Ss     0:00 postgres: checkpointer
> 
>     OTOH if I modify the script to only look at client backends, and wait
>     until the processes get "stuck" (i.e. waiting on the condition variable,
>     consuming 0% CPU), I get this:
> 
>     $ pgbench -n -f select.sql -c 4 -T 10 test
>     pgbench (18devel)
>     WARNING:  Wait for 107146 process to publish stats timed out, try again
>     WARNING:  Wait for 107144 process to publish stats timed out, try again
>     WARNING:  Wait for 107147 process to publish stats timed out, try again
>     transaction type: select.sql
>     ...
> 
>     but when it gets 'stuck', most of the processes are still very much
>     running (but waiting for contexts from some other process). In the above
>     example I see this:
> 
>      107144 ?        Ss     0:02 postgres: user test [local] SELECT
>      107145 ?        Ss     0:01 postgres: user test [local] SELECT
>      107147 ?        Ss     0:02 postgres: user test [local] SELECT
> 
>     So yes, 107146 seems to be gone. But why would that block getting info
>     from 107144 and 107147?
> 
> Most likely 107144 and/or 107147 must also be waiting for 107146 which is
> gone. Something like 107144 -> 107147 -> 107146 (dead), or
> 107144 -> 107146 (dead) and 107147 -> 107146 (dead).
> 

I think I forgot to mention only 107145 was waiting for 107146 (dead),
and the other processes were waiting for 107145 in some way. But yeah,
detecting the dead process would improve this, although it also shows
the issues can "spread" easily.

OTOH it's unlikely to have multiple pg_get_process_memory_contexts()
queries pointing at each other like this - monitoring will just do that
from one backend, and that's it. So not a huge issue.

> 
>     Maybe that's acceptable, but couldn't this be an issue with short-lived
>     connections, making it hard to implement the kind of automated
>     collection of stats that you envision. If it hits this kind of timeouts
>     often, it'll make it hard to reliably collect info. No?
> 
> 
> Yes, if there is a chain of waiting clients due to a process no longer
> existing, the waiting time to receive information will increase. However,
> as long as a failed request caused by a non-existent process is detected
> promptly, the wait time should remain manageable, allowing other waiting
> clients to obtain the requested information from the existing processes.
> 
> In such cases, it might be necessary to experiment with the waiting
> times at the receiving 
> client. Making the waiting time user-configurable, as you suggested, by
> passing it as an 
> argument to the function, could help address this scenario.
> Thanks for highlighting this, I will test this some more. 
>  

I think we should try very hard to make this work well without the user
having to mess with the timeouts. These are exceptional conditions that
happen only very rarely, which makes it hard to find good values.

> 
>     >
>     >       > I opted for DSAs over DSMs to enable memory reuse by freeing
>     >       > segments for subsequent statistics copies of the same backend,
>     >       > without needing to recreate DSMs for each request.
>     >
>     >     I feel like this might be a premature optimization - I don't
>     have a
>     >     clear idea how expensive it is to create DSM per request, but my
>     >     intuition is that it's cheaper than processing the contexts and
>     >     generating the info.
>     >
>     >     I'd just remove that, unless someone demonstrates it really
>     matters. I
>     >     don't really worry about how expensive it is to process a request
>     >     (within reason, of course) - it will happen only very rarely.
>     It's more
>     >     important to make sure there's no overhead when no one asks
>     the backend
>     >     for memory context info, and simplicity.
>     >
>     >     Also, how expensive it is to just keep the DSA "just in case"?
>     Imagine
>     >     someone asks for the memory context info once - isn't it a was
>     to still
>     >     keep the DSA? I don't recall how much resources could that be.
>     >
>     >     I don't have a clear opinion on that, I'm more asking for
>     opinions.
>     >
>     >   
>     > Imagining a tool that periodically queries the backends for
>     statistics, 
>     > it would be beneficial to avoid recreating the DSAs for each call.
> 
>     I think it would be nice if you backed this with some numbers. I mean,
>     how expensive is it to create/destroy the DSA? How does it compare to
>     the other stuff this function needs to do?
> 
> After instrumenting the code with timestamps, I observed that DSA creation 
> accounts for approximately 17% to 26% of the total execution time of the
> function 
> pg_get_process_memory_contexts().
> 
>     > Currently,  DSAs of size 1MB per process 
>     > (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes)) 
>     > would be created and pinned for subsequent reporting. This size does 
>     > not seem excessively high, even for approx 100 backends and 
>     > auxiliary processes. 
>     >
> 
>     That seems like a pretty substantial amount of memory reserved for each
>     connection. IMHO the benefits would have to be pretty significant to
>     justify this, especially considering it's kept "forever", even if you
>     run the function only once per day.
> 
> I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which is 
> 256KB per process. If needed, this could grow up to 16MB based on the
> current settings.
> 
> However, for the scenario you mentioned, it would be ideal to have a
> mechanism 
> to mark a pinned DSA (using dsa_pin()) for deletion if it is not used/
> attached within a 
> specified duration. Alternatively, I could avoid using dsa_pin()
> altogether, allowing the 
> DSA to be automatically destroyed once all processes detach from it, and
> recreate it 
> for a new request.
> 
> At the moment, I am unsure which approach is most feasible. Any
> suggestions would be
> greatly appreciated.
> 

I'm entirely unconcerned about the pg_get_process_memory_contexts()
performance, within some reasonable limits. It's something executed
every now and then - no one is going to complain it takes 10ms extra,
measure tps with this function, etc.

17-26% seems surprisingly high, but even 256kB is too much, IMHO. I'd
just get rid of this optimization until someone complains and explains
why it's worth it.

Yes, let's make it fast, but I don't think we should optimize it at the
expense of "regular workload" ...


regards

-- 
Tomas Vondra




Re: Enhancing Memory Context Statistics Reporting

From
Amit Langote
Date:
Hi Rahila,

Thanks for working on this.  I've wanted something like this a number
of times to replace my current method of attaching gdb like everyone
else I suppose.

I have a question / suggestion about the interface.

+Datum
+pg_get_process_memory_contexts(PG_FUNCTION_ARGS)
+{
+    int         pid = PG_GETARG_INT32(0);
+    bool        get_summary = PG_GETARG_BOOL(1);

IIUC, this always returns all memory contexts starting from
TopMemoryContext, summarizing some child contexts if memory doesn't
suffice. Would it be helpful to allow users to specify a context other
than TopMemoryContext as the root? This could be particularly useful
in cases where the information a user is looking for would otherwise
be grouped under "Remaining Totals." Alternatively, is there a way to
achieve this with the current function, perhaps by specifying a
condition in the WHERE clause?



Re: Enhancing Memory Context Statistics Reporting

From
torikoshia
Date:
Hi,

Thanks for updating the patch and here are some comments:

The 'path' column of pg_get_process_memory_contexts() begins with 0, but
that column of the pg_backend_memory_contexts view begins with 1:

   =# select path FROM pg_get_process_memory_contexts('20271', false);
   path
   -------
    {0}
    {0,1}
    {0,2}
    ..

=# select path from pg_backend_memory_contexts;
    path
   -------
    {1}
    {1,2}
    {1,3}
    ..

Would it be better to begin with 1 to make them consistent?


pg_log_backend_memory_contexts() does not allow non-superusers to 
execute by default since it can peek at other session information.
pg_get_process_memory_contexts() does not have this restriction, but 
wouldn't it be necessary?


When the target pid is the local backend, the HINT suggests using 
pg_get_backend_memory_contexts(), but this function is not described in 
the manual.
How about suggesting pg_backend_memory_contexts view instead?

   =# select pg_get_process_memory_contexts('27041', false);
   WARNING:  cannot return statistics for local backend
   HINT:  Use pg_get_backend_memory_contexts instead


There is no explanation of 'num_agg_contexts', but I thought an
explanation like the one below would be useful.

> I have added this information as a column named "num_agg_contexts", 
> which indicates
> the number of contexts whose statistics have been aggregated/added for 
> a particular output.

git apply caused some warnings:

$ git apply 
v7-Function-to-report-memory-context-stats-of-any-backe.patch
v7-Function-to-report-memory-context-stats-of-any-backe.patch:71: space 
before tab in indent.
         Requests to return the memory contexts of the backend with the
v7-Function-to-report-memory-context-stats-of-any-backe.patch:72: space 
before tab in indent.
         specified process ID.  This function can send the request to
v7-Function-to-report-memory-context-stats-of-any-backe.patch:73: space 
before tab in indent.
         both the backends and auxiliary processes. After receiving the 
memory
v7-Function-to-report-memory-context-stats-of-any-backe.patch:74: space 
before tab in indent.
         contexts from the process, it returns the result as one row per
v7-Function-to-report-memory-context-stats-of-any-backe.patch:75: space 
before tab in indent.
         context. When get_summary is true, memory contexts at level 0


-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.