Thread: Get memory contexts of an arbitrary backend process
Hi,

After commit 3e98c0bafb28de, we can display the usage of the memory
contexts using the pg_backend_memory_contexts system view.

However, its target is limited to the process attached to the current
session.

As discussed in the thread[1], it'll be useful to make it possible to
get the memory contexts of an arbitrary backend process.

The attached PoC patch makes pg_get_backend_memory_contexts() display
the memory contexts of the process with the specified PID:

  =# -- PID of the target process is 17051
  =# SELECT * FROM pg_get_backend_memory_contexts(17051);
            name           | ident |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
  -----------------------+-------+------------------+-------+-------------+---------------+------------+-------------+------------
   TopMemoryContext      |       |                  |     0 |       68720 |             5 |      16816 |          16 |      51904
   RowDescriptionContext |       | TopMemoryContext |     1 |        8192 |             1 |       6880 |           0 |       1312
   MessageContext        |       | TopMemoryContext |     1 |       65536 |             4 |      19912 |           1 |      45624
   ...

It doesn't display the contexts of all the backends, only the contexts
of the specified process.
I think that would be enough, because I suppose this function is used
after investigations using the ps command or other OS-level utilities.

The rough idea of the implementation is like below:

1. send a signal to the specified process
2. the signaled process dumps its memory contexts to a file
3. read the dumped file and display it to the user

Any thoughts?

[1] https://www.postgresql.org/message-id/72a656e0f71d0860161e0b3f67e4d771%40oss.nttdata.com

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment
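As a rough illustration of steps 2 and 3 above, here is a minimal
standalone sketch of a signaled process writing length-prefixed context
records to a file and a reader loading them back. The record layout,
field set, and function names are hypothetical, not taken from the
attached patch:

```c
#include <stdio.h>
#include <string.h>

/* One serialized memory-context record (illustrative layout only). */
typedef struct MemCtxRecord
{
    char name[64];
    char parent[64];
    int  level;
    long total_bytes;
} MemCtxRecord;

/* Write a string as int32 length followed by the bytes. */
static int
write_str(FILE *fp, const char *s)
{
    int len = (int) strlen(s);

    if (fwrite(&len, sizeof(len), 1, fp) != 1)
        return -1;
    if (len > 0 && fwrite(s, len, 1, fp) != 1)
        return -1;
    return 0;
}

static int
read_str(FILE *fp, char *buf, int bufsize)
{
    int len;

    if (fread(&len, sizeof(len), 1, fp) != 1)
        return -1;
    if (len < 0 || len >= bufsize)
        return -1;              /* corrupt or truncated record */
    if (len > 0 && fread(buf, len, 1, fp) != 1)
        return -1;
    buf[len] = '\0';
    return 0;
}

static int
dump_record(FILE *fp, const MemCtxRecord *r)
{
    if (write_str(fp, r->name) || write_str(fp, r->parent))
        return -1;
    if (fwrite(&r->level, sizeof(r->level), 1, fp) != 1)
        return -1;
    if (fwrite(&r->total_bytes, sizeof(r->total_bytes), 1, fp) != 1)
        return -1;
    return 0;
}

static int
load_record(FILE *fp, MemCtxRecord *r)
{
    if (read_str(fp, r->name, sizeof(r->name)))
        return -1;
    if (read_str(fp, r->parent, sizeof(r->parent)))
        return -1;
    if (fread(&r->level, sizeof(r->level), 1, fp) != 1)
        return -1;
    if (fread(&r->total_bytes, sizeof(r->total_bytes), 1, fp) != 1)
        return -1;
    return 0;
}

/* Round-trip one record through a temporary file; returns 0 on success. */
static int
demo_dump_roundtrip(void)
{
    MemCtxRecord in = {"MessageContext", "TopMemoryContext", 1, 65536};
    MemCtxRecord out;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    if (dump_record(fp, &in))
        return -1;
    rewind(fp);
    if (load_record(fp, &out))
        return -1;
    fclose(fp);
    if (strcmp(in.name, out.name) != 0 || strcmp(in.parent, out.parent) != 0)
        return -1;
    if (in.level != out.level || in.total_bytes != out.total_bytes)
        return -1;
    return 0;
}
```

In the real patch the writer and reader are separate backends, so the
reader additionally has to know when the file is complete; that point is
discussed later in the thread.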
Hi,

On Mon, Aug 31, 2020 at 8:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> As discussed in the thread[1], it'll be useful to make it
> possible to get the memory contexts of an arbitrary backend
> process.
+1

> Attached PoC patch makes pg_get_backend_memory_contexts()
> display memory contexts of the specified PID of the process.
Thanks, it's a very good patch for discussion.

> It doesn't display contexts of all the backends but only
> the contexts of specified process.
Or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
pg_stat_activity WHERE ...",
so I don't think it's a big deal.

> The rough idea of implementation is like below:
>
> 1. send a signal to the specified process
> 2. signaled process dumps its memory contexts to a file
> 3. read the dumped file and display it to the user
I agree with the overview of the idea.
Here are some comments and questions.

- Currently, "the signal transmission for dumping memory information"
and "the read & output of dump information"
are on the same interface, but I think it would be better to separate them.
How about providing the following three types of functions for users?
 - send a signal to the specified pid
 - check the status of the signal sent and received
 - read the dumped information
- How about managing the status of signal send/receive and dump
operations in a shared hash or similar?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to check
the status of each process.
A simple idea is that ...
 - when a signal to dump is sent to a PID, first record the following
   information into the shared hash:
     pid (specified pid)
     loc (dump location, currently might be ASAP)
     recv (did the pid process receive a signal? initially false)
     dumped (did the pid process dump its memory information? initially false)
 - the specified process receives the signal, updates the status in the
   shared hash, then dumps at the specified location.
 - when the specified process finishes dumping the memory information,
   it updates the status in the shared hash.
- Does it allow one process to output multiple dump files?
It appears to simply overwrite at present, but I thought it
would be good to be able to generate multiple dump files in
different phases (e.g., planning phase and execution phase)
in the future.
- How is the dump file cleaned up?

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
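The per-PID bookkeeping proposed above amounts to a small state machine.
A minimal standalone sketch follows; the struct and function names are
illustrative only, and in a real implementation the entry would live in
a shared hash protected by a lock:

```c
#include <stdbool.h>

/* Hypothetical per-request entry, as it might live in a shared hash
 * keyed by pid (names are illustrative, not from the patch). */
typedef struct McxtDumpEntry
{
    int  pid;                   /* target backend */
    bool recv;                  /* has the target seen the signal? */
    bool dumped;                /* is the dump file complete? */
} McxtDumpEntry;

/* Requester side: initialize the entry before sending the signal. */
static void
request_dump(McxtDumpEntry *e, int pid)
{
    e->pid = pid;
    e->recv = false;
    e->dumped = false;
    /* ...here the requester would signal the target process... */
}

/* Target side: signal handler path acknowledges receipt. */
static void
mark_received(McxtDumpEntry *e)
{
    e->recv = true;
}

/* Target side: set once the dump file has been written completely. */
static void
mark_dumped(McxtDumpEntry *e)
{
    e->dumped = true;
}

/* Walk the lifecycle and verify each intermediate state; 0 on success. */
static int
demo_entry_lifecycle(void)
{
    McxtDumpEntry e;

    request_dump(&e, 17051);
    if (e.recv || e.dumped)
        return -1;              /* nothing has happened yet */
    mark_received(&e);
    if (!e.recv || e.dumped)
        return -1;              /* signal seen, no dump yet */
    mark_dumped(&e);
    if (!e.recv || !e.dumped)
        return -1;              /* dump complete */
    return e.pid == 17051 ? 0 : -1;
}
```

The point of the two flags is that the requester can distinguish "target
never saw the signal" from "target is still dumping", which the
file-existence checks in the PoC cannot.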
Hi
po 31. 8. 2020 v 17:03 odesílatel Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> napsal:
Hi,
On Mon, Aug 31, 2020 at 8:22 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> As discussed in the thread[1], it'll be useful to make it
> possible to get the memory contexts of an arbitrary backend
> process.
+1
> Attached PoC patch makes pg_get_backend_memory_contexts()
> display memory contexts of the specified PID of the process.
Thanks, it's a very good patch for discussion.
> It doesn't display contexts of all the backends but only
> the contexts of specified process.
or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
pg_stat_activity WHERE ...",
so I don't think it's a big deal.
> The rough idea of implementation is like below:
>
> 1. send a signal to the specified process
> 2. signaled process dumps its memory contexts to a file
> 3. read the dumped file and display it to the user
I agree with the overview of the idea.
Here are some comments and questions.
- Currently, "the signal transmission for dumping memory information"
and "the read & output of dump information"
are on the same interface, but I think it would be better to separate them.
How about providing the following three types of functions for users?
- send a signal to specified pid
- check the status of the signal sent and received
- read the dumped information
- How about managing the status of signal send/receive and dump
operations on a shared hash or others ?
Sending and receiving signals, dumping memory information, and
referencing dump information all work asynchronously.
Therefore, it would be good to have management information to check
the status of each process.
A simple idea is that ...
 - when a signal to dump is sent to a PID, first record the following
   information into the shared hash:
     pid (specified pid)
     loc (dump location, currently might be ASAP)
     recv (did the pid process receive a signal? initially false)
     dumped (did the pid process dump its memory information? initially false)
 - the specified process receives the signal, updates the status in the
   shared hash, then dumps at the specified location.
 - when the specified process finishes dumping the memory information,
   it updates the status in the shared hash.
- Does it allow one process to output multiple dump files?
It appears to simply overwrite at present, but I thought it
would be good to be able to generate multiple dump files in
different phases (e.g., planning phase and execution phase)
in the future.
- How is the dump file cleaned up?
For a very long time there has been similar discussion about retrieving the session query and session execution plans from other sessions.
I am not sure how necessary the information in a memory dump is, but I am sure that retrieving the current execution plan and the complete text of the current query is pretty necessary information.
It would be great if this infrastructure could be used for any debugging purpose.
Regards
Pavel
Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
Hi,

On 2020-08-31 20:22:18 +0900, torikoshia wrote:
> After commit 3e98c0bafb28de, we can display the usage of the
> memory contexts using pg_backend_memory_contexts system
> view.
>
> However, its target is limited to the process attached to
> the current session.
>
> As discussed in the thread[1], it'll be useful to make it
> possible to get the memory contexts of an arbitrary backend
> process.
>
> Attached PoC patch makes pg_get_backend_memory_contexts()
> display memory contexts of the specified PID of the process.

Awesome!

> It doesn't display contexts of all the backends but only
> the contexts of specified process.
> I think it would be enough because I suppose this function
> is used after investigations using ps command or other OS
> level utilities.

It can be used as a building block if all are needed. Getting the
infrastructure right is the big thing here, I think. Adding more
detailed views on top of that data later is easier.

> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index a2d61302f9..88fb837ecd 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -555,10 +555,10 @@ REVOKE ALL ON pg_shmem_allocations FROM PUBLIC;
>  REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
>
>  CREATE VIEW pg_backend_memory_contexts AS
> -    SELECT * FROM pg_get_backend_memory_contexts();
> +    SELECT * FROM pg_get_backend_memory_contexts(-1);

-1 is odd. Why not use NULL or even 0?

> +	else
> +	{
> +		int		rc;
> +		int		parent_len = strlen(parent);
> +		int		name_len = strlen(name);
> +
> +		/*
> +		 * write out the current memory context information.
> +		 * Since some elements of values are reusable, we write it out.

Not sure what the second comment line here is supposed to mean?

> +		 */
> +		fputc('D', fpout);
> +		rc = fwrite(values, sizeof(values), 1, fpout);
> +		rc = fwrite(nulls, sizeof(nulls), 1, fpout);
> +
> +		/* write out information which is not resuable from serialized values */

s/resuable/reusable/

> +		rc = fwrite(&name_len, sizeof(int), 1, fpout);
> +		rc = fwrite(name, name_len, 1, fpout);
> +		rc = fwrite(&idlen, sizeof(int), 1, fpout);
> +		rc = fwrite(clipped_ident, idlen, 1, fpout);
> +		rc = fwrite(&level, sizeof(int), 1, fpout);
> +		rc = fwrite(&parent_len, sizeof(int), 1, fpout);
> +		rc = fwrite(parent, parent_len, 1, fpout);
> +		(void) rc;				/* we'll check for error with ferror */
> +
> +	}

This format is not descriptive. How about serializing to json or
something? Or at least having field names?

Alternatively, build the same tuple we build for the SRF, and serialize
that. Then there's basically no conversion needed.

> @@ -117,6 +157,8 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
>  Datum
>  pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
>  {
> +	int			pid = PG_GETARG_INT32(0);
> +
>  	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
>  	TupleDesc	tupdesc;
>  	Tuplestorestate *tupstore;
> @@ -147,11 +189,258 @@ pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
>
>  	MemoryContextSwitchTo(oldcontext);
>
> -	PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
> -								TopMemoryContext, NULL, 0);
> +	if (pid == -1)
> +	{
> +		/*
> +		 * Since pid -1 indicates target is the local process, simply
> +		 * traverse memory contexts.
> +		 */
> +		PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
> +								TopMemoryContext, "", 0, NULL);
> +	}
> +	else
> +	{
> +		/*
> +		 * Send signal for dumping memory contexts to the target process,
> +		 * and read the dumped file.
> +		 */
> +		FILE	   *fpin;
> +		char		dumpfile[MAXPGPATH];
> +
> +		SendProcSignal(pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
> +
> +		snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", pid);
> +
> +		while (true)
> +		{
> +			CHECK_FOR_INTERRUPTS();
> +
> +			pg_usleep(10000L);
> +

Need better signalling back/forth here.

> +/*
> + * dump_memory_contexts
> + *		Dumping local memory contexts to a file.
> + *		This function does not delete the file as it is intended to be read by
> + *		another process.
> + */
> +static void
> +dump_memory_contexts(void)
> +{
> +	FILE	   *fpout;
> +	char		tmpfile[MAXPGPATH];
> +	char		dumpfile[MAXPGPATH];
> +
> +	snprintf(tmpfile, sizeof(tmpfile), "pg_memusage/%d.tmp", MyProcPid);
> +	snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", MyProcPid);
> +
> +	/*
> +	 * Open a temp file to dump the current memory context.
> +	 */
> +	fpout = AllocateFile(tmpfile, PG_BINARY_W);
> +	if (fpout == NULL)
> +	{
> +		ereport(LOG,
> +				(errcode_for_file_access(),
> +				 errmsg("could not write temporary memory context file \"%s\": %m",
> +						tmpfile)));
> +		return;
> +	}

Probably should be opened with O_CREAT | O_TRUNC?

Greetings,

Andres Freund
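One way to address the "better signalling" concern about the unbounded
`while (true)` / `pg_usleep()` loop is to put an overall deadline on the
wait. Below is a standalone sketch in plain POSIX C; a real backend
would more likely wait on a latch or condition variable, and the
function name is hypothetical:

```c
#include <stdbool.h>
#include <time.h>

/* Poll a completion flag with an overall deadline instead of looping
 * forever. The flag stands in for the "dump finished" state; the 10 ms
 * tick mirrors the patch's pg_usleep(10000L). */
static bool
wait_for_flag(volatile bool *flag, int timeout_ms)
{
    struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms */
    int waited_ms = 0;

    while (!*flag)
    {
        if (waited_ms >= timeout_ms)
            return false;       /* caller can then report a timeout */
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }
    return true;
}

static int
demo_wait(void)
{
    volatile bool done = true;
    volatile bool never = false;

    /* an already-set flag returns immediately */
    if (!wait_for_flag(&done, 100))
        return -1;
    /* a flag that never gets set times out after ~50 ms */
    if (wait_for_flag(&never, 50))
        return -1;
    return 0;
}
```

A bounded wait alone is not full two-way signalling, but it at least
turns a hung target process into a reportable error instead of an
interrupt-only hang.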
On 2020-09-01 03:29, Pavel Stehule wrote:
> Hi
>
> po 31. 8. 2020 v 17:03 odesílatel Kasahara Tatsuhito
> <kasahara.tatsuhito@gmail.com> napsal:
>
>> Hi,
>>
>> On Mon, Aug 31, 2020 at 8:22 PM torikoshia
>> <torikoshia@oss.nttdata.com> wrote:
>>> As discussed in the thread[1], it'll be useful to make it
>>> possible to get the memory contexts of an arbitrary backend
>>> process.
>> +1
>>
>>> Attached PoC patch makes pg_get_backend_memory_contexts()
>>> display memory contexts of the specified PID of the process.
>> Thanks, it's a very good patch for discussion.
>>
>>> It doesn't display contexts of all the backends but only
>>> the contexts of specified process.
>> or we can "SELECT (pg_get_backend_memory_contexts(pid)).* FROM
>> pg_stat_activity WHERE ...",
>> so I don't think it's a big deal.
>>
>>> The rough idea of implementation is like below:
>>>
>>> 1. send a signal to the specified process
>>> 2. signaled process dumps its memory contexts to a file
>>> 3. read the dumped file and display it to the user
>> I agree with the overview of the idea.
>> Here are some comments and questions.

Thanks for the comments!

>> - Currently, "the signal transmission for dumping memory information"
>> and "the read & output of dump information"
>> are on the same interface, but I think it would be better to
>> separate them.
>> How about providing the following three types of functions for users?
>> - send a signal to specified pid
>> - check the status of the signal sent and received
>> - read the dumped information

Is this for future extensibility, to make it possible to get other
information like the current execution plan, which was suggested by
Pavel?

If so, I agree with considering extensibility, but I'm not sure it's
necessary to provide these types of functions to 'users'.

>> - How about managing the status of signal send/receive and dump
>> operations on a shared hash or others?
>> Sending and receiving signals, dumping memory information, and
>> referencing dump information all work asynchronously.
>> Therefore, it would be good to have management information to check
>> the status of each process.
>> A simple idea is that ..
>> - send a signal to dump to a PID, it first record following
>> information into the shared hash.
>> pid (specified pid)
>> loc (dump location, currently might be ASAP)
>> recv (did the pid process receive a signal? first false)
>> dumped (did the pid process dump a mem information? first false)
>> - specified process receive the signal, update the status in the
>> shared hash, then dumped at specified location.
>> - specified process finish dump mem information, update the status
>> in the shared hash.

Adding management information on shared memory seems necessary when we
want to have more control over dumping, like the 'dump location' or any
other information such as the 'current execution plan'.
I'm going to consider this.

>> - Does it allow one process to output multiple dump files?
>> It appears to be a specification to overwrite at present, but I
>> thought it would be good to be able to generate
>> multiple dump files in different phases (e.g., planning phase and
>> execution phase) in the future.
>> - How is the dump file cleaned up?
>
> For a very long time there has been similar discussion about taking
> session query and session execution plans from other sessions.
>
> I am not sure how necessary information is in the memory dump, but I
> am sure so taking the current execution plan and complete text of the
> current query is pretty necessary information.
>
> but can be great so this infrastructure can be used for any debugging
> purpose.

Thanks! It would be good if some part of this effort can serve as
infrastructure for other debugging purposes.
It may be hard, but I will keep your comment in mind.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

>
> Regards
>
> Pavel
>
>> Best regards,
>>
>> --
>> Tatsuhito Kasahara
>> kasahara.tatsuhito _at_ gmail.com
Thanks for reviewing!

I'm going to modify the patch according to your comments.

On 2020-09-01 10:54, Andres Freund wrote:
> Hi,
>
> On 2020-08-31 20:22:18 +0900, torikoshia wrote:
>> After commit 3e98c0bafb28de, we can display the usage of the
>> memory contexts using pg_backend_memory_contexts system
>> view.
>>
>> However, its target is limited to the process attached to
>> the current session.
>>
>> As discussed in the thread[1], it'll be useful to make it
>> possible to get the memory contexts of an arbitrary backend
>> process.
>>
>> Attached PoC patch makes pg_get_backend_memory_contexts()
>> display memory contexts of the specified PID of the process.
>
> Awesome!
>
>> It doesn't display contexts of all the backends but only
>> the contexts of specified process.
>> I think it would be enough because I suppose this function
>> is used after investigations using ps command or other OS
>> level utilities.
>
> It can be used as a building block if all are needed. Getting the
> infrastructure right is the big thing here, I think. Adding more
> detailed views on top of that data later is easier.
>
>> diff --git a/src/backend/catalog/system_views.sql
>> b/src/backend/catalog/system_views.sql
>> index a2d61302f9..88fb837ecd 100644
>> --- a/src/backend/catalog/system_views.sql
>> +++ b/src/backend/catalog/system_views.sql
>> @@ -555,10 +555,10 @@ REVOKE ALL ON pg_shmem_allocations FROM PUBLIC;
>>  REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
>>
>>  CREATE VIEW pg_backend_memory_contexts AS
>> -    SELECT * FROM pg_get_backend_memory_contexts();
>> +    SELECT * FROM pg_get_backend_memory_contexts(-1);
>
> -1 is odd. Why not use NULL or even 0?
>
>> +	else
>> +	{
>> +		int		rc;
>> +		int		parent_len = strlen(parent);
>> +		int		name_len = strlen(name);
>> +
>> +		/*
>> +		 * write out the current memory context information.
>> +		 * Since some elements of values are reusable, we write it out.
>
> Not sure what the second comment line here is supposed to mean?
>
>> +		 */
>> +		fputc('D', fpout);
>> +		rc = fwrite(values, sizeof(values), 1, fpout);
>> +		rc = fwrite(nulls, sizeof(nulls), 1, fpout);
>> +
>> +		/* write out information which is not resuable from serialized
>> values */
>
> s/resuable/reusable/
>
>> +		rc = fwrite(&name_len, sizeof(int), 1, fpout);
>> +		rc = fwrite(name, name_len, 1, fpout);
>> +		rc = fwrite(&idlen, sizeof(int), 1, fpout);
>> +		rc = fwrite(clipped_ident, idlen, 1, fpout);
>> +		rc = fwrite(&level, sizeof(int), 1, fpout);
>> +		rc = fwrite(&parent_len, sizeof(int), 1, fpout);
>> +		rc = fwrite(parent, parent_len, 1, fpout);
>> +		(void) rc;				/* we'll check for error with ferror */
>> +
>> +	}
>
> This format is not descriptive. How about serializing to json or
> something? Or at least having field names?
>
> Alternatively, build the same tuple we build for the SRF, and serialize
> that. Then there's basically no conversion needed.
>
>> @@ -117,6 +157,8 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate
>> *tupstore,
>>  Datum
>>  pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
>>  {
>> +	int			pid = PG_GETARG_INT32(0);
>> +
>>  	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
>>  	TupleDesc	tupdesc;
>>  	Tuplestorestate *tupstore;
>> @@ -147,11 +189,258 @@
>> pg_get_backend_memory_contexts(PG_FUNCTION_ARGS)
>>
>>  	MemoryContextSwitchTo(oldcontext);
>>
>> -	PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
>> -								TopMemoryContext, NULL, 0);
>> +	if (pid == -1)
>> +	{
>> +		/*
>> +		 * Since pid -1 indicates target is the local process, simply
>> +		 * traverse memory contexts.
>> +		 */
>> +		PutMemoryContextsStatsTupleStore(tupstore, tupdesc,
>> +								TopMemoryContext, "", 0, NULL);
>> +	}
>> +	else
>> +	{
>> +		/*
>> +		 * Send signal for dumping memory contexts to the target process,
>> +		 * and read the dumped file.
>> +		 */
>> +		FILE	   *fpin;
>> +		char		dumpfile[MAXPGPATH];
>> +
>> +		SendProcSignal(pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
>> +
>> +		snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", pid);
>> +
>> +		while (true)
>> +		{
>> +			CHECK_FOR_INTERRUPTS();
>> +
>> +			pg_usleep(10000L);
>> +
>
> Need better signalling back/forth here.

Do you mean I should also send another signal from the dumped process
back to the caller of pg_get_backend_memory_contexts() when it finishes
dumping?

>> +/*
>> + * dump_memory_contexts
>> + *		Dumping local memory contexts to a file.
>> + *		This function does not delete the file as it is intended to be
>> read by
>> + *		another process.
>> + */
>> +static void
>> +dump_memory_contexts(void)
>> +{
>> +	FILE	   *fpout;
>> +	char		tmpfile[MAXPGPATH];
>> +	char		dumpfile[MAXPGPATH];
>> +
>> +	snprintf(tmpfile, sizeof(tmpfile), "pg_memusage/%d.tmp", MyProcPid);
>> +	snprintf(dumpfile, sizeof(dumpfile), "pg_memusage/%d", MyProcPid);
>> +
>> +	/*
>> +	 * Open a temp file to dump the current memory context.
>> +	 */
>> +	fpout = AllocateFile(tmpfile, PG_BINARY_W);
>> +	if (fpout == NULL)
>> +	{
>> +		ereport(LOG,
>> +				(errcode_for_file_access(),
>> +				 errmsg("could not write temporary memory context file \"%s\": %m",
>> +						tmpfile)));
>> +		return;
>> +	}
>
> Probably should be opened with O_CREAT | O_TRUNC?
>
> Greetings,
>
> Andres Freund

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Hi,

On Thu, Sep 3, 2020 at 3:38 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >> - Currently, "the signal transmission for dumping memory
> >> information"
> >> and "the read & output of dump information"
> >> are on the same interface, but I think it would be better to
> >> separate them.
> >> How about providing the following three types of functions for
> >> users?
> >> - send a signal to specified pid
> >> - check the status of the signal sent and received
> >> - read the dumped information
>
> Is this for future extensibility to make it possible to get
> other information like the current execution plan which was
> suggested by Pavel?
Yes, but it's not only for future expansion, but also for the
usability and the stability of this feature.

For example, if you want to read one dumped file multiple times and
analyze it, you will want the ability to just read the dump.
Moreover, when it takes a long time from receiving the signal to the
dump output, or the dump output itself takes a long time, users can
investigate where the time is spent if each step is separated.

> If so, I agree with considering extensibility, but I'm not
> sure it's necessary whether providing these types of
> functions for 'users'.
Of course, it is possible and may be necessary to provide a wrapped
sequence of steps from sending a signal to reading dump files.
But IMO, some users would like to perform the signal transmission,
state management and dump file reading steps separately.

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:
> Yes, but it's not only for future expansion, but also for the
> usability and the stability of this feature.
> For example, if you want to read one dumped file multiple times and analyze it,
> you will want the ability to just read the dump.

If we design it to make that possible, how are we going to prevent disk
space leaks from never-cleaned-up dump files?

			regards, tom lane
On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:
> > Yes, but it's not only for future expansion, but also for the
> > usability and the stability of this feature.
> > For example, if you want to read one dumped file multiple times and analyze it,
> > you will want the ability to just read the dump.
>
> If we design it to make that possible, how are we going to prevent disk
> space leaks from never-cleaned-up dump files?
In my thought, with features such as a view that allows us to see a
list of dumped files, it would be better to have a function that
simply deletes the dump files associated with a specific PID, or
deletes all dump files.
Some files may be dumped with unexpected delays, so I think a cleanup
feature will be necessary.
(Also, as with pgsql_tmp files, it might be better to delete dump
files when PostgreSQL starts.)

Or should we try to delete the dump file as soon as we have read it?

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
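A startup sweep along the lines suggested above (analogous to cleaning
pgsql_tmp) could look roughly like the following standalone sketch. The
directory name and helper function are hypothetical, and error handling
is reduced to return codes:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remove every entry under dir, as a startup sweep over a hypothetical
 * pg_memusage directory might. Returns the number of files removed,
 * or -1 if the directory cannot be opened. */
static int
sweep_dump_dir(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int removed = 0;
    char path[1024];

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}

/* Create a directory with one stale dump file, sweep it, verify. */
static int
demo_sweep(void)
{
    const char *dir = "demo_pg_memusage";
    FILE *fp;

    if (mkdir(dir, 0700) != 0)
        return -1;
    fp = fopen("demo_pg_memusage/17051", "w");
    if (fp == NULL)
        return -1;
    fputs("stale dump\n", fp);
    fclose(fp);
    if (sweep_dump_dir(dir) != 1)   /* exactly one stale file removed */
        return -1;
    rmdir(dir);
    return 0;
}
```

Running the sweep once at startup keeps leaked files bounded to a single
server lifetime even if a backend dies between dump and read.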
On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:
>On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:
>> > Yes, but it's not only for future expansion, but also for the
>> > usability and the stability of this feature.
>> > For example, if you want to read one dumped file multiple times and analyze it,
>> > you will want the ability to just read the dump.
>>
>> If we design it to make that possible, how are we going to prevent disk
>> space leaks from never-cleaned-up dump files?
>In my thought, with features such as a view that allows us to see a
>list of dumped files,
>it would be better to have a function that simply deletes the dump
>files associated with a specific PID,
>or to delete all dump files.
>Some files may be dumped with unexpected delays, so I think the
>cleaning feature will be necessary.
>( Also, as the pgsql_tmp file, it might better to delete dump files
>when PostgreSQL start.)
>
>Or should we try to delete the dump file as soon as we can read it?

IMO making the cleanup a responsibility of the users (e.g. by exposing
the list of dumped files through a view and expecting users to delete
them in some way) is rather fragile.

I don't quite see what's the point of designing it this way. It was
suggested this improves stability and usability of this feature, but
surely making it unnecessarily complex contradicts both points?

IMHO if the user needs to process the dump repeatedly, what's preventing
him/her from storing it in a file, or something like that? At that point
it's clear it's up to them to remove the file. So I suggest to keep the
feature as simple as possible - hand the dump over and delete.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-09-04 21:46, Tomas Vondra wrote:
> On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:
>> On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:
>>> > Yes, but it's not only for future expansion, but also for the
>>> > usability and the stability of this feature.
>>> > For example, if you want to read one dumped file multiple times and analyze it,
>>> > you will want the ability to just read the dump.
>>>
>>> If we design it to make that possible, how are we going to prevent
>>> disk space leaks from never-cleaned-up dump files?
>> In my thought, with features such as a view that allows us to see a
>> list of dumped files,
>> it would be better to have a function that simply deletes the dump
>> files associated with a specific PID,
>> or to delete all dump files.
>> Some files may be dumped with unexpected delays, so I think the
>> cleaning feature will be necessary.
>> ( Also, as the pgsql_tmp file, it might better to delete dump files
>> when PostgreSQL start.)
>>
>> Or should we try to delete the dump file as soon as we can read it?
>
> IMO making the cleanup a responsibility of the users (e.g. by exposing
> the list of dumped files through a view and expecting users to delete
> them in some way) is rather fragile.
>
> I don't quite see what's the point of designing it this way. It was
> suggested this improves stability and usability of this feature, but
> surely making it unnecessarily complex contradicts both points?
>
> IMHO if the user needs to process the dump repeatedly, what's
> preventing him/her from storing it in a file, or something like that?
> At that point it's clear it's up to them to remove the file. So I
> suggest to keep the feature as simple as possible - hand the dump over
> and delete.

+1.
If there are no other objections, I'm going to accept this suggestion.

Regards
Hi,

On Thu, Sep 10, 2020 at 8:53 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2020-09-04 21:46, Tomas Vondra wrote:
> > On Fri, Sep 04, 2020 at 11:47:30AM +0900, Kasahara Tatsuhito wrote:
> >> On Fri, Sep 4, 2020 at 2:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> writes:
> >>> > Yes, but it's not only for future expansion, but also for the
> >>> > usability and the stability of this feature.
> >>> > For example, if you want to read one dumped file multiple times and analyze it,
> >>> > you will want the ability to just read the dump.
> >>>
> >>> If we design it to make that possible, how are we going to prevent
> >>> disk space leaks from never-cleaned-up dump files?
> >> In my thought, with features such as a view that allows us to see a
> >> list of dumped files,
> >> it would be better to have a function that simply deletes the dump
> >> files associated with a specific PID,
> >> or to delete all dump files.
> >> Some files may be dumped with unexpected delays, so I think the
> >> cleaning feature will be necessary.
> >> ( Also, as the pgsql_tmp file, it might better to delete dump files
> >> when PostgreSQL start.)
> >>
> >> Or should we try to delete the dump file as soon as we can read it?
> >
> > IMO making the cleanup a responsibility of the users (e.g. by exposing
> > the list of dumped files through a view and expecting users to delete
> > them in some way) is rather fragile.
> >
> > I don't quite see what's the point of designing it this way. It was
> > suggested this improves stability and usability of this feature, but
> > surely making it unnecessarily complex contradicts both points?
> >
> > IMHO if the user needs to process the dump repeatedly, what's
> > preventing him/her from storing it in a file, or something like that?
> > At that point it's clear it's up to them to remove the file. So I
> > suggest to keep the feature as simple as possible - hand the dump
> > over and delete.
Yeah, it might be better to avoid making the user responsible for removal.
I think it's fine to have an interface to delete in an emergency, but
I agree that users shouldn't be made aware of the existence or deletion
of dump files, basically.

> +1.
> If there are no other objections, I'm going to accept this
> suggestion.
So +1

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:
> I think it's fine to have an interface to delete in an emergency, but
> I agree that
> users shouldn't be made aware of the existence or deletion of dump
> files, basically.

Per the CF bot, the number of tests needs to be tweaked, because we
test each entry filtered out with is_deeply(), meaning that the number
of tests needs to be updated to reflect that if the filtered list is
changed:
t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests but ran 110.
t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280, 0xff00)
All 109 subtests passed

Simple enough to fix.
--
Michael
Attachment
Hi,

Thanks for all your comments, I updated the patch.

On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote:
> - How about managing the status of signal send/receive and dump
>   operations on a shared hash or others ?
>   Sending and receiving signals, dumping memory information, and
>   referencing dump information all work asynchronously.
>   Therefore, it would be good to have management information to check
>   the status of each process.
>   A simple idea is that ..
>   - send a signal to dump to a PID, it first record following
>     information into the shared hash.
>       pid (specified pid)
>       loc (dump location, currently might be ASAP)
>       recv (did the pid process receive a signal? first false)
>       dumped (did the pid process dump a mem information? first false)
>   - specified process receive the signal, update the status in the
>     shared hash, then dumped at specified location.
>   - specified process finish dump mem information, update the status
>     in the shared hash.

I added a shared hash table consisting of minimal members, mainly for
managing whether the file is dumped or not.
Some members like 'loc' seem useful in the future, but I haven't added
them since they're not essential at this point.

On 2020-09-01 10:54, Andres Freund wrote:
>> CREATE VIEW pg_backend_memory_contexts AS
>> -    SELECT * FROM pg_get_backend_memory_contexts();
>> +    SELECT * FROM pg_get_backend_memory_contexts(-1);
>
> -1 is odd. Why not use NULL or even 0?

Changed it from -1 to NULL.

>> +		rc = fwrite(&name_len, sizeof(int), 1, fpout);
>> +		rc = fwrite(name, name_len, 1, fpout);
>> +		rc = fwrite(&idlen, sizeof(int), 1, fpout);
>> +		rc = fwrite(clipped_ident, idlen, 1, fpout);
>> +		rc = fwrite(&level, sizeof(int), 1, fpout);
>> +		rc = fwrite(&parent_len, sizeof(int), 1, fpout);
>> +		rc = fwrite(parent, parent_len, 1, fpout);
>> +		(void) rc;				/* we'll check for error with ferror */
>> +
>> +	}
>
> This format is not descriptive. How about serializing to json or
> something? Or at least having field names?

Added field names for each value.

>> +		while (true)
>> +		{
>> +			CHECK_FOR_INTERRUPTS();
>> +
>> +			pg_usleep(10000L);
>> +
>
> Need better signalling back/forth here.

Added a shared hash table that has a flag for managing whether the file
is dumped or not. I modified the code to use this flag.

>> +	/*
>> +	 * Open a temp file to dump the current memory context.
>> +	 */
>> +	fpout = AllocateFile(tmpfile, PG_BINARY_W);
>> +	if (fpout == NULL)
>> +	{
>> +		ereport(LOG,
>> +				(errcode_for_file_access(),
>> +				 errmsg("could not write temporary memory context file \"%s\": %m",
>> +						tmpfile)));
>> +		return;
>> +	}
>
> Probably should be opened with O_CREAT | O_TRUNC?

AllocateFile() calls fopen(), and AFAIU fopen() with mode "w"
corresponds to open() with "O_WRONLY | O_CREAT | O_TRUNC".
Do you mean I should not use fopen() here?

On 2020-09-24 13:01, Michael Paquier wrote:
> On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:
>> I think it's fine to have an interface to delete in an emergency, but
>> I agree that
>> users shouldn't be made aware of the existence or deletion of dump
>> files, basically.
>
> Per the CF bot, the number of tests needs to be tweaked, because we
> test each entry filtered out with is_deeply(), meaning that the number
> of tests needs to be updated to reflect that if the filtered list is
> changed:
> t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests
> but ran 110.
> t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280,
> 0xff00)
> All 109 subtests passed
>
> Simple enough to fix.

Incremented the number of tests.

Any thoughts?

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Hi,

On Fri, Sep 25, 2020 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
> Thanks for all your comments, I updated the patch.
Thanks for updating the patch.
I did a brief test and code review.

> I added a shared hash table consisted of minimal members
> mainly for managing whether the file is dumped or not.
> Some members like 'loc' seem useful in the future, but I
> haven't added them since it's not essential at this point.
Yes, that would be good.

+	/*
+	 * Since we allow only one session can request to dump memory context at
+	 * the same time, check whether the dump files already exist.
+	 */
+	while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+	{
+		pg_usleep(1000000L);
+	}

If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when you do so.
This is easy to reproduce and can be done as follows.

[session-1]
BEGIN;
LOCK TABLE t1;
[Session-2]
BEGIN;
LOCK TABLE t1; <- waiting
[Session-3]
select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
[Session-4]
select * FROM pg_get_backend_memory_contexts(<pid of session-2>);

If you issue commit or abort at session-1, you will get SEGV.

Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.

+	if (proc == NULL)
+	{
+		ereport(WARNING,
+				(errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+		return (Datum) 1;
+	}

Shouldn't it clear the hash entry before return?

+	/* Wait until target process finished dumping file. */
+	while (!entry->is_dumped)
+	{
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(10000L);
+	}

If the target is killed and exits before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping, so that the caller detects it?
I'm not sure it's best or not, but you might want to use something like the on_shmem_exit callback. In the current design, if the caller stops processing before reading the dumped file, you will have an orphaned file. It looks like the following. [session-1] BEGIN; LOCK TABKE t1; [Session-2] BEGIN; LOCK TABLE t1; <- waiting [Session-3] select * FROM pg_get_backend_memory_contexts(<pid of session-2>); If you cancel or terminate the session-3, then issue commit or abort at session-1, you will get orphaned files in pg_memusage. So if you allow only one session can request to dump file, it could call pg_memusage_reset() before send the signal in this function. Best regards, -- Tatsuhito Kasahara kasahara.tatsuhito _at_ gmail.com
> On Thu, Oct 1, 2020 at 4:06 PM Kasahara Tatsuhito > <kasahara.tatsuhito@gmail.com> wrote: > Hi, > > On Fri, Sep 25, 2020 at 4:28 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: > > Thanks for all your comments, I updated the patch. > Thanks for updating the patch. > I did a brief test and code review. Thanks for your tests and review! > > I added a shared hash table consisted of minimal members > > mainly for managing whether the file is dumped or not. > > Some members like 'loc' seem useful in the future, but I > > haven't added them since it's not essential at this point. > Yes, that would be good. > > + /* > + * Since we allow only one session can request to dump > memory context at > + * the same time, check whether the dump files already exist. > + */ > + while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, > &stat_tmp) == 0) > + { > + pg_usleep(1000000L); > + } > > If pg_get_backend_memory_contexts() is executed by two or more > sessions at the same time, it cannot be run exclusively in this way. > Currently it seems to cause a crash when do it so. > This is easy to reproduce and can be done as follows. > > [session-1] > BEGIN; > LOCK TABKE t1; > [Session-2] > BEGIN; > LOCK TABLE t1; <- waiting > [Session-3] > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); > [Session-4] > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); > > If you issue commit or abort at session-1, you will get SEGV. > > Instead of checking for the existence of the file, it might be better > to use a hash (mcxtdumpHash) entry with LWLock. Thanks! Added a LWLock and changed the way from checking the file existence to finding the hash entry. > + if (proc == NULL) > + { > + ereport(WARNING, > + (errmsg("PID %d is not a PostgreSQL server > process", dst_pid))); > + return (Datum) 1; > + } > > Shouldn't it clear the hash entry before return? Yeah. added codes for removing the entry. > > + /* Wait until target process finished dumping file. 
*/ > + while (!entry->is_dumped) > + { > + CHECK_FOR_INTERRUPTS(); > + pg_usleep(10000L); > + } > > If the target is killed and exit before dumping the memory > information, you're in an infinite loop here. > So how about making sure that any process that has to stop before > doing a memory dump changes the status of the hash (entry->is_dumped) > before stopping and the caller detects it? > I'm not sure it's best or not, but you might want to use something > like the on_shmem_exit callback. Thanks for your idea! Added a callback to change the status of the hash table entry. Although I think it's necessary to remove this callback when it finished processing memory dumping, on_shmem_exit() does not seem to have such a function. I used before_shmem_exit() since it has a corresponding function to remove registered callback. If it's inappropriate, I'm going to add a function removing the registered callback of on_shmem_exit(). > In the current design, if the caller stops processing before reading > the dumped file, you will have an orphaned file. > It looks like the following. > > [session-1] > BEGIN; > LOCK TABKE t1; > [Session-2] > BEGIN; > LOCK TABLE t1; <- waiting > [Session-3] > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); > > If you cancel or terminate the session-3, then issue commit or abort > at session-1, you will get orphaned files in pg_memusage. > > So if you allow only one session can request to dump file, it could > call pg_memusage_reset() before send the signal in this function. Although I'm going to allow only one session per one target process, I'd like to allow running multiple pg_get_backend_memory_contexts() which target process is different. Instead of calling pg_memusage_reset(), I added a callback for cleaning up orphaned files and the elements of the hash table using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP(). 
I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
here since they can handle not only termination but also cancellation.

Any thoughts?

--
Atsushi Torikoshi
Wait...

> Attachments:
> 0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch

For a moment I thought that the number was a patch number, but the
predecessors are 0002-Enabled..collect.patch and 0001-(same name).
It's not mandatory, but we usually name them as follows, which is the
way of git.

v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch

The vn prefix is added by the -v option of git format-patch.

At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in
> > > I added a shared hash table consisted of minimal members
> > > mainly for managing whether the file is dumped or not.
> > > Some members like 'loc' seem useful in the future, but I
> > > haven't added them since it's not essential at this point.
> > Yes, that would be good.
> > +	/*
> > +	 * Since we allow only one session can request to dump memory context at
> > +	 * the same time, check whether the dump files already exist.
> > +	 */
> > +	while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
> > +	{
> > +		pg_usleep(1000000L);
> > +	}
> > If pg_get_backend_memory_contexts() is executed by two or more
> > sessions at the same time, it cannot be run exclusively in this way.
> > Currently it seems to cause a crash when do it so.
> > This is easy to reproduce and can be done as follows.
> > [session-1]
> > BEGIN;
> > LOCK TABLE t1;
> > [Session-2]
> > BEGIN;
> > LOCK TABLE t1; <- waiting
> > [Session-3]
> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> > [Session-4]
> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> > If you issue commit or abort at session-1, you will get SEGV.
> > Instead of checking for the existence of the file, it might be better
> > to use a hash (mcxtdumpHash) entry with LWLock.
>
> Thanks!
> > + if (proc == NULL)
> > + {
> > +     ereport(WARNING,
> > +             (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
> > +     return (Datum) 1;
> > + }
> > Shouldn't it clear the hash entry before return?
>
> Yeah. added codes for removing the entry.

+	entry = AddEntryToMcxtdumpHash(dst_pid);
+
+	/* Check whether the target process is PostgreSQL backend process. */
+	/* TODO: Check also whether backend or not. */
+	proc = BackendPidGetProc(dst_pid);
+
+	if (proc == NULL)
+	{
+		ereport(WARNING,
+				(errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+
+		LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+		if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL)
+			elog(WARNING, "hash table corrupted");
+
+		LWLockRelease(McxtDumpHashLock);
+
+		return (Datum) 1;
+	}

Why do you enter a useless entry and then remove it immediately?

+	PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+	{
+		SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);

"PROCSIG_DUMP_MEMORY" is somewhat misleading. How about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I thought that the hash table would prevent multiple requestors from
making a request at once, but the patch doesn't seem to do that.

+	/* Wait until target process finished dumping file. */
+	while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)

This needs an LWLock. And this could read the entry after it is reused
by another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET in between two successive checks on the member.

+	/*
+	 * Make dump file ends with 'D'.
+	 * This is checked by the caller when reading the file.
+	 */
+	fputc('E', fpout);

Which is right?

+	fputc('E', fpout);
+
+	CHECK_FOR_INTERRUPTS();

This means that the process accepts another request and rewrites the
file even while the first requester is reading it.
And, the file can be removed by the first requestor before the second requestor can read it. + mcxtdumpHash = ShmemInitHash("mcxtdump hash", + SHMEM_MEMCONTEXT_SIZE, + SHMEM_MEMCONTEXT_SIZE, The space needed for this hash doesn't seem to be secured. The hash is sized to 64 entries, so pg_get_backend_memory_contexts() may fail with ERROR "out of shared memory". The hash is used only to check if the dump file is completed and if ended with error. If we need only those, an shared byte array with the length of max_backend is sufficient. + PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid)); + { + SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId); + + /* Wait until target process finished dumping file. */ + while (entry->dump_status == MCXTDUMPSTATUS_NOTYET) + { + CHECK_FOR_INTERRUPTS(); + pg_usleep(10000L); + } + } + PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid)); + + if (entry->dump_status == MCXTDUMPSTATUS_ERROR) + { .. + if (stat(tmpfile, &stat_tmp) == 0) + unlink(tmpfile); + if (stat(dumpfile, &stat_tmp) == 0) + unlink(dumpfile); ... + return (Datum) 0; + } + + /* Read values from the dumped file and put them on tuplestore. */ + PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid); This means that if the function gets sigint before the dumper creates the file, the dumper can leave a dump file? > > + /* Wait until target process finished dumping file. */ > > + while (!entry->is_dumped) > > + { > > + CHECK_FOR_INTERRUPTS(); > > + pg_usleep(10000L); > > + } > > If the target is killed and exit before dumping the memory > > information, you're in an infinite loop here. > > So how about making sure that any process that has to stop before > > doing a memory dump changes the status of the hash (entry->is_dumped) > > before stopping and the caller detects it? > > I'm not sure it's best or not, but you might want to use something > > like the on_shmem_exit callback. > > Thanks for your idea! 
> Added a callback to change the status of the hash table entry. > > Although I think it's necessary to remove this callback when it > finished > processing memory dumping, on_shmem_exit() does not seem to have such > a function. > I used before_shmem_exit() since it has a corresponding function to > remove registered callback. > If it's inappropriate, I'm going to add a function removing the > registered callback of on_shmem_exit(). This seems to leave a file for the pid. > > In the current design, if the caller stops processing before reading > > the dumped file, you will have an orphaned file. > > It looks like the following. > > [session-1] > > BEGIN; > > LOCK TABKE t1; > > [Session-2] > > BEGIN; > > LOCK TABLE t1; <- waiting > > [Session-3] > > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); > > If you cancel or terminate the session-3, then issue commit or abort > > at session-1, you will get orphaned files in pg_memusage. > > So if you allow only one session can request to dump file, it could > > call pg_memusage_reset() before send the signal in this function. > > Although I'm going to allow only one session per one target process, > I'd like to allow running multiple pg_get_backend_memory_contexts() > which target process is different. > > Instead of calling pg_memusage_reset(), I added a callback for > cleaning up orphaned files and the elements of the hash table > using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and > PG_END_ENSURE_ERROR_CLEANUP(). > > I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP() > here since it can handle not only termination but also cancellation. > > Any thoughts? +/* + * pg_memusage_reset + * Remove the memory context dump files. + */ +void +pg_memusage_reset(int pid) The function name seem to represents somewhat different from what it does. 
I think we might need to step back to the basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.

- Dump file lifecycle or state-transition of the dumper

  Currently the lifecycle of a dump file, or the state-transition of
  the dumper process, doesn't seem to be well defined.

  The file is created only by the dumper.
  If the requestor reads the completed file, the reader removes it.
  If the dumper receives a cancel request, the dumper removes it if any.
  Of course the dumper removes it if it fails to complete the file.

- Better way of signalling between the requestor and the dumper

  I think there's no doubt about the request signal.

  About the completion signal, currently the requestor polls on a flag
  in a hash entry. I'm wondering if we can get rid of polling. The
  problem with doing that is the lack of a means for a dumper to know
  the requestor. We need to store the requestor pid or backend id in
  the shared hash entry.

  By the way, about the shared hash entry, it uses about 70kB for only
  64 entries, so it seems less efficient than a shared array that has
  MaxBackends entries. If we used the following array on shared memory,

  struct hoge
  {
      BackendId requestor[MAX_BACKENDS];
      int       status[MAX_BACKENDS];
      LWLock    lock;
  };

  this array has the size of 24 * MaxBackends + 16, i.e. 24kB for 1000
  backends. It could be on dsm since this feature is not used commonly.

- The way to cancel a request already made (or management of the dump
  state transition).

  Cancellation seems to contain some race conditions. But basically
  that could be done by sending a request signal after setting the
  hoge.requestor above to some special value, not needing a third type
  of signal. The special value should be different from the initial
  state, which signals that the process is accepting a new request.

As a whole, that would look like the following?

------------------------------------------------------------
Successful request.

Requestor                  dumper          state
                           [idle]          initial
[request] ------------------->             requestor pid/backendid
  -signal->
                           [dumping]
          <-signal-[done]
[read]
[done]    -------------------->            initial

------------------------------------------------------------
On failure, the dumper signals with setting state to initial.

[request] ------------------->             requestor pid/backendid
  -signal->
                           [dumping]
                           [failed]        initial
          <-signal-
                           (some other requestor might come meantime.)
<sees that the requestor is not me>
[failed]

------------------------------------------------------------
If the requestor wants to cancel the request, it sets the state to
'cancel' then signals.

Requestor                  dumper          state
                           [idle]          initial
[request] ------------------->             cancel
                           <if canceled. clean up>
                           [dumping]
                           <if canceled. clean up>
          <-signal-[done]
  -signal-><try to clean up>

Other aspects to consider?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-10-23 13:46, Kyotaro Horiguchi wrote: > Wait... > >> Attachments: >> 0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch > > For a moment I thought that the number is patch number but the > predecessors are 0002-Enabled..collect.patch and 0001-(same > name). It's not mandatory but we usually do as the follows and it's > the way of git. > > v1-0001-Enabled...collect.patch > v2-0001-Enabled...collect.patch > > The vn is added by -v option for git-format-patch. Sorry for the confusion. I'll follow that way next time. > At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia > <torikoshia@oss.nttdata.com> wrote in >> > > I added a shared hash table consisted of minimal members >> > > mainly for managing whether the file is dumped or not. >> > > Some members like 'loc' seem useful in the future, but I >> > > haven't added them since it's not essential at this point. >> > Yes, that would be good. >> > + /* >> > + * Since we allow only one session can request to dump >> > memory context at >> > + * the same time, check whether the dump files already exist. >> > + */ >> > + while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, >> > &stat_tmp) == 0) >> > + { >> > + pg_usleep(1000000L); >> > + } >> > If pg_get_backend_memory_contexts() is executed by two or more >> > sessions at the same time, it cannot be run exclusively in this way. >> > Currently it seems to cause a crash when do it so. >> > This is easy to reproduce and can be done as follows. >> > [session-1] >> > BEGIN; >> > LOCK TABKE t1; >> > [Session-2] >> > BEGIN; >> > LOCK TABLE t1; <- waiting >> > [Session-3] >> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); >> > [Session-4] >> > select * FROM pg_get_backend_memory_contexts(<pid of >> > session-2>); >> > If you issue commit or abort at session-1, you will get SEGV. >> > Instead of checking for the existence of the file, it might be better >> > to use a hash (mcxtdumpHash) entry with LWLock. >> >> Thanks! 
>> Added a LWLock and changed the way from checking the file existence >> to finding the hash entry. > >> > + if (proc == NULL) >> > + { >> > + ereport(WARNING, >> > + (errmsg("PID %d is not a PostgreSQL server >> > process", dst_pid))); >> > + return (Datum) 1; >> > + } >> > Shouldn't it clear the hash entry before return? >> >> Yeah. added codes for removing the entry. > > + entry = AddEntryToMcxtdumpHash(dst_pid); > + > + /* Check whether the target process is PostgreSQL backend process. > */ > + /* TODO: Check also whether backend or not. */ > + proc = BackendPidGetProc(dst_pid); > + > + if (proc == NULL) > + { > + ereport(WARNING, > + (errmsg("PID %d is not a PostgreSQL server process", dst_pid))); > + > + LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE); > + > + if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL) > + elog(WARNING, "hash table corrupted"); > + > + LWLockRelease(McxtDumpHashLock); > + > + return (Datum) 1; > + } > > Why do you enter a useles entry then remove it immedately? Do you mean I should check the process existence first since it enables us to skip entering hash entries? > > + PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) > Int32GetDatum(dst_pid)); > + { > + SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId); > > "PROCSIG_DUMP_MEMORY" is somewhat misleading. Hwo about > "PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"? I'll go with "PROCSIG_DUMP_MEMCXT". > > I thought that the hash table would prevent multiple reqestors from > making a request at once, but the patch doesn't seem to do that. > > + /* Wait until target process finished dumping file. */ > + while (entry->dump_status == MCXTDUMPSTATUS_NOTYET) > > This needs LWLock. And this could read the entry after reused by > another backend if the dumper process is gone. That isn't likely to > happen, but theoretically the other backend may set it to > MCXTDUMPSTATUS_NOTYET inbetween two successive check on the member. Thanks for your notification. 
I'll use an LWLock. > > + /* > + * Make dump file ends with 'D'. > + * This is checked by the caller when reading the file. > + */ > + fputc('E', fpout); > > Which is right? Sorry, the comment was wrong.. > > + fputc('E', fpout); > + > + CHECK_FOR_INTERRUPTS(); > > This means that the process accepts another request and rewrite the > file even while the first requester is reading it. And, the file can > be removed by the first requestor before the second requestor can read > it. I added CHECK_FOR_INTERRUPTS() here to make the dump cancellation possible, however, considering your indication, it needs to think about a way to handle only the dump cancellation. > > + mcxtdumpHash = ShmemInitHash("mcxtdump hash", > + SHMEM_MEMCONTEXT_SIZE, > + SHMEM_MEMCONTEXT_SIZE, > . > The space needed for this hash doesn't seem to be secured. The hash is > sized to 64 entries, so pg_get_backend_memory_contexts() may fail with > ERROR "out of shared memory". > > The hash is used only to check if the dump file is completed and if > ended with error. If we need only those, an shared byte array with the > length of max_backend is sufficient. Although there was a comment that controlling dump location may be a good idea, but it also seems possible to include the location information in the struct Hoge you suggested below. On Tue, Sep 1, 2020 at 12:03 AM Kasahara Tatsuhito <[hidden email]> wrote: | - send a signal to dump to a PID, it first record following information into the shared hash. | pid (specified pid) | loc (dump location, currently might be ASAP) > > + PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) > Int32GetDatum(dst_pid)); > + { > + SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId); > + > + /* Wait until target process finished dumping file. 
*/ > + while (entry->dump_status == MCXTDUMPSTATUS_NOTYET) > + { > + CHECK_FOR_INTERRUPTS(); > + pg_usleep(10000L); > + } > + } > + PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) > Int32GetDatum(dst_pid)); > + > + if (entry->dump_status == MCXTDUMPSTATUS_ERROR) > + { > .. > + if (stat(tmpfile, &stat_tmp) == 0) > + unlink(tmpfile); > + if (stat(dumpfile, &stat_tmp) == 0) > + unlink(dumpfile); > ... > + return (Datum) 0; > + } > + > + /* Read values from the dumped file and put them on tuplestore. */ > + PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid); > > This means that if the function gets sigint before the dumper creates > the file, the dumper can leave a dump file? In that case, the requestor removes the corresponding hash entry in the callback McxtReqKill, then the dumper who can not find the hash entry does not dump a file. However, I've now noticed that when the requestor gets sigint just after the dumper check, the dump file remains. In the current design, only the requestor can remove the dump file, but it seems necessary to allow the dumper to remove it. >> > + /* Wait until target process finished dumping file. */ >> > + while (!entry->is_dumped) >> > + { >> > + CHECK_FOR_INTERRUPTS(); >> > + pg_usleep(10000L); >> > + } >> > If the target is killed and exit before dumping the memory >> > information, you're in an infinite loop here. >> > So how about making sure that any process that has to stop before >> > doing a memory dump changes the status of the hash (entry->is_dumped) >> > before stopping and the caller detects it? >> > I'm not sure it's best or not, but you might want to use something >> > like the on_shmem_exit callback. >> >> Thanks for your idea! >> Added a callback to change the status of the hash table entry. >> >> Although I think it's necessary to remove this callback when it >> finished >> processing memory dumping, on_shmem_exit() does not seem to have such >> a function. 
>> I used before_shmem_exit() since it has a corresponding function to >> remove registered callback. >> If it's inappropriate, I'm going to add a function removing the >> registered callback of on_shmem_exit(). > > This seems to leave a file for the pid. As mentioned above, there can be a chance to remain files. > >> > In the current design, if the caller stops processing before reading >> > the dumped file, you will have an orphaned file. >> > It looks like the following. >> > [session-1] >> > BEGIN; >> > LOCK TABKE t1; >> > [Session-2] >> > BEGIN; >> > LOCK TABLE t1; <- waiting >> > [Session-3] >> > select * FROM pg_get_backend_memory_contexts(<pid of session-2>); >> > If you cancel or terminate the session-3, then issue commit or abort >> > at session-1, you will get orphaned files in pg_memusage. >> > So if you allow only one session can request to dump file, it could >> > call pg_memusage_reset() before send the signal in this function. >> >> Although I'm going to allow only one session per one target process, >> I'd like to allow running multiple pg_get_backend_memory_contexts() >> which target process is different. >> >> Instead of calling pg_memusage_reset(), I added a callback for >> cleaning up orphaned files and the elements of the hash table >> using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and >> PG_END_ENSURE_ERROR_CLEANUP(). >> >> I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP() >> here since it can handle not only termination but also cancellation. >> >> Any thoughts? > > +/* > + * pg_memusage_reset > + * Remove the memory context dump files. > + */ > +void > +pg_memusage_reset(int pid) > > The function name seem to represents somewhat different from what it > does. Yeah, It looks like it actually reset the memory. I'll rename it to remove_memcxt_file or something. 
> > I think we might need to step-back to basic design of this feature > since this patch seems to have unhandled corner cases that are > difficult to find. Agreed and thanks for writing it down below. > - Dump file lifecycle or state-transition of the dumper > > Currently the lifecycle of a dump file, or the state-transition of > the dumper process doesn't seem to be well defined. > > The file is create only by the dumper. > If the requestor reads the completed file, the reader removes it. > > If the dumper receives a cancel request, the dumper removes it if > any. > > Of course dumper removes it if it fails to complete the file. > > - Better way of signalling between the requestor and the dumper > > I think there's no doubt about request signal. > > About the complete signal, currently the requestor polls on a flag > in a hash entry. I'm wondering if we can get rid of polling. The > problem on doing that is the lack of a means for a dumper to know > the requestor. We need to store requestor pid or backend id in the > shared hash entry. Agreed to get rid of polling. BTW, it seems common to use a latch instead of pg_usleep() to wait until signals arrive as far as I read latch.h. I'm now thinking about using a latch here and it would make polling removed. > By the way, about shared hash entry, it uses about 70kB for only 64 > entries so it seems inefficient than a shared array that has > MaxBackends entries. If we used a following array on shared memory, > > struct hoge > { > BackendId requestor[MAX_BACKENDS]; > int status[MAX_BACKENDS]; > LWLock lock; > }; If the requestor's id is 5 and dumper's id is 10, is this struct used like below? - hoge.requestor[10] = 5 - Both status[5] and status[10] change like "request", "idle" or "done" > This array has the size of 24 * MaxBackends + 16. 24kB for 1000 > backends. It could be on dsm since this feature is not used > commonly. Sorry but I'm not sure this calculation. Do you mean 16.24kB for 1000 backends? 
Regarding dsm, do you imagine using hoge on dsm? > > - The way to cancel a request already made. (or management of the dump > state transition.) > > Cancellation seems to contain some race conditions. But basically > that could be done by sending a request signal after setting the > hoge.requestor above to some special value, not needing the third I imagined hoge.requestor was set to requestor's backendid. Isn't it hoge.status? > type of signal. The special value should be different from the > initial state, which signals that the process is accepting a new > request. > > As the whole, that would looks like the folloing? > > ------------------------------------------------------------ > Successful request. > > Requestor dumper state > [idle] initial > [request] -------------------> requestor pid/backendid > -signal-> > [dumping] > <-signal-[done] > [read] > [done] --------------------> initial > > ------------------------------------------------------------ > On failure, the dumper signals with setting state to initial. > [request] -------------------> requestor pid/backendid > -signal-> > [dumping] > [failed] initial > <-signal- > (some other requestor might come meantime.) > <sees that the requestor is not me> Is this "some other requestor come case" relevant to the dumper failure? Regards, -- Atsushi Torikoshi > [failed] > > ------------------------------------------------------------ > If the requestor wants to cancel the request, it sets the state to > 'cancel' then signal. > > Requestor dumper state > [idle] initial > [request] -------------------> cancel > <if canceled. clean up> > [dumping] > <if canceled. clean up> > <-signal-[done] > > -signal-><try to clean up> > > > Other aspects to cnosider? > > regards.
Hi, I noticed that this patch fails on the cfbot. For this, I changed the status to: 'Waiting on Author'. Cheers, //Georgios The new status of this patch is: Waiting on Author
On 2020-10-28 15:32, torikoshia wrote: > On 2020-10-23 13:46, Kyotaro Horiguchi wrote: >> I think we might need to step-back to basic design of this feature >> since this patch seems to have unhandled corner cases that are >> difficult to find. I've written out the basic design below and attached a corresponding patch. # Communication flow between the dumper and the requestor - (1) When requesting memory context dumping, the requestor adds an entry to the shared memory. The entry manages the dump state and it is set to 'REQUESTING'. - (2) The requestor sends the signal to the dumper and waits on the latch. - (3) The dumper looks into the corresponding shared memory entry and changes its state to 'DUMPING'. - (4) When the dumper completes dumping, it changes the state to 'DONE' and sets the latch. - (5) The requestor reads the dump file and shows it to the user. Finally, the requestor removes the dump file and resets the shared memory entry. # Query cancellation - When the requestor cancels dumping, e.g. signaling using ctrl-C, the requestor changes the status of the shared memory entry to 'CANCELING'. - The dumper checks the status when it tries to change the state to 'DONE' at (4), and if the state is 'CANCELING', it removes the dump file and resets the shared memory entry. # Cleanup of the dump file and the shared memory entry - In the normal case, the requestor removes the dump file and resets the shared memory entry as described in (5). - When something like query cancellation or process termination happens on the dumper after (1) and before (3), in other words, while the state is 'REQUESTING', the requestor does the cleanup. - When something happens on the dumper or the requestor after (3) and before (4), in other words, while the state is 'DUMPING', the dumper does the cleanup. Specifically, if the requestor cancels the query, it just changes the state to 'CANCELING' and the dumper notices it and cleans up things later. 
OTOH, when the dumper fails to dump, it cleans up the dump file and deletes the entry on the shared memory. - When something happens on the requestor after (4), i.e., the state is 'DONE', the requestor does the cleanup. - In the case of receiving SIGKILL or power failure, all dump files are removed in the crash recovery process. Although there was a suggestion that the shared memory hash table should be changed to a more efficient structure, I haven't done it in this patch. I think it can be treated separately; I'm going to work on that later. On 2020-11-11 00:07, Georgios Kokolatos wrote: > Hi, > > I noticed that this patch fails on the cfbot. > For this, I changed the status to: 'Waiting on Author'. > > Cheers, > //Georgios > > The new status of this patch is: Waiting on Author Thanks for your notification; I updated the patch. Changed the status to: 'Waiting on Author'. Regards, -- Atsushi Torikoshi
Attachment
On 2020/11/16 19:58, torikoshia wrote: > On 2020-10-28 15:32, torikoshia wrote: >> On 2020-10-23 13:46, Kyotaro Horiguchi wrote: > >>> I think we might need to step-back to basic design of this feature >>> since this patch seems to have unhandled corner cases that are >>> difficult to find. > > I've written out the basic design below and attached > corresponding patch. I'm starting to study how this feature behaves. At first, when I executed the following query, the function never returned. ISTM that since the autovacuum launcher cannot respond to the request of a memory contexts dump, the function keeps waiting infinitely. Is this a bug? Probably we should exclude non-backend processes from the target processes to dump? Sorry if this was already discussed. SELECT pg_get_backend_memory_contexts(pid) FROM pg_stat_activity; Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Fujii Masao <masao.fujii@oss.nttdata.com> writes: > I'm starting to study how this feature behaves. At first, when I executed > the following query, the function never returned. ISTM that since > the autovacuum launcher cannot respond to the request of memory > contexts dump, the function keeps waiting infinity. Is this a bug? > Probably we should exclude non-backend proceses from the target > processes to dump? Sorry if this was already discussed. > SELECT pg_get_backend_memory_contexts(pid) FROM pg_stat_activity; FWIW, I think this patch is fundamentally unsafe. It's got a lot of the same problems that I complained about w.r.t. the nearby proposal to allow cross-backend stack trace dumping. It does avoid the trap of thinking that it can do work in a signal handler, but instead it supposes that it can do work involving very high-level objects such as shared hash tables in anyplace that might execute CHECK_FOR_INTERRUPTS. That's never going to be safe: the only real expectation the system has is that CHECK_FOR_INTERRUPTS is called at places where our state is sane enough that a transaction abort can clean up. Trying to do things like taking LWLocks is going to lead to deadlocks or worse. We need not even get into the hard questions, such as what happens when one process or the other exits unexpectedly. I also find the idea that this should be the same SQL function as pg_get_backend_memory_contexts to be a seriously bad decision. That means that it's not possible to GRANT the right to examine only your own process's memory --- with this proposal, that means granting the right to inspect every other process as well. Beyond that, the fact that there's no way to restrict the capability to just, say, other processes owned by the same user means that it's not really safe to GRANT to non-superusers anyway. 
Even with such a restriction added, things are problematic, since for example it would be possible to inquire into the workings of a security-definer function executing in another process that nominally is owned by your user. Between the security and implementation issues here, I really think we'd be best advised to just reject the concept, period. regards, tom lane
On 2020-12-03 10:36, Tom Lane wrote: > Fujii Masao <masao.fujii@oss.nttdata.com> writes: >> I'm starting to study how this feature behaves. At first, when I >> executed >> the following query, the function never returned. ISTM that since >> the autovacuum launcher cannot respond to the request of memory >> contexts dump, the function keeps waiting infinity. Is this a bug? >> Probably we should exclude non-backend proceses from the target >> processes to dump? Sorry if this was already discussed. > >> SELECT pg_get_backend_memory_contexts(pid) FROM pg_stat_activity; Thanks for trying it! It was not discussed explicitly, and I was going to do it later as commented. > + /* TODO: Check also whether backend or not. */ > > FWIW, I think this patch is fundamentally unsafe. It's got a > lot of the same problems that I complained about w.r.t. the > nearby proposal to allow cross-backend stack trace dumping. > It does avoid the trap of thinking that it can do work in > a signal handler, but instead it supposes that it can do > work involving very high-level objects such as shared hash tables > in anyplace that might execute CHECK_FOR_INTERRUPTS. That's > never going to be safe: the only real expectation the system > has is that CHECK_FOR_INTERRUPTS is called at places where our > state is sane enough that a transaction abort can clean up. > Trying to do things like taking LWLocks is going to lead to > deadlocks or worse. We need not even get into the hard questions, > such as what happens when one process or the other exits > unexpectedly. Thanks for reviewing! I may misunderstand something, but the dumper works not at CHECK_FOR_INTERRUPTS but during the client read, i.e., ProcessClientReadInterrupt(). Is it also unsafe? BTW, since there was a comment that the shared hash table used too much memory, I'm now rewriting this patch not to use the shared hash table but a simpler static shared memory struct. 
> I also find the idea that this should be the same SQL function > as pg_get_backend_memory_contexts to be a seriously bad decision. > That means that it's not possible to GRANT the right to examine > only your own process's memory --- with this proposal, that means > granting the right to inspect every other process as well. > > Beyond that, the fact that there's no way to restrict the capability > to just, say, other processes owned by the same user means that > it's not really safe to GRANT to non-superusers anyway. Even with > such a restriction added, things are problematic, since for example > it would be possible to inquire into the workings of a > security-definer function executing in another process that > nominally is owned by your user. I'm going to change the function name and restrict the executor to superusers. Is it enough? Regards,
On 2020-12-04 19:16, torikoshia wrote: > On 2020-12-03 10:36, Tom Lane wrote: >> Fujii Masao <masao.fujii@oss.nttdata.com> writes: >>> I'm starting to study how this feature behaves. At first, when I >>> executed >>> the following query, the function never returned. ISTM that since >>> the autovacuum launcher cannot respond to the request of memory >>> contexts dump, the function keeps waiting infinity. Is this a bug? >>> Probably we should exclude non-backend proceses from the target >>> processes to dump? Sorry if this was already discussed. >> >>> SELECT pg_get_backend_memory_contexts(pid) FROM >>> pg_stat_activity; > > Thanks for trying it! > > It was not discussed explicitly, and I was going to do it later > as commented. > >> + /* TODO: Check also whether backend or not. */ > >> >> FWIW, I think this patch is fundamentally unsafe. It's got a >> lot of the same problems that I complained about w.r.t. the >> nearby proposal to allow cross-backend stack trace dumping. >> It does avoid the trap of thinking that it can do work in >> a signal handler, but instead it supposes that it can do >> work involving very high-level objects such as shared hash tables >> in anyplace that might execute CHECK_FOR_INTERRUPTS. That's >> never going to be safe: the only real expectation the system >> has is that CHECK_FOR_INTERRUPTS is called at places where our >> state is sane enough that a transaction abort can clean up. >> Trying to do things like taking LWLocks is going to lead to >> deadlocks or worse. We need not even get into the hard questions, >> such as what happens when one process or the other exits >> unexpectedly. > > Thanks for reviewing! > > I may misunderstand something, but the dumper works not at > CHECK_FOR_INTERRUPTS but during the client read, i.e., > ProcessClientReadInterrupt(). > > Is it also unsafe? 
> > > BTW, since there was a comment that the shared hash table > used too much memory, I'm now rewriting this patch not to use > the shared hash table but a simpler static shared memory struct. Attached a rewritten patch. Accordingly, I also slightly modified the basic design as below. --- # Communication flow between the dumper and the requestor - (1) When requesting memory context dumping, the requestor changes the struct on the shared memory state from 'ACCEPTABLE' to 'REQUESTING'. - (2) The requestor sends the signal to the dumper process and waits on the latch. - (3) When the dumper notices the signal, it changes the state to 'DUMPING'. - (4) When the dumper completes dumping, it changes the state to 'DONE' and sets the latch. - (5) The requestor reads the dump file and shows it to the user. Finally, the requestor removes the dump file and resets the shared memory state to 'ACCEPTABLE'. # Query cancellation - When the requestor cancels dumping, e.g. signaling using ctrl-C, the requestor changes the state of the shared memory to 'CANCELING'. - The dumper checks the state when it tries to change the state to 'DONE' at (4), and if the state is 'CANCELING', it initializes the dump file and resets the shared memory state to 'ACCEPTABLE'. # Cleanup of the dump file and the shared memory - In the normal case, the requestor removes the dump file and resets the shared memory state as described in (5). - When something like query cancellation or process termination happens on the dumper after (1) and before (3), in other words, while the state is 'REQUESTING', the requestor does the cleanup. - When something happens on the dumper or the requestor after (3) and before (4), in other words, while the state is 'DUMPING', the dumper does the cleanup. Specifically, if the requestor cancels the query, it just changes the state to 'CANCELING' and the dumper notices it and cleans up things later. OTOH, when the dumper fails to dump, it cleans up the dump file and resets the shared memory state. 
- When something happens on the requestor after (4), i.e., the state is 'DONE', the requestor does the cleanup. - In the case of receiving SIGKILL or power failure, all dump files are removed in the crash recovery process. --- > >> I also find the idea that this should be the same SQL function >> as pg_get_backend_memory_contexts to be a seriously bad decision. >> That means that it's not possible to GRANT the right to examine >> only your own process's memory --- with this proposal, that means >> granting the right to inspect every other process as well. >> >> Beyond that, the fact that there's no way to restrict the capability >> to just, say, other processes owned by the same user means that >> it's not really safe to GRANT to non-superusers anyway. Even with >> such a restriction added, things are problematic, since for example >> it would be possible to inquire into the workings of a >> security-definer function executing in another process that >> nominally is owned by your user. > > I'm going to change the function name and restrict the executor to > superusers. Is it enough? In the attached patch, I changed the function name to pg_get_target_backend_memory_contexts() for now. Regards,
Attachment
Hi, On Thu, Dec 10, 2020 at 10:48 AM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2020-12-04 19:16, torikoshia wrote: > > On 2020-12-03 10:36, Tom Lane wrote: > >> Fujii Masao <masao.fujii@oss.nttdata.com> writes: > >>> I'm starting to study how this feature behaves. At first, when I > >>> executed > >>> the following query, the function never returned. ISTM that since > >>> the autovacuum launcher cannot respond to the request of memory > >>> contexts dump, the function keeps waiting infinity. Is this a bug? > >>> Probably we should exclude non-backend proceses from the target > >>> processes to dump? Sorry if this was already discussed. > >> > >>> SELECT pg_get_backend_memory_contexts(pid) FROM > >>> pg_stat_activity; > > > > Thanks for trying it! > > > > It was not discussed explicitly, and I was going to do it later > > as commented. > > > >> + /* TODO: Check also whether backend or not. */ > > > >> > >> FWIW, I think this patch is fundamentally unsafe. It's got a > >> lot of the same problems that I complained about w.r.t. the > >> nearby proposal to allow cross-backend stack trace dumping. > >> It does avoid the trap of thinking that it can do work in > >> a signal handler, but instead it supposes that it can do > >> work involving very high-level objects such as shared hash tables > >> in anyplace that might execute CHECK_FOR_INTERRUPTS. That's > >> never going to be safe: the only real expectation the system > >> has is that CHECK_FOR_INTERRUPTS is called at places where our > >> state is sane enough that a transaction abort can clean up. > >> Trying to do things like taking LWLocks is going to lead to > >> deadlocks or worse. We need not even get into the hard questions, > >> such as what happens when one process or the other exits > >> unexpectedly. > > > > Thanks for reviewing! > > > > I may misunderstand something, but the dumper works not at > > CHECK_FOR_INTERRUPTS but during the client read, i.e., > > ProcessClientReadInterrupt(). 
> > > > Is it also unsafe? > > > > > > BTW, since there was a comment that the shared hash table > > used too much memory, I'm now rewriting this patch not to use > > the shared hash table but a simpler static shared memory struct. > > Attached a rewritten patch. Thanks for updating the patch. But when I had applied the patch to the current HEAD and did make, I got an error due to duplicate OIDs. You need to rebase the patch. > Accordingly, I also slightly modified the basic design as below. > > --- > # Communication flow between the dumper and the requestor > - (1) When requesting memory context dumping, the requestor changes > the struct on the shared memory state from 'ACCEPTABLE' to > 'REQUESTING'. > - (2) The requestor sends the signal to the dumper process and waits on > the latch. > - (3) When the dumper notices the signal, it changes the state to > 'DUMPING'. > - (4) When the dumper completes dumping, it changes the state to > 'DONE' and sets the latch. > - (5) The requestor reads the dump file and shows it to the user. > Finally, the requestor removes the dump file and resets the shared > memory state to 'ACCEPTABLE'. > > # Query cancellation > - When the requestor cancels dumping, e.g. signaling using ctrl-C, > the requestor changes the state of the shared memory to 'CANCELING'. > - The dumper checks the state when it tries to change the state to > 'DONE' at (4), and if the state is 'CANCELING', it initializes the > dump file and resets the shared memory state to 'ACCEPTABLE'. > > # Cleanup dump file and the shared memory > - In the normal case, the requestor removes the dump file and resets > the shared memory state as described in (5). > - When something like query cancellation or process termination > happens on the dumper after (1) and before (3), in other words, > the state is 'REQUESTING', the requestor does the cleanup. 
> - When something happens on the dumper or the requestor after (3) > and before (4), in other words, the state is 'DUMPING', the dumper > does the cleanup. Specifically, if the requestor cancels the query, > it just changes the state to 'CANCELING' and the dumper notices it > and cleans up things later. > OTOH, when the dumper fails to dump, it cleans up the dump file and > resets the shared memory state. > - When something happens on the requestor after (4), i.e., the state > is 'DONE', the requestor does the cleanup. > - In the case of receiving SIGKILL or power failure, all dump files > are removed in the crash recovery process. > --- If the dumper is terminated before it dumps, the requestor will appear to enter an infinite loop because the status of mcxtdumpShmem will not change. The following are the steps to reproduce. - session1 BEGIN; LOCK TABLE t; - session2 SELECT * FROM t; -- wait - session3 select pg_get_target_backend_memory_contexts(<pid of session2>); -- wait - session1 select pg_terminate_backend(<pid of session2>); -- kill session2 - session3 waits forever. Therefore, you may need to set mcxtdumpShmem->dump_status to MCXTDUMPSTATUS_CANCELING or another status before the dumper terminates. Also, although I have not been able to reproduce it, I believe that with the current design, if the requestor disappears right after the dumper dumps the memory information, the dump file will remain. Since the current design appears to allow only one requestor per instance, when the requestor requests a dump, it might be a good idea to delete any remaining dump files, if any. The following are comments on the code. + proc = BackendPidGetProc(dst_pid); + + if (proc == NULL) + { + ereport(WARNING, + (errmsg("PID %d is not a PostgreSQL server process", dst_pid))); + + return (Datum) 1; + } For now, the background writer, checkpointer and WAL writer belong to the auxiliary processes. 
Therefore, if we specify the PIDs of these processes for pg_get_target_backend_memory_contexts(), "PID xxxx is not a PostgreSQL server process" would be output. This confuses the user. How about using AuxiliaryPidGetProc() to determine these processes? + ereport(INFO, + (errmsg("The request has failed and now PID %d is requsting dumping.", + mcxtdumpShmem->src_pid))); + + LWLockRelease(McxtDumpLock); You can release the LWLock before ereport. + Assert(mcxtdumpShmem->dump_status = MCXTDUMPSTATUS_REQUESTING); typo? It might be "mcxtdumpShmem->dump_status == MCXTDUMPSTATUS_REQUESTING". + ereport(INFO, + (errmsg("PID %d is looked up by another process", pid))); This message is not accurate. Because, in the current design, only one session can request a dump, so if there are multiple requests, this message will be output even if the request is for a PID that is not looked up. + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write memory context file \"%s\": %m", + dumpfile))); It would be better to use "dump file" instead of "memory context file". +static void +McxtReqKill(int code, Datum arg) The function McxtReqKill looks like it should have a different name. (Since it doesn't do any kill). How about McxtReqCleanup or something similar? Best regards, > >> I also find the idea that this should be the same SQL function > >> as pg_get_backend_memory_contexts to be a seriously bad decision. > >> That means that it's not possible to GRANT the right to examine > >> only your own process's memory --- with this proposal, that means > >> granting the right to inspect every other process as well. > >> > >> Beyond that, the fact that there's no way to restrict the capability > >> to just, say, other processes owned by the same user means that > >> it's not really safe to GRANT to non-superusers anyway.
Even with > >> such a restriction added, things are problematic, since for example > >> it would be possible to inquire into the workings of a > >> security-definer function executing in another process that > >> nominally is owned by your user. > > > > I'm going to change the function name and restrict the executor to > > superusers. Is it enough? > > In the attached patch, I changed the function name to > pg_get_target_backend_memory_contexts() for now. > > > Regards, -- Tatsuhito Kasahara kasahara.tatsuhito _at_ gmail.com
On Fri, Dec 25, 2020 at 6:08 PM Kasahara Tatsuhito <kasahara.tatsuhito@gmail.com> wrote: Thanks for reviewing and for the kind suggestions! >> Attached a rewritten patch. > Thanks for updating the patch. > > But when I had applied the patch to the current HEAD and did make, I > got an error due to duplicate OIDs. > You need to rebase the patch. Assigned another OID. >> Accordingly, I also slightly modified the basic design as below. >> >> --- >> # Communication flow between the dumper and the requestor >> - (1) When requesting memory context dumping, the requestor changes >> the struct on the shared memory state from 'ACCEPTABLE' to >> 'REQUESTING'. >> - (2) The requestor sends the signal to the dumper process and waits on >> the latch. >> - (3) When the dumper notices the signal, it changes the state to >> 'DUMPING'. >> - (4) When the dumper completes dumping, it changes the state to >> 'DONE' and sets the latch. >> - (5) The requestor reads the dump file and shows it to the user. >> Finally, the requestor removes the dump file and resets the shared >> memory state to 'ACCEPTABLE'. >> >> # Query cancellation >> - When the requestor cancels dumping, e.g. signaling using ctrl-C, >> the requestor changes the state of the shared memory to 'CANCELING'. >> - The dumper checks the state when it tries to change the state to >> 'DONE' at (4), and if the state is 'CANCELING', it initializes the >> dump file and resets the shared memory state to 'ACCEPTABLE'. >> >> # Cleanup dump file and the shared memory >> - In the normal case, the requestor removes the dump file and resets >> the shared memory state as described in (5). >> - When something like query cancellation or process termination >> happens on the dumper after (1) and before (3), in other words, >> the state is 'REQUESTING', the requestor does the cleanup. >> - When something happens on the dumper or the requestor after (3) >> and before (4), in other words, the state is 'DUMPING', the dumper >> does the cleanup. 
Specifically, if the requestor cancels the query, >> it just changes the state to 'CANCELING' and the dumper notices it >> and cleans up things later. >> OTOH, when the dumper fails to dump, it cleans up the dump file and >> resets the shared memory state. >> - When something happens on the requestor after (4), i.e., the state >> is 'DONE', the requestor does the cleanup. >> - In the case of receiving SIGKILL or power failure, all dump files >> are removed in the crash recovery process. >> --- > If the dumper is terminated before it dumps, the requestor will appear > to enter an > infinite loop because the status of mcxtdumpShmem will not change. > The following are the steps to reproduce. > > - session1 > BEGIN; LOCK TABLE t; > - session2 > SELECT * FROM t; -- wait > - session3 > select pg_get_target_backend_memory_contexts(<pid of session2>); > -- wait > - session1 > select pg_terminate_backend(<pid of session2>); -- kill session2 > - session3 waits forever. > > Therefore, you may need to set mcxtdumpShmem->dump_status to > MCXTDUMPSTATUS_CANCELING > or another status before the dumper terminates. In this case, it may be difficult for the dumper to change dump_status because it's waiting on the latch and dump_memory_contexts() is not called yet. Instead, it's possible for the requestor to periodically check whether the dumper process still exists while waiting. I added this logic to the attached patch. > Also, although I have not been able to reproduce it, I believe that > with the current design, > if the requestor disappears right after the dumper dumps the memory > information, > the dump file will remain. > Since the current design appears to allow only one requestor per > instance, when the requestor > requests a dump, it might be a good idea to delete any remaining dump > files, if any. Although I'm not sure in which cases the dump file remains, deleting any remaining dump files seems good for safety. I also added this idea to the attached patch. 
> The following are comments on the code. > > + proc = BackendPidGetProc(dst_pid); > + > + if (proc == NULL) > + { > + ereport(WARNING, > + (errmsg("PID %d is not a PostgreSQL server process", > dst_pid))); > + > + return (Datum) 1; > + } > For now, the background writer, checkpointer and WAL writer belong to > the auxiliary processes. > Therefore, if we specify the PIDs of these processes for > pg_get_target_backend_memory_contexts(), > "PID xxxx is not a PostgreSQL server process" would be output. > This confuses the user. > How about using AuxiliaryPidGetProc() to determine these processes? Thanks, and I modified the patch to output the below message when it's an auxiliary process. | PID %d is not a PostgreSQL backend process but an auxiliary process. > > + ereport(INFO, > + (errmsg("The request has failed and now PID %d is > requsting dumping.", > + mcxtdumpShmem->src_pid))); > + > + LWLockRelease(McxtDumpLock); > You can release the LWLock before ereport. Modified to release the lock before ereport. > + Assert(mcxtdumpShmem->dump_status = MCXTDUMPSTATUS_REQUESTING); > typo? > It might be "mcxtdumpShmem->dump_status == MCXTDUMPSTATUS_REQUESTING". Oops, it's a serious typo. Fixed it. > + ereport(INFO, > + (errmsg("PID %d is looked up by another process", > pid))); > This message is not accurate. > Because, in the current design, only one session can request a dump, > so if there are multiple requests, this message will be output even if > the > request is for a PID that is not looked up. Exactly. Modified the message as below. | Only one session can dump at a time and another session is dumping currently. > + if (fpout == NULL) > + { > + ereport(LOG, > + (errcode_for_file_access(), > + errmsg("could not write memory context file \"%s\": > %m", > + dumpfile))); > It would be better to use "dump file" instead of "memory context file". > > +static void > +McxtReqKill(int code, Datum arg) > The function McxtReqKill looks like it should have a different name. 
> (Since it doesn't do any kill). > How about McxtReqCleanup or something similar? Thanks for your idea; I changed the name to McxtReqCleanup. Regards,
Attachment
Attached v7, which fixes recent conflicts. For simplicity, it also changes the behavior of the requestor when another requestor is already working: the v6 patch makes the requestor wait, while the v7 patch makes it quit. Regards, -- Atsushi Torikoshi
Attachment
Since pg_get_target_backend_memory_contexts() waits for the target process to dump memory, it could lead to a deadlock, as below. - session1 BEGIN; TRUNCATE t; - session2 BEGIN; TRUNCATE t; -- wait - session1 SELECT * FROM pg_get_target_backend_memory_contexts(<pid of session 2>); --wait Thanks for notifying me, Fujii-san. Attached v8 patch, which prohibits calling the function inside a transaction block. Regards, -- Atsushi Torikoshi
Attachment
On 2021/03/17 22:24, torikoshia wrote: > I remade the patch and introduced a function > pg_print_backend_memory_contexts(PID) which prints the memory contexts of > the specified PID to elog. Thanks for the patch! > =# SELECT pg_print_backend_memory_contexts(450855); > > ** log output ** > 2021-03-17 15:21:01.942 JST [450855] LOG: Printing memory contexts of PID 450855 > 2021-03-17 15:21:01.942 JST [450855] LOG: level: 0 TopMemoryContext: 68720 total in 5 blocks; 16312 free (15 chunks); 52408 used > 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 Prepared Queries: 65536 total in 4 blocks; 35088 free (14 chunks); 30448 used > 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used > ..(snip).. > 2021-03-17 15:21:01.942 JST [450855] LOG: level: 2 CachedPlanSource: 4096 total in 3 blocks; 680 free (0 chunks); 3416 used: PREPARE hoge_200 AS SELECT * FROM pgbench_accounts WHERE aid = 1111111111111111111111111111111111111... > 2021-03-17 15:21:01.942 JST [450855] LOG: level: 3 CachedPlanQuery: 4096 total in 3 blocks; 464 free (0 chunks); 3632 used > ..(snip).. > 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used > 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used > 2021-03-17 15:21:01.945 JST [450855] LOG: Grand total: 2802080 bytes in 1399 blocks; 480568 free (178 chunks); 2321512 used > > > As above, the output is almost the same as MemoryContextStatsPrint() > except for the way of expression of the level. > MemoryContextStatsPrint() uses indents, but > pg_print_backend_memory_contexts() writes it as "level: %d". This format looks better to me. > Since there was discussion that enlarging StringInfo may cause > errors on OOM[1], this patch calls elog for each context. 
> > As with MemoryContextStatsPrint(), each context shows 100 > children at most. > I once thought it should be configurable, but something like > pg_print_backend_memory_contexts(PID, num_children) needs to send > the 'num_children' from the requestor to the dumper and it seems to require > another infrastructure. > Creating a new GUC for this seems overkill. > If MemoryContextStatsPrint(), i.e. showing 100 children at most, is > enough, this hard limit may be acceptable. Can't this number be passed via shared memory? > Only superusers can call pg_print_backend_memory_contexts(). + /* Only allow superusers to signal superuser-owned backends. */ + if (superuser_arg(proc->roleId) && !superuser()) The patch seems to allow even a non-superuser to request printing the memory contexts if the target backend is owned by a non-superuser. Is this intentional? I think that only superusers should be allowed to execute pg_print_backend_memory_contexts(), whoever owns the target backend, because that function can cause lots of log messages. > I'm going to add documentation and regression tests. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-03-18 15:09, Fujii Masao wrote: Thanks for your comments! > On 2021/03/17 22:24, torikoshia wrote: >> I remade the patch and introduced a function >> pg_print_backend_memory_contexts(PID) which prints the memory contexts >> of >> the specified PID to elog. > > Thanks for the patch! > > >> =# SELECT pg_print_backend_memory_contexts(450855); >> >> ** log output ** >> 2021-03-17 15:21:01.942 JST [450855] LOG: Printing memory contexts >> of PID 450855 >> 2021-03-17 15:21:01.942 JST [450855] LOG: level: 0 >> TopMemoryContext: 68720 total in 5 blocks; 16312 free (15 chunks); >> 52408 used >> 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 Prepared >> Queries: 65536 total in 4 blocks; 35088 free (14 chunks); 30448 used >> 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 pgstat >> TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 >> chunks); 6784 used >> ..(snip).. >> 2021-03-17 15:21:01.942 JST [450855] LOG: level: 2 >> CachedPlanSource: 4096 total in 3 blocks; 680 free (0 chunks); 3416 >> used: PREPARE hoge_200 AS SELECT * FROM pgbench_accounts WHERE aid = >> 1111111111111111111111111111111111111... >> 2021-03-17 15:21:01.942 JST [450855] LOG: level: 3 >> CachedPlanQuery: 4096 total in 3 blocks; 464 free (0 chunks); 3632 >> used >> ..(snip).. >> 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 Timezones: >> 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used >> 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 ErrorContext: >> 8192 total in 1 blocks; 7928 free (5 chunks); 264 used >> 2021-03-17 15:21:01.945 JST [450855] LOG: Grand total: 2802080 >> bytes in 1399 blocks; 480568 free (178 chunks); 2321512 used >> >> >> As above, the output is almost the same as MemoryContextStatsPrint() >> except for the way of expression of the level. >> MemoryContextStatsPrint() uses indents, but >> pg_print_backend_memory_contexts() writes it as "level: %d". > > This format looks better to me. 
> > >> Since there was discussion about enlarging StringInfo may cause >> errors on OOM[1], this patch calls elog for each context. >> >> As with MemoryContextStatsPrint(), each context shows 100 >> children at most. >> I once thought it should be configurable, but something like >> pg_print_backend_memory_contexts(PID, num_children) needs to send >> the 'num_children' from requestor to dumper and it seems to require >> another infrastructure. >> Creating a new GUC for this seems overkill. >> If MemoryContextStatsPrint(), i.e. showing 100 children at most is >> enough, this hard limit may be acceptable. > > Can't this number be passed via shared memory? The attached patch uses static shared memory to pass the number. As documented, the current implementation allows that when multiple pg_print_backend_memory_contexts() are called in succession or simultaneously, max_children can be that of another pg_print_backend_memory_contexts(). I had tried to avoid this by adding some state information and using before_shmem_exit() in case of process termination for cleaning up the state information as in the patch I presented earlier, but since kill() returns success before the dumper's signal handler is called, it seemed there were times when we couldn't clean up the state. Since this happens only when multiple pg_print_backend_memory_contexts() are called and their specified numbers of children are different, and the effect is just an unintended number of children being printed, it might be acceptable. Or it might be better to wait for some seconds if num_children on shared memory is not the initialized value (meaning some other process is requesting to print memory contexts). >> Only superusers can call pg_print_backend_memory_contexts(). > > + /* Only allow superusers to signal superuser-owned backends. 
*/ > + if (superuser_arg(proc->roleId) && !superuser()) > > The patch seems to allow even non-superuser to request to print the > memory > contexts if the target backend is owned by non-superuser. Is this > intentional? > I think that only superuser should be allowed to execute > pg_print_backend_memory_contexts() whoever owns the target backend. > Because that function can cause lots of log messages. Thanks, it's not intentional, modified it. >> I'm going to add documentation and regression tests. Added them. Regards,
At Mon, 22 Mar 2021 15:09:58 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in > >> If MemoryContextStatsPrint(), i.e. showing 100 children at most is > >> enough, this hard limit may be acceptable. > > Can't this number be passed via shared memory? > > The attached patch uses static shared memory to pass the number. "pg_print_backend_memory_contexts" That name looks as if it returns the result as text when used on the command line. We could have pg_get_backend_memory_context(bool dump_to_log (or where to dump), int limit). Or couldn't we name it differently even in the case we add a separate function? +/* + * MaxChildrenPerContext + * Max number of children to print per one parent context. + */ +int *MaxChildrenPerContext = NULL; Perhaps it'd be better to have a struct even if it consists only of one member. (Aligned) C-int values are atomic so we can omit the McxtPrintLock. (I don't think it's a problem even if it is modified while reading^^:) + if(max_children <= 0) + { + ereport(WARNING, + (errmsg("%d is invalid value", max_children), + errhint("second parameter is the number of context and it must be set to a value greater than or equal to 1"))); It's annoying to choose a number large enough when I want to dump children unlimitedly. Couldn't we use 0 to specify "unlimited"? + (errmsg("%d is invalid value", max_children), + errhint("second parameter is the number of context and it must be set to a value greater than or equal to 1"))); For the main message, (I think) we usually spell the "%d is invalid value" as "maximum number of children must be positive" or such. For the hint, we don't need a copy of the primary section of the documentation here. I think we should ERROR out for invalid parameters, at least for max_children. I'm not sure about pid since we might call it based on pg_stat_activity.. + if(!SendProcSignal(pid, PROCSIG_PRINT_MEMORY_CONTEXT, InvalidBackendId)) We know the backendid of the process here. 
+ if (is_dst_stderr) + { + for (i = 0; i <= level; i++) + fprintf(stderr, " "); The fprintf path is used nowhere in the patch at all. It can be used while attaching a debugger but I'm not sure we need that code. The footprint of this patch is largely shrunk by removing it. + strcat(truncated_ident, delimiter); strcpy is sufficient here. And we don't need the delimiter to be a variable. (we can copy a string literal into truncated_ident, then count the length of truncated_ident, instead of the delimiter variable.) + $current_logfiles = slurp_file($node->data_dir . '/current_logfiles'); ... +my $lfname = $current_logfiles; +$lfname =~ s/^stderr //; +chomp $lfname; $node->logfile is the current log file name. + 'target PID is not PostgreSQL server process'); Maybe "check if PID check is working" or such? And, we can do something like the following to exercise in a more practical way. select pg_print_backend...(pid,) from pg_stat_activity where backend_type = 'checkpointer'; > As documented, the current implementation allows that when multiple > pg_print_backend_memory_contexts() called in succession or > simultaneously, max_children can be the one of another > pg_print_backend_memory_contexts(). > I had tried to avoid this by adding some state information and using > before_shmem_exit() in case of process termination for cleaning up the > state information as in the patch I presented earlier, but since > kill() > returns success before the dumper called signal handler, it seemed > there were times when we couldn't clean up the state. > Since this happens only when multiple > pg_print_backend_memory_contexts() > are called and their specified number of children are different, and > the > effect is just the not intended number of children to print, it might > be > acceptable. I see it as a non-issue. Even though the behavior looks somewhat strange, that usage is stranger than the behavior. 
> Or it might be better to wait for some seconds if num_chilren on > shared > memory is not the initialized value(meaning some other process is > requesting to print memory contexts). > > >> Only superusers can call pg_print_backend_memory_contexts(). > > + /* Only allow superusers to signal superuser-owned backends. */ > > + if (superuser_arg(proc->roleId) && !superuser()) > > The patch seems to allow even non-superuser to request to print the > > memory > > contexts if the target backend is owned by non-superuser. Is this > > intentional? > > I think that only superuser should be allowed to execute > > pg_print_backend_memory_contexts() whoever owns the target backend. > > Because that function can cause lots of log messages. > > Thanks, it's not intentional, modified it. By the way we can send a procsig to other than a client backend. And most of the postgres processes use the standard signal handler and initialize the procsig facility. So some of those processes can dump the memory context information as-is. Otherwise we can add CHECK_FOR_INTERRUPTS() to an appropriate place to allow that. I'm not sure how useful it is for other kinds of processes, but it might be useful for autovacuum workers. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021-03-23 17:24, Kyotaro Horiguchi wrote: Thanks for reviewing and suggestions! > At Mon, 22 Mar 2021 15:09:58 +0900, torikoshia > <torikoshia@oss.nttdata.com> wrote in >> >> If MemoryContextStatsPrint(), i.e. showing 100 children at most is >> >> enough, this hard limit may be acceptable. >> > Can't this number be passed via shared memory? >> >> The attached patch uses static shared memory to pass the number. > > "pg_print_backend_memory_contexts" > > That name looks like as if it returns the result as text when used on > command-line. We could have pg_get_backend_memory_context(bool > dump_to_log (or where to dump), int limit). Or couldn't we name it > differently even in the ase we add a separate function? Redefined pg_get_backend_memory_contexts() as pg_get_backend_memory_contexts(pid, int max_children). When pid equals 0, pg_get_backend_memory_contexts() prints local memory contexts as original pg_get_backend_memory_contexts() does. In this case, 'max_children' is ignored. When 'pid' does not equal 0 and it is the PID of the client backend, memory contexts are logged through elog(). > > +/* > + * MaxChildrenPerContext > + * Max number of children to print per one parent context. > + */ > +int *MaxChildrenPerContext = NULL; > > Perhaps it'd be better to have a struct even if it consists only of > one member. (Aligned) C-int values are atomic so we can omit the > McxtPrintLock. (I don't think it's a problem even if it is modifed > while reading^^:) Fixed them. > + if(max_children <= 0) > + { > + ereport(WARNING, > + (errmsg("%d is invalid value", > max_children), > + errhint("second parameter is the number > of context and it must > be set to a value greater than or equal to 1"))); > > It's annoying to choose a number large enough when I want to dump > children unlimitedly. Couldn't we use 0 to specify "unlimited"? Modified as you suggested. 
> + (errmsg("%d is invalid value", > max_children), > + errhint("second parameter is the number > of context and it must > be set to a value greater than or equal to 1"))); > > For the main message, (I think) we usually spell the "%d is invalid > value" as "maximum number of children must be positive" or such. For > the hint, we don't need a copy of the primary section of the > documentation here. Modified it to "The maximum number of children must be greater than 0". > > I think we should ERROR out for invalid parameters, at least for > max_children. I'm not sure about pid since we might call it based on > pg_stat_activity.. Changed to ERROR out when the 'max_children' is less than 0. Regarding pid, I left it untouched considering the consistency with other signal sending functions such as pg_cancel_backend(). > + if(!SendProcSignal(pid, PROCSIG_PRINT_MEMORY_CONTEXT, > InvalidBackendId)) > > We know the backendid of the process here. Added it. > > + if (is_dst_stderr) > + { > + for (i = 0; i <= level; i++) > + fprintf(stderr, " "); > > The fprintf path is used nowhere in the patch at all. It can be used > while attaching debugger but I'm not sure we need that code. The > footprint of this patch is largely shrinked by removing it. According to the past discussion[1], people wanted MemoryContextStats as it was, so I think it's better that MemoryContextStats can be used as before. > > + strcat(truncated_ident, delimiter); > > strcpy is sufficient here. And we don't need the delimiter to be a > variable. (we can copy a string literal into truncate_ident, then > count the length of truncate_ident, instead of the delimiter > variable.) True. > + $current_logfiles = slurp_file($node->data_dir . > '/current_logfiles'); > ... > +my $lfname = $current_logfiles; > +$lfname =~ s/^stderr //; > +chomp $lfname; > > $node->logfile is the current log file name. > > + 'target PID is not PostgreSQL server process'); > > Maybe "check if PID check is working" or such? 
And, we can do > something like the following to exercise in a more practical way. > > select pg_print_backend...(pid,) from pg_stat_activity where > backend_type = 'checkpointer'; It seems better. >> As documented, the current implementation allows that when multiple >> pg_print_backend_memory_contexts() called in succession or >> simultaneously, max_children can be the one of another >> pg_print_backend_memory_contexts(). >> I had tried to avoid this by adding some state information and using >> before_shmem_exit() in case of process termination for cleaning up the >> state information as in the patch I presented earlier, but since >> kill() >> returns success before the dumper called signal handler, it seemed >> there were times when we couldn't clean up the state. >> Since this happens only when multiple >> pg_print_backend_memory_contexts() >> are called and their specified number of children are different, and >> the >> effect is just the not intended number of children to print, it might >> be >> acceptable. > > I see it as a non-issue. Even though the behavior looks somewhat > strange, that usage is stranger than the behavior. Thanks for your comments! >> Or it might be better to wait for some seconds if num_chilren on >> shared >> memory is not the initialized value(meaning some other process is >> requesting to print memory contexts). >> >> >> Only superusers can call pg_print_backend_memory_contexts(). >> > + /* Only allow superusers to signal superuser-owned backends. */ >> > + if (superuser_arg(proc->roleId) && !superuser()) >> > The patch seems to allow even non-superuser to request to print the >> > memory >> > contexts if the target backend is owned by non-superuser. Is this >> > intentional? >> > I think that only superuser should be allowed to execute >> > pg_print_backend_memory_contexts() whoever owns the target backend. >> > Because that function can cause lots of log messages. >> >> Thanks, it's not intentional, modified it. 
> > By the way we can send a procsig to other than a client backend. And > most of the postgres processes are using the standard signal handler > and intializes the procsig facility. So some of such kind of processes > can dump the memory context information as-is. Otherwise we can add > CHECK_FOR_INTERRUPT to appropriate place to allow that. I'm not sure > how it is useful for other kind of processes, but it might be useful > for autovacuum workers. Yeah, I also think it's possible to get memory contexts other than the client backend. But I'm not sure whether people want to do it... [1] https://www.postgresql.org/message-id/906.1513707472%40sss.pgh.pa.us regards.
On 2021/03/25 0:17, torikoshia wrote: > On 2021-03-23 17:24, Kyotaro Horiguchi wrote: > > Thanks for reviewing and suggestions! The patched version failed to compile as follows. Could you fix this issue? mcxtfuncs.c:22:10: fatal error: utils/mcxtfuncs.h: No such file or directory #include "utils/mcxtfuncs.h" ^~~~~~~~~~~~~~~~~~~ compilation terminated. make[4]: *** [<builtin>: mcxtfuncs.o] Error 1 make[4]: *** Waiting for unfinished jobs.... make[3]: *** [../../../src/backend/common.mk:39: adt-recursive] Error 2 make[3]: *** Waiting for unfinished jobs.... make[2]: *** [common.mk:39: utils-recursive] Error 2 make[1]: *** [Makefile:42: all-backend-recurse] Error 2 make: *** [GNUmakefile:11: all-src-recurse] Error 2 https://cirrus-ci.com/task/4621477321375744 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-03-25 22:02, Fujii Masao wrote: > On 2021/03/25 0:17, torikoshia wrote: >> On 2021-03-23 17:24, Kyotaro Horiguchi wrote: >> >> Thanks for reviewing and suggestions! > > The patched version failed to be compiled as follows. Could you fix > this issue? Sorry, it included a header file that's not part of the current version of the patch. Attached new one. > mcxtfuncs.c:22:10: fatal error: utils/mcxtfuncs.h: No such file or > directory > #include "utils/mcxtfuncs.h" > ^~~~~~~~~~~~~~~~~~~ > compilation terminated. > make[4]: *** [<builtin>: mcxtfuncs.o] Error 1 > make[4]: *** Waiting for unfinished jobs.... > make[3]: *** [../../../src/backend/common.mk:39: adt-recursive] Error 2 > make[3]: *** Waiting for unfinished jobs.... > make[2]: *** [common.mk:39: utils-recursive] Error 2 > make[1]: *** [Makefile:42: all-backend-recurse] Error 2 > make: *** [GNUmakefile:11: all-src-recurse] Error 2 > > https://cirrus-ci.com/task/4621477321375744 > > Regards,
At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in > Attached new one. Thanks! - SELECT * FROM pg_get_backend_memory_contexts(); + SELECT * FROM pg_get_backend_memory_contexts(0, 0); However we can freely change the signature since it's new in 14, but I see the (void) signature as useful. I vaguely thought of defining pg_get_backend_memory_contexts(void) in pg_proc.dat then defining (int, int) as a different function in system_views.sql. That allows the function of the second signature to output nothing. You can make a distinction between the two signatures by using PG_NARGS() in the C function. That being said, it's somewhat uneasy that two functions with the same name return different values. I'd like to hear what others feel about the behavior. +# print memory context of specified backend + This looks like garbage. + if( pid == 0) Odd spacing:p +note "current_logfiles = $current_logfiles"; + +like( + $current_logfiles, + qr|^stderr log/postgresql-.*log$|, + 'current_logfiles is sane'); + +my $lfname = $current_logfiles; +$lfname =~ s/^stderr //; +chomp $lfname; + +my $logfile; +my $print_count; + +# Verify that the backtraces of the processes are logged into logfile. +for (my $attempts = 0; $attempts < $max_attempts; $attempts++) +{ + $logfile = $node->data_dir . '/' . $lfname; It seems that you forgot to include the change to this part in v4. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/03/26 12:01, Kyotaro Horiguchi wrote: > At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in >> Attached new one. Regarding the argument max_children, isn't it better to set its default value, e.g., 100 like MemoryContextStats() uses? + ereport(WARNING, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be a superuser to log memory contexts"))); IMO this type of error, i.e., permission error, should be treated as ERROR rather than WARNING, like pg_terminate_backend() does. + ereport(WARNING, + (errmsg("failed to send signal: %m"))); Per message style rule, "could not send signal to process %d: %m" is better? > Thanks! > > - SELECT * FROM pg_get_backend_memory_contexts(); > + SELECT * FROM pg_get_backend_memory_contexts(0, 0); > > However we can freely change the signature since it's new in 14, but I > see the (void) signature as useful. I vaguely thought of defining > pg_get_backend_memory_contexts(void) in pg_proc.dat then defining > (int, int) as a different function in system_views.sql. That allows > the function of the second signature to output nothing. You can make a > distinction between the two signatures by using PG_NARGS() in the C > function. > > That being said, it's somewhat uneasy that two functions with the same > name returns different values. I'd like to hear others what they feel > like about the behavior. I think that it's confusing to merge two different features into one function. Isn't it better to leave pg_get_backend_memory_contexts() as it is, and make the log-memory-contexts feature as separate function, e.g., pg_log_backend_memory_contexts()? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021/03/26 12:26, Fujii Masao wrote: > > > On 2021/03/26 12:01, Kyotaro Horiguchi wrote: >> At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in >>> Attached new one. > > Regarding the argument max_children, isn't it better to set its default value, > e.g., 100 like MemoryContextStats() uses? + mcxtLogData->maxChildrenPerContext = max_children; + + if(!SendProcSignal(pid, PROCSIG_LOG_MEMORY_CONTEXT, proc->backendId)) + PG_RETURN_BOOL(true); Do we need a memory barrier here? Otherwise, there is a race condition where maxChildrenPerContext has not been set yet when the target backend that received the signal reads that shared variable. No? + Note that when multiple + <function>pg_get_backend_memory_contexts</function> called in + succession or simultaneously, <parameter>max_children</parameter> can + be the one of another + <function>pg_get_backend_memory_contexts</function>. + </para></entry> This might not be an issue in practice as Horiguchi-san said upthread. But this looks a bit ugly and might be a footgun later. The current approach using shared memory to pass this information to the target backends might be overkill, and too complicated to polish the design at the stage just before feature freeze. So I'd withdraw my previous comment and am OK to use the hard-coded value as max_children at the first version of this feature. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
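The ordering question raised above is the classic message-passing pattern: the requester's store to the shared field must become visible before the target can act on the signal, which PostgreSQL would express with a pg_write_barrier()/pg_read_barrier() pair. A standalone sketch of the same idea using two processes, anonymous shared memory, and C11 release/acquire atomics (the `SharedArea` and `max_children` names are invented for illustration; this is not the patch's code):

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct
{
    int         max_children;   /* plain shared data */
    atomic_int  signal_pending; /* stand-in for the procsignal */
} SharedArea;

/* Returns 0 if the "target backend" observed the published value. */
int
run_demo(void)
{
    SharedArea *shm = mmap(NULL, sizeof(SharedArea),
                           PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (shm == MAP_FAILED)
        return -1;
    shm->max_children = 0;
    atomic_init(&shm->signal_pending, 0);

    pid_t       pid = fork();

    if (pid == 0)
    {
        /* Child = target backend: wait for the "signal", then read the
         * parameter.  The acquire load pairs with the release store, so
         * the data store is guaranteed to be visible by then. */
        while (atomic_load_explicit(&shm->signal_pending,
                                    memory_order_acquire) == 0)
            ;                   /* spin; a real backend sleeps instead */
        _exit(shm->max_children == 100 ? 0 : 1);
    }

    /* Parent = requester: publish the parameter, then "send the signal".
     * The release store is the pg_write_barrier() moment: the
     * max_children store cannot be reordered past it. */
    shm->max_children = 100;
    atomic_store_explicit(&shm->signal_pending, 1, memory_order_release);

    int         status;

    waitpid(pid, &status, 0);
    munmap(shm, sizeof(SharedArea));
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without the release/acquire pairing (or explicit barriers), nothing would stop the target from seeing the signal yet reading a stale max_children, which is exactly the race described above.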
At Fri, 26 Mar 2021 12:26:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/03/26 12:01, Kyotaro Horiguchi wrote: > > At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia > > <torikoshia@oss.nttdata.com> wrote in > >> Attached new one. > > Regarding the argument max_children, isn't it better to set its > default value, > e.g., 100 like MemoryContextStats() uses? > + ereport(WARNING, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("must be a superuser to log memory contexts"))); > > IMO this type of error, i.e., permission error, should be treated as > ERROR > rather than WARNING, like pg_terminate_backend() does. > + ereport(WARNING, > + (errmsg("failed to send signal: %m"))); > > Per message style rule, "could not send signal to process %d: %m" is > better? +1 x 3 for the above. > > Thanks! > > - SELECT * FROM pg_get_backend_memory_contexts(); > > + SELECT * FROM pg_get_backend_memory_contexts(0, 0); > > However we can freely change the signature since it's new in 14, but I > > see the (void) signature as useful. I vaguely thought of defining > > pg_get_backend_memory_contexts(void) in pg_proc.dat then defining > > (int, int) as a different function in system_views.sql. That allows > > the function of the second signature to output nothing. You can make a > > distinction between the two signatures by using PG_NARGS() in the C > > function. > > That being said, it's somewhat uneasy that two functions with the same > > name returns different values. I'd like to hear others what they feel > > like about the behavior. > > I think that it's confusing to merge two different features into one > function. > Isn't it better to leave pg_get_backend_memory_contexts() as it is, > and > make the log-memory-contexts feature as separate function, e.g., > pg_log_backend_memory_contexts()? Yeah, that name looks better than pg_print_ba.. and I agree to separate the two features. Sorry for wagging the discussion back-and-forth. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 26 Mar 2021 12:49:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > On 2021/03/26 12:26, Fujii Masao wrote: > > On 2021/03/26 12:01, Kyotaro Horiguchi wrote: > >> At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia > >> <torikoshia@oss.nttdata.com> wrote in > >>> Attached new one. > > Regarding the argument max_children, isn't it better to set its > > default value, > > e.g., 100 like MemoryContextStats() uses? > > + mcxtLogData->maxChildrenPerContext = max_children; > + > + if(!SendProcSignal(pid, PROCSIG_LOG_MEMORY_CONTEXT, > proc->backendId)) > + PG_RETURN_BOOL(true); > > Do we need memory barrier here? Otherwise, there is a race condition > where maxChildrenPerContext has not been set yet when the target > backend > that received signal read that shared variable. No? > > + Note that when multiple > + <function>pg_get_backend_memory_contexts</function> called in > + succession or simultaneously, <parameter>max_children</parameter> > can > + be the one of another > + <function>pg_get_backend_memory_contexts</function>. > + </para></entry> > > This might not be an issue in practice as Horiguchi-san said upthread. > But this looks a bit ugly and might be footgun later. The current > approach > using shared memory to pass this information to the target backends > might be overkill, and too complicated to polish the design at the > stage > just before feature freeze. So I'd withdraw my previous comment and > am OK to use the hard-coded value as max_children at the first version > of this feature. Thought? The dumper function silently suppresses logging when there are too many children. Suppressing a part of the output that the user didn't ask to suppress looks like just a misbehavior (even if it is written in the documentation..), especially when the suppressed contexts occupy the majority of memory usage. I think the fixed limit of lines works if the lines are in descending order of memory usage. 
At least we need an additional line to indicate the suppression. "some contexts are omitted" "n child contexts: total_bytes = ..." regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 26 Mar 2021 13:17:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Fri, 26 Mar 2021 12:49:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > On 2021/03/26 12:26, Fujii Masao wrote: > > > On 2021/03/26 12:01, Kyotaro Horiguchi wrote: > > >> At Thu, 25 Mar 2021 23:45:17 +0900, torikoshia > > >> <torikoshia@oss.nttdata.com> wrote in > > >>> Attached new one. > > > Regarding the argument max_children, isn't it better to set its > > > default value, > > > e.g., 100 like MemoryContextStats() uses? > > > > + mcxtLogData->maxChildrenPerContext = max_children; > > + > > + if(!SendProcSignal(pid, PROCSIG_LOG_MEMORY_CONTEXT, > > proc->backendId)) > > + PG_RETURN_BOOL(true); > > > > Do we need memory barrier here? Otherwise, there is a race condition > > where maxChildrenPerContext has not been set yet when the target > > backend > > that received signal read that shared variable. No? > > > > + Note that when multiple > > + <function>pg_get_backend_memory_contexts</function> called in > > + succession or simultaneously, <parameter>max_children</parameter> > > can > > + be the one of another > > + <function>pg_get_backend_memory_contexts</function>. > > + </para></entry> > > > > This might not be an issue in practice as Horiguchi-san said upthread. > > But this looks a bit ugly and might be footgun later. The current > > approach > > using shared memory to pass this information to the target backends > > might be overkill, and too complicated to polish the design at the > > stage > > just before feature freeze. So I'd withdraw my previous comment and > > am OK to use the hard-coded value as max_children at the first version > > of this feature. Thought? > > The dumper function silently suppresses logging when there are too > many children. 
Suppressing a part of output when the user didn't told > anything looks like just a misbehavior (even if it is written in the > documentation..), especially when the suppressed contexts occupy the > majority of memory usage. I think the fixed-limit of lines works if > the lines are in descending order of memory usage. > > At least we need an additional line to inform the suppression. > "some contexts are omitted" > "n child contexts: total_bytes = ..." Sorry, I missed that it is already implemented. So my opinion is I agree with limiting with a fixed number, and preferablly sorted in descending order of... totalspace/nblocks? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/03/26 13:28, Kyotaro Horiguchi wrote: >> "some contexts are omitted" >> "n child contexts: total_bytes = ..." > > Sorry I missed that is already implemented. So my opnion is I agree > with limiting with a fixed-number, and preferablly sorted in > descending order of... totalspace/nblocks? This may be an improvement, but it makes us modify MemoryContextStatsInternal() very much. I'm afraid that it's too late to do that at this stage... What about leaving the output order as it is in the first version? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Fri, 26 Mar 2021 14:02:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/03/26 13:28, Kyotaro Horiguchi wrote: > >> "some contexts are omitted" > >> "n child contexts: total_bytes = ..." > > Sorry I missed that is already implemented. So my opnion is I agree > > with limiting with a fixed-number, and preferablly sorted in > > descending order of... totalspace/nblocks? > > This may be an improvement, but makes us modify > MemoryContextStatsInternal() > very much. I'm afraid that it's too late to do that at this stage... > What about leaving the output order as it is at the first version? So I said "preferably":p (with a misspelling...) I'm fine with that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021-03-26 14:08, Kyotaro Horiguchi wrote: > At Fri, 26 Mar 2021 14:02:49 +0900, Fujii Masao > <masao.fujii@oss.nttdata.com> wrote in >> >> >> On 2021/03/26 13:28, Kyotaro Horiguchi wrote: >> >> "some contexts are omitted" >> >> "n child contexts: total_bytes = ..." >> > Sorry I missed that is already implemented. So my opnion is I agree >> > with limiting with a fixed-number, and preferablly sorted in >> > descending order of... totalspace/nblocks? >> >> This may be an improvement, but makes us modify >> MemoryContextStatsInternal() >> very much. I'm afraid that it's too late to do that at this stage... >> What about leaving the output order as it is at the first version? > > So I said "preferably":p (with a misspelling...) > I'm fine with that. > > regards. Thanks for the comments! Attached a new patch. It adds pg_log_backend_memory_contexts(pid) which logs memory contexts of the specified backend process. The number of child contexts to be logged per parent is limited to 100 as with MemoryContextStats(). As written in commit 7b5ef8f2d07, which limits the verbosity of memory context statistics dumps, it supposes that practical cases where the dump gets long will typically be huge numbers of siblings under the same parent context; while the additional debugging value from seeing details about individual siblings beyond 100 will not be large. Thoughts? Regards.
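The per-parent truncation behaviour adopted here — print up to 100 children, then fold the remainder into a single summary line, as MemoryContextStatsInternal() does — can be illustrated with a toy context tree. This is a standalone sketch, not PostgreSQL code; the `Context` struct, the returned line count, and the exact summary wording are invented for the example:

```c
#include <stdio.h>

#define MAX_CHILDREN_PER_CONTEXT 100

typedef struct Context
{
    const char *name;
    long        total_bytes;
    int         nchildren;
    struct Context *children;   /* array of nchildren entries */
} Context;

/* Print a context and its children; past the limit, emit one summary
 * line for the rest instead of a line per child.  Returns the number
 * of lines emitted, so a caller can verify the truncation. */
int
print_context(const Context *ctx, int level)
{
    int         lines = 1;
    int         skipped = 0;
    long        skipped_bytes = 0;

    printf("LOG: level: %d %s: %ld total bytes\n",
           level, ctx->name, ctx->total_bytes);

    for (int i = 0; i < ctx->nchildren; i++)
    {
        if (i < MAX_CHILDREN_PER_CONTEXT)
            lines += print_context(&ctx->children[i], level + 1);
        else
        {
            skipped++;
            skipped_bytes += ctx->children[i].total_bytes;
        }
    }

    /* The summary line keeps the totals honest even when the
     * individual siblings are not listed. */
    if (skipped > 0)
    {
        printf("LOG: level: %d %d more child contexts containing %ld total bytes\n",
               level + 1, skipped, skipped_bytes);
        lines++;
    }
    return lines;
}
```

With 150 siblings under one parent, this prints the parent line, 100 child lines, and one summary line — the shape commit 7b5ef8f2d07 argues is enough in practice, since pathological dumps usually come from huge numbers of siblings under one parent.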
On 2021/03/29 11:59, torikoshia wrote: > On 2021-03-26 14:08, Kyotaro Horiguchi wrote: >> At Fri, 26 Mar 2021 14:02:49 +0900, Fujii Masao >> <masao.fujii@oss.nttdata.com> wrote in >>> >>> >>> On 2021/03/26 13:28, Kyotaro Horiguchi wrote: >>> >> "some contexts are omitted" >>> >> "n child contexts: total_bytes = ..." >>> > Sorry I missed that is already implemented. So my opnion is I agree >>> > with limiting with a fixed-number, and preferablly sorted in >>> > descending order of... totalspace/nblocks? >>> >>> This may be an improvement, but makes us modify >>> MemoryContextStatsInternal() >>> very much. I'm afraid that it's too late to do that at this stage... >>> What about leaving the output order as it is at the first version? >> >> So I said "preferably":p (with a misspelling...) >> I'm fine with that. >> >> regards. > > Thanks for the comments! > > Attached a new patch. Thanks! > It adds pg_log_backend_memory_contexts(pid) which logs memory contexts > of the specified backend process. > > The number of child contexts to be logged per parent is limited to 100 > as with MemoryContextStats(). > > As written in commit 7b5ef8f2d07, which limits the verbosity of > memory context statistics dumps, it supposes that practical cases > where the dump gets long will typically be huge numbers of > siblings under the same parent context; while the additional > debugging value from seeing details about individual siblings > beyond 100 will not be large. > > Thoughts? I'm OK with 100. We should comment why we chose 100 for that. Here are some review comments. Isn't it better to move HandleProcSignalLogMemoryContext() and ProcessLogMemoryContextInterrupt() to mcxt.c from procsignal.c (like the functions for notify interrupt are defined in async.c) because they are the functions for memory contexts? + * HandleProcSignalLogMemoryContext + * + * Handle receipt of an interrupt indicating log memory context. + * Signal handler portion of interrupt handling. 
IMO it's better to comment why we need to separate the function into two, i.e., HandleProcSignalLogMemoryContext() and ProcessLogMemoryContextInterrupt(), like the comments for other similar functions explain. What about the following? ------------------------------- HandleLogMemoryContextInterrupt Handle receipt of an interrupt indicating logging of memory contexts. All the actual work is deferred to ProcessLogMemoryContextInterrupt(), because we cannot safely emit a log message inside the signal handler. ------------------------------- ProcessLogMemoryContextInterrupt Perform logging of memory contexts of this backend process. Any backend that participates in ProcSignal signaling must arrange to call this function if we see LogMemoryContextPending set. It is called from CHECK_FOR_INTERRUPTS(), which is enough because the target process for logging of memory contexts is a backend. ------------------------------- + if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT)) + HandleProcSignalLogMemoryContext(); + if (CheckProcSignal(PROCSIG_BARRIER)) HandleProcSignalBarrierInterrupt(); The code for the memory context logging interrupt comes after the barrier interrupt in other places, e.g., procsignal.h. Why is the order of the code different here? +/* + * pg_log_backend_memory_contexts + * Print memory context of the specified backend process. Isn't it better to move pg_log_backend_memory_contexts() to mcxtfuncs.c from mcxt.c because this is the SQL function for memory contexts? IMO we should comment why we allow only superusers to call this function. What about the following? ----------------- Signal a backend process to log its memory contexts. Only superusers are allowed to signal to log the memory contexts, because allowing any user to issue this request at an unbounded rate would cause lots of log messages, which can lead to denial of service. ----------------- + PGPROC *proc = BackendPidGetProc(pid); + + /* Check whether the target process is a PostgreSQL backend process. 
*/ + if (proc == NULL) What about adding more comments as follows? --------------------------------- + /* + * BackendPidGetProc returns NULL if the pid isn't valid; but by the time + * we reach kill(), a process for which we get a valid proc here might + * have terminated on its own. There's no way to acquire a lock on an + * arbitrary process to prevent that. But since this mechanism is usually + * used to debug a backend running and consuming lots of memory, + * that it might end on its own first and its memory contexts are not + * logged is not a problem. + */ + if (proc == NULL) + { + /* + * This is just a warning so a loop-through-resultset will not abort + * if one backend logged its memory contexts during the run. + */ + ereport(WARNING, --------------------------------- + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be a superuser to log memory contexts"))); + PG_RETURN_BOOL(false); This PG_RETURN_BOOL(false) is unnecessary because ereport() will emit an ERROR? +$node->psql('postgres', 'select pg_log_backend_memory_contexts(pg_backend_pid())'); + +my $logfile = slurp_file($node->logfile); +like($logfile, qr/Grand total: \d+ bytes in \d+ blocks/, + 'found expected memory context print in the log file'); Isn't there a case where the memory contexts have not been logged yet when slurp_file() is called? It's rare, though. If there is a risk of that, we should wait for the memory contexts to be actually logged? BTW, I agree that we should have a regression test that calls pg_log_backend_memory_contexts() and checks, e.g., that it doesn't cause a segmentation fault. But I'm just wondering if it's a bit overkill to write a perl script and start a new node to test only this function... Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-03-30 02:28, Fujii Masao wrote: Thanks for reviewing and kind suggestions! >> It adds pg_log_backend_memory_contexts(pid) which logs memory contexts >> of the specified backend process. >> >> The number of child contexts to be logged per parent is limited to 100 >> as with MemoryContextStats(). >> >> As written in commit 7b5ef8f2d07, which limits the verbosity of >> memory context statistics dumps, it supposes that practical cases >> where the dump gets long will typically be huge numbers of >> siblings under the same parent context; while the additional >> debugging value from seeing details about individual siblings >> beyond 100 will not be large. >> >> Thoughts? > > I'm OK with 100. We should comment why we chose 100 for that. Added following comments. + /* + * When a backend process is consuming huge memory, logging all its + * memory contexts might overrun available disk space. To prevent + * this, we limit the number of child contexts per parent to 100. + * + * As with MemoryContextStats(), we suppose that practical cases + * where the dump gets long will typically be huge numbers of + * siblings under the same parent context; while the additional + * debugging value from seeing details about individual siblings + * beyond 100 will not be large. + */ + MemoryContextStatsDetail(TopMemoryContext, 100, false); > > Here are some review comments. > > Isn't it better to move HandleProcSignalLogMemoryContext() and > ProcessLogMemoryContextInterrupt() to mcxt.c from procsignal.c > (like the functions for notify interrupt are defined in async.c) > because they are the functions for memory contexts? Agreed. Also renamed HandleProcSignalLogMemoryContext to HandleLogMemoryContextInterrupt. > + * HandleProcSignalLogMemoryContext > + * > + * Handle receipt of an interrupt indicating log memory context. > + * Signal handler portion of interrupt handling. 
> > IMO it's better to comment why we need to separate the function into > two, > i.e., HandleProcSignalLogMemoryContext() and > ProcessLogMemoryContextInterrupt(), > like the comment for other similar function explains. What about the > followings? Thanks! Changed them to the suggested one. > ------------------------------- > HandleLogMemoryContextInterrupt > > Handle receipt of an interrupt indicating logging of memory contexts. > > All the actual work is deferred to ProcessLogMemoryContextInterrupt(), > because we cannot safely emit a log message inside the signal handler. > ------------------------------- > ProcessLogMemoryContextInterrupt > > Perform logging of memory contexts of this backend process. > > Any backend that participates in ProcSignal signaling must arrange to > call this function if we see LogMemoryContextPending set. It is called > from CHECK_FOR_INTERRUPTS(), which is enough because the target process > for logging of memory contexts is a backend. > ------------------------------- > > > + if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT)) > + HandleProcSignalLogMemoryContext(); > + > if (CheckProcSignal(PROCSIG_BARRIER)) > HandleProcSignalBarrierInterrupt(); > > The code for memory context logging interrupt came after barrier > interrupt > in other places, e.g., procsignal.h. Why is this order of code > different? Fixed. > +/* > + * pg_log_backend_memory_contexts > + * Print memory context of the specified backend process. > > Isn't it better to move pg_log_backend_memory_contexts() to mcxtfuncs.c > from mcxt.c because this is the SQL function memory contexts? Agreed. > IMO we should comment why we allow only superuser to call this > function. > What about the following? Thanks! Modified the patch according to the suggestions. > ----------------- > Signal a backend process to log its memory contexts. 
> > Only superusers are allowed to signal to log the memory contexts > because allowing any users to issue this request at an unbounded rate > would cause lots of log messages and which can lead to denial of > service. > ----------------- > > + PGPROC *proc = BackendPidGetProc(pid); > + > + /* Check whether the target process is PostgreSQL backend process. */ > + if (proc == NULL) > > What about adding more comments as follows? > > --------------------------------- > + /* > + * BackendPidGetProc returns NULL if the pid isn't valid; but by the > time > + * we reach kill(), a process for which we get a valid proc here > might > + * have terminated on its own. There's no way to acquire a lock on > an > + * arbitrary process to prevent that. But since this mechanism is > usually > + * used to debug a backend running and consuming lots of memory, > + * that it might end on its own first and its memory contexts are not > + * logged is not a problem. > + */ > + if (proc == NULL) > + { > + /* > + * This is just a warning so a loop-through-resultset will not abort > + * if one backend logged its memory contexts during the run. > + */ > + ereport(WARNING, > --------------------------------- > > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("must be a superuser to log memory contexts"))); > + PG_RETURN_BOOL(false); > > This PG_RETURN_BOOL(false) is unnecessary because ereport() will emit > an ERROR? > > +$node->psql('postgres', 'select > pg_log_backend_memory_contexts(pg_backend_pid())'); > + > +my $logfile = slurp_file($node->logfile); > +like($logfile, qr/Grand total: \d+ bytes in \d+ blocks/, > + 'found expected memory context print in the log file'); > > Isn't there the case where the memory contexts have not been logged yet > when slurp_file() is called? It's rare, though. If there is the risk > for that, > we should wait for the memory contexts to be actually logged? 
> > BTW, I agree that we should have the regression test that calls > pg_log_backend_memory_contexts() and checks, e.g., that it doesn't > cause > segmentation fault. But I'm just wondering if it's a bit overkill to > make perl > scriptand start new node to test only this function... OK, I removed the perl TAP test and added a regression test running pg_log_backend_memory_contexts(pg_backend_pid()) and checking it returns 't'. Regards.
On 2021/03/30 22:06, torikoshia wrote: > Modified the patch according to the suggestions. Thanks for updating the patch! I applied the cosmetic changes to the patch and added the example of the function call into the document. Attached is the updated version of the patch. Could you check this version? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-03-31 04:36, Fujii Masao wrote: > On 2021/03/30 22:06, torikoshia wrote: >> Modified the patch according to the suggestions. > > Thanks for updating the patch! > > I applied the cosmetic changes to the patch and added the example of > the function call into the document. Attached is the updated version > of the patch. Could you check this version? > Thanks a lot! +The memory contexts will be logged in the log file. For example: When 'log_destination = stderr' and 'logging_collector = off', it does not log to the file but to stderr. A description like the one below would be a bit more accurate, but I'm wondering if it's repeating the same words. + The memory contexts will be logged based on the log configuration set. For example: How do you think? +<programlisting> +postgres=# SELECT pg_log_backend_memory_contexts(pg_backend_pid()); + pg_log_backend_memory_contexts +-------------------------------- + t +(1 row) + +The memory contexts will be logged in the log file. For example: +LOG: logging memory contexts of PID 10377 +STATEMENT: SELECT pg_log_backend_memory_contexts(pg_backend_pid()); +LOG: level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used +LOG: level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used The line "The memory contexts will be logged in the log file. For example:" is neither an SQL command nor its output, so it might be better to differentiate it. What about the following, as in the attached patch? +<programlisting> +postgres=# SELECT pg_log_backend_memory_contexts(pg_backend_pid()); + pg_log_backend_memory_contexts +-------------------------------- + t +(1 row) +</programlisting> +The memory contexts will be logged in the log file. 
For example: +<screen> +LOG: logging memory contexts of PID 10377 +STATEMENT: SELECT pg_log_backend_memory_contexts(pg_backend_pid()); +LOG: level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used +LOG: level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used ...(snip)... +LOG: level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used +LOG: Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 used +</screen> Regards.
At Wed, 31 Mar 2021 15:02:02 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in > On 2021-03-31 04:36, Fujii Masao wrote: > > On 2021/03/30 22:06, torikoshia wrote: > >> Modified the patch according to the suggestions. > > Thanks for updating the patch! > > I applied the cosmetic changes to the patch and added the example of > > the function call into the document. Attached is the updated version > > of the patch. Could you check this version? > > > > Thanks a lot! > > > +The memory contexts will be logged in the log file. For example: > > When 'log_destination = stderr' and 'logging_collector = off', it does > not log in the file but in the stderr. > > Description like below would be a bit more accurate but I'm wondering > it's repeating the same words. > > + The memory contexts will be logged based on the log configuration > set. For example: > > How do you think? How about "The memory contexts will be logged in the server log" ? I think "server log" doesn't suggest any concrete target. > +<programlisting> > +postgres=# SELECT pg_log_backend_memory_contexts(pg_backend_pid()); > + pg_log_backend_memory_contexts > +-------------------------------- > + t > +(1 row) > + > +The memory contexts will be logged in the log file. For example: > +LOG: logging memory contexts of PID 10377 > +STATEMENT: SELECT pg_log_backend_memory_contexts(pg_backend_pid()); > +LOG: level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free > (5 chunks); 66368 used > +LOG: level: 1; pgstat TabStatusArray lookup hash table: 8192 total in > 1 blocks; 1408 free (0 chunks); 6784 used > > The line "The memory contexts will be logged in the log file. For > example:" > is neither nor SQL command and its outputs, it might be better to > differentiate it. > > > What about the following like attached patch? 
> > +<programlisting> > +postgres=# SELECT pg_log_backend_memory_contexts(pg_backend_pid()); > + pg_log_backend_memory_contexts > +-------------------------------- > + t > +(1 row) > +</programlisting> > +The memory contexts will be logged in the log file. For example: > +<screen> > +LOG: logging memory contexts of PID 10377 > +STATEMENT: SELECT pg_log_backend_memory_contexts(pg_backend_pid()); > +LOG: level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free > (5 chunks); 66368 used > +LOG: level: 1; pgstat TabStatusArray lookup hash table: 8192 total in > 1 blocks; 1408 free (0 chunks); 6784 used > ...(snip)... > +LOG: level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 > chunks); 264 used > +LOG: Grand total: 1651920 bytes in 201 blocks; 622360 free (88 > chunks); 1029560 used > +</screen> +1 from me. It looks like more compliant with the standard? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/03/31 15:16, Kyotaro Horiguchi wrote: >> + The memory contexts will be logged based on the log configuration >> set. For example: >> >> How do you think? > > How about "The memory contexts will be logged in the server log" ? > I think "server log" doesn't suggest any concrete target. Or just using "logged" is enough? Also I'd like to document that one message for each memory context is logged. So what about the following? One message for each memory context will be logged. For example, > +1 from me. It looks like more compliant with the standard? +1, too. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-04-01 19:13, Fujii Masao wrote: > On 2021/03/31 15:16, Kyotaro Horiguchi wrote: >>> + The memory contexts will be logged based on the log configuration >>> set. For example: >>> >>> How do you think? >> >> How about "The memory contexts will be logged in the server log" ? >> I think "server log" doesn't suggest any concrete target. > > Or just using "logged" is enough? > > Also I'd like to document that one message for each memory context is > logged. > So what about the following? > > One message for each memory context will be logged. For example, Agreed. BTW, there was a conflict since c30f54ad732 (Detect POLLHUP/POLLRDHUP while running queries), so I've attached v9. Regards,
On Sun, Apr 4, 2021 at 7:56 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
On 2021-04-01 19:13, Fujii Masao wrote:
> On 2021/03/31 15:16, Kyotaro Horiguchi wrote:
>>> + The memory contexts will be logged based on the log configuration
>>> set. For example:
>>>
>>> How do you think?
>>
>> How about "The memory contexts will be logged in the server log" ?
>> I think "server log" doesn't suggest any concrete target.
>
> Or just using "logged" is enough?
>
> Also I'd like to document that one message for each memory context is
> logged.
> So what about the following?
>
> One message for each memory context will be logged. For example,
Agreed.
BTW, there was a conflict since c30f54ad732(Detect POLLHUP/POLLRDHUP
while
running queries), attached v9.
Regards,
Hi,
+ * handler, and then which causes the next CHECK_FOR_INTERRUPTS()
I think the 'and then' is not needed:
handler which causes the next ...
+ * This is just a warning so a loop-through-resultset will not abort
+ * if one backend logged its memory contexts during the run.
The pid given by arg 0 is not a PostgreSQL server process. Which other backend could it be?
Thanks
On 2021/04/05 12:20, Zhihong Yu wrote: > + * This is just a warning so a loop-through-resultset will not abort > + * if one backend logged its memory contexts during the run. > > The pid given by arg 0 is not a PostgreSQL server process. Which other backend could it be ? This is the comment that I added wrongly. So the comment should be "This is just a warning so a loop-through-resultset will not abort if one backend terminated on its own during the run.", like pg_signal_backend(). Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-04-05 12:59, Fujii Masao wrote: > On 2021/04/05 12:20, Zhihong Yu wrote: Thanks for reviewing! >> + * On receipt of this signal, a backend sets the flag in the signal >> + * handler, and then which causes the next CHECK_FOR_INTERRUPTS() >> I think the 'and then' is not needed: Although I wonder whether either would be fine, I removed the words. >> + * This is just a warning so a loop-through-resultset will not >> abort >> + * if one backend logged its memory contexts during the run. >> >> The pid given by arg 0 is not a PostgreSQL server process. Which other >> backend could it be ? > > This is the comment that I added wrongly. So the comment should be > "This is just a warning so a loop-through-resultset will not abort > if one backend terminated on its own during the run.", > like pg_signal_backend(). Thought? +1. Attached v10 patch. Regards,
On 2021/04/05 21:03, torikoshia wrote: > On 2021-04-05 12:59, Fujii Masao wrote: >> On 2021/04/05 12:20, Zhihong Yu wrote: > > Thanks for reviewing! > >>> + * On receipt of this signal, a backend sets the flag in the signal >>> + * handler, and then which causes the next CHECK_FOR_INTERRUPTS() > >>> I think the 'and then' is not needed: > > Although I wonder either would be fine, removed the words. > >>> + * This is just a warning so a loop-through-resultset will not abort >>> + * if one backend logged its memory contexts during the run. >>> >>> The pid given by arg 0 is not a PostgreSQL server process. Which other backend could it be ? >> >> This is the comment that I added wrongly. So the comment should be >> "This is just a warning so a loop-through-resultset will not abort >> if one backend terminated on its own during the run.", >> like pg_signal_backend(). Thought? > > +1. > > Attached v10 patch. Thanks for updating the patch! I updated the patch as follows. Could you check the attached patch? + Memory contexts will be logged based on the log configuration set. + See <xref linkend="runtime-config-logging"/> for more information. Those memory contexts are logged at LOG level, but they are not sent to a client whatever the setting of client_min_messages. I think this information should be documented. So I updated the document as follows. These memory contexts will be logged at <literal>LOG</literal> message level. They will appear in the server log based on the log configuration set (See <xref linkend="runtime-config-logging"/> for more information), but will not be sent to the client whatever the setting of <xref linkend="guc-client-min-messages"/>. + Only superusers can log the memory contexts. We can read this description as "only superusers can request to log ...". But ISTM that we can also read this as "only superusers can log (dump) the memory contexts of its backend". Right? To avoid this confusion, I updated this description as follows. 
Only superusers can request to log the memory contexts. I added the following note about the performance overhead by this function. Note that frequent calls to this function could incur significant overhead, because it may generate a large number of log messages. I also added some comments. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-04-06 00:08, Fujii Masao wrote: > On 2021/04/05 21:03, torikoshia wrote: >> On 2021-04-05 12:59, Fujii Masao wrote: >>> On 2021/04/05 12:20, Zhihong Yu wrote: >> >> Thanks for reviewing! >> >>>> + * On receipt of this signal, a backend sets the flag in the signal >>>> + * handler, and then which causes the next CHECK_FOR_INTERRUPTS() >> >>>> I think the 'and then' is not needed: >> >> Although I wonder either would be fine, removed the words. >> >>>> + * This is just a warning so a loop-through-resultset will >>>> not abort >>>> + * if one backend logged its memory contexts during the run. >>>> >>>> The pid given by arg 0 is not a PostgreSQL server process. Which >>>> other backend could it be ? >>> >>> This is the comment that I added wrongly. So the comment should be >>> "This is just a warning so a loop-through-resultset will not abort >>> if one backend terminated on its own during the run.", >>> like pg_signal_backend(). Thought? >> >> +1. >> >> Attached v10 patch. > > Thanks for updating the patch! > > I updated the patch as follows. Could you check the attached patch? Thanks a lot! I don't have any objections to your improvements. Regards,
On 2021/04/06 10:57, torikoshia wrote: > I don't have any objections to your improvements. Thanks for the check! I pushed the patch. Thanks a lot! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION