Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From torikoshia
Subject Re: RFC: Logging plan of the running query
Date
Msg-id 4af7712833d4ce989cf5e658106e42da@oss.nttdata.com
Whole thread Raw
In response to Re: RFC: Logging plan of the running query  (James Coleman <jtc331@gmail.com>)
Responses Re: RFC: Logging plan of the running query
List pgsql-hackers
On 2023-08-28 22:47, James Coleman wrote:
> On Mon, Aug 28, 2023 at 3:01 AM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> On 2023-08-26 21:03, James Coleman wrote:
>> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331@gmail.com> wrote:
>> >>
>> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
>> >> <torikoshia@oss.nttdata.com> wrote:
>> >> >
>> >> > On 2023-06-16 01:34, James Coleman wrote:
>> >> > > Attached is v28
>> >> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
>> >> > > when necessary. Once built with those two patches I'm simply running
>> >> > > `make check`.
>> >> >
>> >> > With v28-0001 and v28-0002 patch, I confirmed backend processes consume
>> >> > huge
>> >> > amount of memory and under some environments they were terminated by OOM
>> >> > killer.
>> >> >
>> >> > This was because memory was allocated from existing memory contexts and
>> >> > they
>> >> > were not freed after ProcessLogQueryPlanInterrupt().
>> >> > Updated the patch to use dedicated memory context for
>> >> > ProcessLogQueryPlanInterrupt().
>> >> >
>> >> > Applying attached patch and v28-0002 patch, `make check` successfully
>> >> > completed after 20min and 50GB of logs on my environment.
>> >> >
>> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
>> >> > >>> > The tests have been running since last night, but have been apparently
>> >> > >>> > hung now for many hours.
>> >> >
>> >> > I don't know if this has anything to do with the hung you faced, but I
>> >> > thought
>> >> > it might be possible that the large amount of memory usage resulted in
>> >> > swapping, which caused a significant delay in processing.
>> >>
>> >> Ah, yes, I think that could be a possible explanation. I was delaying
>> >> on this thread because I wasn't comfortable with having caused an
>> >> issue once (even if I couldn't easily reproduce) without at least some
>> >> theory as to the cause (and a fix).
>> >>
>> >> > If possible, I would be very grateful if you could try to reproduce this
>> >> > with
>> >> > the v29 patch.
>> >>
>> >> I'll kick off some testing.
>> >>
>> >
>> > I don't have time to investigate what's happening here, but 24 hours
>> > later the first "make check" is still running, and at first glance it
>> > seems to have the same behavior I'd seen that first time. The test
>> > output is to this point:
>> >
>> > # parallel group (5 tests):  index_including create_view
>> > index_including_gist create_index create_index_spgist
>> > ok 66        + create_index                            26365 ms
>> > ok 67        + create_index_spgist                     27675 ms
>> > ok 68        + create_view                              1235 ms
>> > ok 69        + index_including                          1102 ms
>> > ok 70        + index_including_gist                     1633 ms
>> > # parallel group (16 tests):  create_aggregate create_cast errors
>> > roleattributes drop_if_exists hash_func typed_table create_am
>> > infinite_recurse
>> >
>> > and it hasn't progressed past that point since at least ~16 hours ago
>> > (the first several hours of the run I wasn't monitoring it).
>> >
>> > I haven't connected up gdb yet, and won't be able to until maybe
>> > tomorrow, but here's the ps output for postgres processes that are
>> > running:
>> >
>> > admin    3213249  0.0  0.0 196824 20552 ?        Ss   Aug25   0:00
>> > /home/admin/postgresql-test/bin/postgres -D
>> > /home/admin/postgresql-test-data
>> > admin    3213250  0.0  0.0 196964  7284 ?        Ss   Aug25   0:00
>> > postgres: checkpointer
>> > admin    3213251  0.0  0.0 196956  4276 ?        Ss   Aug25   0:00
>> > postgres: background writer
>> > admin    3213253  0.0  0.0 196956  8600 ?        Ss   Aug25   0:00
>> > postgres: walwriter
>> > admin    3213254  0.0  0.0 198424  7316 ?        Ss   Aug25   0:00
>> > postgres: autovacuum launcher
>> > admin    3213255  0.0  0.0 198412  5488 ?        Ss   Aug25   0:00
>> > postgres: logical replication launcher
>> > admin    3237967  0.0  0.0   2484   516 pts/4    S+   Aug25   0:00
>> > /bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
>> >
PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
>> > LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
>> > INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
>> > ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
>> > --inputdir=. --bindir=     --dlpath=. --max-concurrent-tests=20
>> > --schedule=./parallel_schedule
>> > admin    3237973  0.0  0.0 197880 20688 pts/4    S+   Aug25   0:00
>> > postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
>> > listen_addresses= -k /tmp/pg_regress-7mmGUa
>> > admin    3237976  0.0  0.1 198332 44608 ?        Ss   Aug25   0:00
>> > postgres: checkpointer
>> > admin    3237977  0.0  0.0 198032  4640 ?        Ss   Aug25   0:00
>> > postgres: background writer
>> > admin    3237979  0.0  0.0 197880  8580 ?        Ss   Aug25   0:00
>> > postgres: walwriter
>> > admin    3237980  0.0  0.0 199484  7608 ?        Ss   Aug25   0:00
>> > postgres: autovacuum launcher
>> > admin    3237981  0.0  0.0 199460  5488 ?        Ss   Aug25   0:00
>> > postgres: logical replication launcher
>> > admin    3243644  0.0  0.2 252400 74656 ?        Ss   Aug25   0:01
>> > postgres: admin regression [local] SELECT waiting
>> > admin    3243645  0.0  0.1 205480 33992 ?        Ss   Aug25   0:00
>> > postgres: admin regression [local] SELECT waiting
>> > admin    3243654 99.9  0.0 203156 31504 ?        Rs   Aug25 1437:49
>> > postgres: admin regression [local] VACUUM
>> > admin    3243655  0.0  0.1 212036 38504 ?        Ss   Aug25   0:00
>> > postgres: admin regression [local] SELECT waiting
>> > admin    3243656  0.0  0.0 206024 30892 ?        Ss   Aug25   0:00
>> > postgres: admin regression [local] DELETE waiting
>> > admin    3243657  0.0  0.1 205568 32232 ?        Ss   Aug25   0:00
>> > postgres: admin regression [local] ALTER TABLE waiting
>> > admin    3243658  0.0  0.0 203740 21532 ?        Ss   Aug25   0:00
>> > postgres: admin regression [local] ANALYZE waiting
>> > admin    3243798  0.0  0.0 199884  8464 ?        Ss   Aug25   0:00
>> > postgres: autovacuum worker
>> > admin    3244733  0.0  0.0 199492  5956 ?        Ss   Aug25   0:00
>> > postgres: autovacuum worker
>> > admin    3245652  0.0  0.0 199884  8468 ?        Ss   Aug25   0:00
>> > postgres: autovacuum worker
>> >
>> > As you can see there are a bunch of backends presumably waiting, and
>> > also the VACUUM process has been pegging a single CPU core for at
>> > least since that ~16 hour ago mark.
>> >
>> > I hope to be able to do more investigation later, but I wanted to at
>> > least give you this information now.
>> 
>> Thanks a lot for testing the patch!
>> I really appreciate your cooperation.
>> 
>> Hmm, I also tested on the current HEAD(165d581f146b09) again on Ubuntu
>> 22.04 and macOS, but unfortunately(fortunately?) they succeeded as
>> below:
>> 
>> ```
>> $ git apply v29-0001-Add-function-to-log-the-plan-of-the-query.patch
>> $ git apply 
>> v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
>> $ ./configure --enable-debug --enable-cassert
>> $ make
>> $ make check
>> 
>> ...(snip)...
>> 
>> # parallel group (5 tests):  index_including index_including_gist
>> create_view create_index create_index_spgist
>> ok 66        + create_index                            25033 ms
>> ok 67        + create_index_spgist                     26144 ms
>> ok 68        + create_view                              3061 ms
>> ok 69        + index_including                           976 ms
>> ok 70        + index_including_gist                     2998 ms
>> # parallel group (16 tests):  create_cast errors create_aggregate
>> roleattributes drop_if_exists hash_func typed_table
>> create_am select constraints updatable_views inherit triggers vacuum
>> create_function_sql infinite_recurse
>> ok 71        + create_aggregate                          225 ms
>> ok 72        + create_function_sql                     18874 ms
>> ok 73        + create_cast                               168 ms
>> 
>> ...(snip)...
>> 
>> # All 215 tests passed.
>> ```
>> 
>> If you notice any difference, I would be grateful if you could let me
>> know.
> 
> I've never been able to reproduce it (haven't tested the new version,
> but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> (first buster and now bullseye).
> 
> I'm attaching several stacktraces in the hope that they provide some
> clues. These all match the ps output I sent earlier, though note in
> that output there is both the regress instance and my test instance
> (pid 3213249) running (different ports, of course, and they are from
> the exact same compilation run). I've attached ps output for the
> postgres processes under the make check process to simplify cross
> referencing.
> 
> A few interesting things:
> - There's definitely a lock on a relation that seems to be what's
> blocking the processes.
> - When I try to connect with psql the process forks but then hangs
> (see the ps output with task names stuck in "authentication"). I've
> also included a trace from one of these.

Thanks for sharing them!

Many processes are waiting to acquire the LW lock, including the process 
trying to output the plan(select1.trace).

I suspect that this is due to a lock that was acquired prior to being 
interrupted by ProcessLogQueryPlanInterrupt(), but have not been able to 
reproduce the same situation..

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



pgsql-hackers by date:

Previous
From: Aleksander Alekseev
Date:
Subject: Re: XLog size reductions: smaller XLRec block header for PG17
Next
From: Robert Haas
Date:
Subject: Re: Initdb-time block size specification