Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From | torikoshia
Subject | Re: RFC: Logging plan of the running query
Date |
Msg-id | 4af7712833d4ce989cf5e658106e42da@oss.nttdata.com
In response to | Re: RFC: Logging plan of the running query (James Coleman <jtc331@gmail.com>)
Responses | Re: RFC: Logging plan of the running query
List | pgsql-hackers
On 2023-08-28 22:47, James Coleman wrote:
> On Mon, Aug 28, 2023 at 3:01 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>>
>> On 2023-08-26 21:03, James Coleman wrote:
>> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331@gmail.com> wrote:
>> >>
>> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
>> >> <torikoshia@oss.nttdata.com> wrote:
>> >> >
>> >> > On 2023-06-16 01:34, James Coleman wrote:
>> >> > > Attached is v28, which sets ProcessLogQueryPlanInterruptActive to
>> >> > > false in errfinish when necessary. Once built with those two
>> >> > > patches I'm simply running `make check`.
>> >> >
>> >> > With the v28-0001 and v28-0002 patches, I confirmed that backend
>> >> > processes consume a huge amount of memory, and under some
>> >> > environments they were terminated by the OOM killer.
>> >> >
>> >> > This was because memory was allocated from existing memory contexts
>> >> > and those allocations were not freed after
>> >> > ProcessLogQueryPlanInterrupt().
>> >> > I updated the patch to use a dedicated memory context for
>> >> > ProcessLogQueryPlanInterrupt().
>> >> >
>> >> > Applying the attached patch and the v28-0002 patch, `make check`
>> >> > completed successfully after 20 min and 50 GB of logs on my
>> >> > environment.
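(To make the quoted fix concrete for anyone skimming the thread: the
dedicated-context approach amounts to roughly the sketch below. This is a
simplified illustration rather than the actual v29 patch; the function name,
the context name, and the assumption that a QueryDesc for the running query
is at hand are all placeholders.)

```
#include "postgres.h"

#include "commands/explain.h"
#include "utils/memutils.h"

/* Simplified sketch, not the actual patch code. */
static void
LogRunningQueryPlanSketch(QueryDesc *queryDesc)
{
	MemoryContext cxt;
	MemoryContext oldcxt;
	ExplainState *es;

	/*
	 * Build the EXPLAIN output in a dedicated, short-lived context so that
	 * everything allocated while printing the plan is released by a single
	 * MemoryContextDelete(), instead of piling up in whichever context
	 * happened to be current when the interrupt arrived.
	 */
	cxt = AllocSetContextCreate(CurrentMemoryContext,
								"LogQueryPlanContext",
								ALLOCSET_DEFAULT_SIZES);
	oldcxt = MemoryContextSwitchTo(cxt);

	es = NewExplainState();
	es->format = EXPLAIN_FORMAT_TEXT;
	ExplainPrintPlan(es, queryDesc);

	ereport(LOG_SERVER_ONLY,
			(errmsg("plan of the currently running query:\n%s",
					es->str->data)));

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(cxt);	/* frees the ExplainState and the plan text */
}
```

The point is just that the lifetime of everything the plan-printing code
pallocs is bounded by the handler itself, so repeated invocations across
CHECK_FOR_INTERRUPTS() calls should no longer accumulate allocations.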
>> >> >
>> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
>> >> > >>> > The tests have been running since last night, but have been
>> >> > >>> > apparently hung now for many hours.
>> >> >
>> >> > I don't know if this has anything to do with the hang you faced, but
>> >> > I thought it might be possible that the large amount of memory usage
>> >> > resulted in swapping, which caused a significant delay in processing.
>> >>
>> >> Ah, yes, I think that could be a possible explanation. I was delaying
>> >> on this thread because I wasn't comfortable with having caused an
>> >> issue once (even if I couldn't easily reproduce it) without at least
>> >> some theory as to the cause (and a fix).
>> >>
>> >> > If possible, I would be very grateful if you could try to reproduce
>> >> > this with the v29 patch.
>> >>
>> >> I'll kick off some testing.
>> >>
>> >
>> > I don't have time to investigate what's happening here, but 24 hours
>> > later the first "make check" is still running, and at first glance it
>> > seems to have the same behavior I'd seen that first time. The test
>> > output is to this point:
>> >
>> > # parallel group (5 tests): index_including create_view
>> > index_including_gist create_index create_index_spgist
>> > ok 66 + create_index 26365 ms
>> > ok 67 + create_index_spgist 27675 ms
>> > ok 68 + create_view 1235 ms
>> > ok 69 + index_including 1102 ms
>> > ok 70 + index_including_gist 1633 ms
>> > # parallel group (16 tests): create_aggregate create_cast errors
>> > roleattributes drop_if_exists hash_func typed_table create_am
>> > infinite_recurse
>> >
>> > and it hasn't progressed past that point since at least ~16 hours ago
>> > (the first several hours of the run I wasn't monitoring it).
>> >
>> > I haven't connected up gdb yet, and won't be able to until maybe
>> > tomorrow, but here's the ps output for the postgres processes that are
>> > running:
>> >
>> > admin 3213249 0.0 0.0 196824 20552 ? Ss Aug25 0:00 /home/admin/postgresql-test/bin/postgres -D /home/admin/postgresql-test-data
>> > admin 3213250 0.0 0.0 196964 7284 ? Ss Aug25 0:00 postgres: checkpointer
>> > admin 3213251 0.0 0.0 196956 4276 ? Ss Aug25 0:00 postgres: background writer
>> > admin 3213253 0.0 0.0 196956 8600 ? Ss Aug25 0:00 postgres: walwriter
>> > admin 3213254 0.0 0.0 198424 7316 ? Ss Aug25 0:00 postgres: autovacuum launcher
>> > admin 3213255 0.0 0.0 198412 5488 ? Ss Aug25 0:00 postgres: logical replication launcher
>> > admin 3237967 0.0 0.0 2484 516 pts/4 S+ Aug25 0:00 /bin/sh -c echo "# +++ regress check in src/test/regress +++" && PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH" LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib" INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20 --schedule=./parallel_schedule
>> > admin 3237973 0.0 0.0 197880 20688 pts/4 S+ Aug25 0:00 postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c listen_addresses= -k /tmp/pg_regress-7mmGUa
>> > admin 3237976 0.0 0.1 198332 44608 ? Ss Aug25 0:00 postgres: checkpointer
>> > admin 3237977 0.0 0.0 198032 4640 ? Ss Aug25 0:00 postgres: background writer
>> > admin 3237979 0.0 0.0 197880 8580 ? Ss Aug25 0:00 postgres: walwriter
>> > admin 3237980 0.0 0.0 199484 7608 ? Ss Aug25 0:00 postgres: autovacuum launcher
>> > admin 3237981 0.0 0.0 199460 5488 ? Ss Aug25 0:00 postgres: logical replication launcher
>> > admin 3243644 0.0 0.2 252400 74656 ? Ss Aug25 0:01 postgres: admin regression [local] SELECT waiting
>> > admin 3243645 0.0 0.1 205480 33992 ? Ss Aug25 0:00 postgres: admin regression [local] SELECT waiting
>> > admin 3243654 99.9 0.0 203156 31504 ? Rs Aug25 1437:49 postgres: admin regression [local] VACUUM
>> > admin 3243655 0.0 0.1 212036 38504 ? Ss Aug25 0:00 postgres: admin regression [local] SELECT waiting
>> > admin 3243656 0.0 0.0 206024 30892 ? Ss Aug25 0:00 postgres: admin regression [local] DELETE waiting
>> > admin 3243657 0.0 0.1 205568 32232 ? Ss Aug25 0:00 postgres: admin regression [local] ALTER TABLE waiting
>> > admin 3243658 0.0 0.0 203740 21532 ? Ss Aug25 0:00 postgres: admin regression [local] ANALYZE waiting
>> > admin 3243798 0.0 0.0 199884 8464 ? Ss Aug25 0:00 postgres: autovacuum worker
>> > admin 3244733 0.0 0.0 199492 5956 ? Ss Aug25 0:00 postgres: autovacuum worker
>> > admin 3245652 0.0 0.0 199884 8468 ? Ss Aug25 0:00 postgres: autovacuum worker
>> >
>> > As you can see, there are a bunch of backends presumably waiting, and
>> > the VACUUM process has also been pegging a single CPU core since at
>> > least that ~16-hour-ago mark.
>> >
>> > I hope to be able to do more investigation later, but I wanted to at
>> > least give you this information now.
>>
>> Thanks a lot for testing the patch!
>> I really appreciate your cooperation.
>>
>> Hmm, I also tested on the current HEAD (165d581f146b09) again on Ubuntu
>> 22.04 and macOS, but unfortunately (fortunately?) they succeeded, as
>> below:
>>
>> ```
>> $ git apply v29-0001-Add-function-to-log-the-plan-of-the-query.patch
>> $ git apply v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
>> $ ./configure --enable-debug --enable-cassert
>> $ make
>> $ make check
>>
>> ...(snip)...
>>
>> # parallel group (5 tests): index_including index_including_gist
>> create_view create_index create_index_spgist
>> ok 66 + create_index 25033 ms
>> ok 67 + create_index_spgist 26144 ms
>> ok 68 + create_view 3061 ms
>> ok 69 + index_including 976 ms
>> ok 70 + index_including_gist 2998 ms
>> # parallel group (16 tests): create_cast errors create_aggregate
>> roleattributes drop_if_exists hash_func typed_table create_am select
>> constraints updatable_views inherit triggers vacuum create_function_sql
>> infinite_recurse
>> ok 71 + create_aggregate 225 ms
>> ok 72 + create_function_sql 18874 ms
>> ok 73 + create_cast 168 ms
>>
>> ...(snip)...
>>
>> # All 215 tests passed.
>> ```
>>
>> If you notice any difference, I would be grateful if you could let me
>> know.
>
> I've never been able to reproduce it (haven't tested the new version,
> but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> (first buster and now bullseye).
>
> I'm attaching several stacktraces in the hope that they provide some
> clues. These all match the ps output I sent earlier, though note that in
> that output both the regress instance and my test instance (pid 3213249)
> are running (different ports, of course, and they are from the exact
> same compilation run). I've attached ps output for the postgres
> processes under the make check process to simplify cross-referencing.
>
> A few interesting things:
> - There's definitely a lock on a relation that seems to be what's
>   blocking the processes.
> - When I try to connect with psql, the process forks but then hangs
>   (see the ps output with task names stuck in "authentication"). I've
>   also included a trace from one of these.

Thanks for sharing them!

Many processes are waiting to acquire the LW lock, including the process
trying to output the plan (select1.trace).
I suspect that this is due to a lock that was acquired prior to being
interrupted by ProcessLogQueryPlanInterrupt(), but I have not been able to
reproduce the same situation.
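To spell out the kind of sequence I am suspecting (purely illustrative; the
function below is made up and is not part of the patch):

```
#include "postgres.h"

#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* Made-up function, only to illustrate the suspected ordering. */
static void
illustrate_suspected_hang(Relation rel)
{
	/* 1. The running query has already taken a lock on some relation. */
	LockRelationOid(RelationGetRelid(rel), AccessShareLock);

	/* 2. Later, on some hot path, the backend checks for interrupts. */
	CHECK_FOR_INTERRUPTS();

	/*
	 * 3. Suppose the plan-logging interrupt fires here, and the code it
	 *    runs (building the ExplainState, deparsing expressions, catalog
	 *    lookups) ends up waiting for a lock held by another backend that
	 *    is itself queued behind the lock taken in step 1. Then neither
	 *    side can make progress, and every later backend piles up as
	 *    "waiting", which is roughly what the ps output shows.
	 */

	UnlockRelationOid(RelationGetRelid(rel), AccessShareLock);
}
```

Again, this is only the scenario I am trying to reproduce, not something I
have confirmed.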
--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation