Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From:            torikoshia
Subject:         Re: RFC: Logging plan of the running query
Date:
Msg-id:          a82b1a3f677d2a3aa95083c394e12843@oss.nttdata.com
In response to:  Re: RFC: Logging plan of the running query (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses:       Re: RFC: Logging plan of the running query
List:            pgsql-hackers
On 2024-02-07 13:58, Ashutosh Bapat wrote:
> On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com>
> wrote:
>>
>> Hi Ashutosh,
>>
>> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>>
>> > Thanks for the summary. It is helpful. I think the patch is also
>> > getting better.
>> >
>> > I have a few questions and suggestions.
>>
>> Thanks for your comments.
>>
>> > 1. The prologue of GetLockMethodLocalHash() mentions
>> >   * NOTE: When there are many entries in LockMethodLocalHash,
>> >   * calling this function and looking into all of them can lead
>> >   * to performance problems.
>> >   */
>> > How bad could this performance be? Let's assume that a query is taking
>> > time and pg_log_query_plan() is invoked to examine the plan of this
>> > query. Is it possible that the looping over all the locks itself takes
>> > a lot of time, delaying the query execution further?
>>
>> I think it depends on the number of local locks, but just for your
>> information, I've measured the cpu time for this page lock check on my
>> laptop by adding the code below together with
>> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1], which
>> calls ProcessLogQueryPlanInterrupt() in every CFI:
>>
>> diff --git a/src/backend/commands/explain.c
>> b/src/backend/commands/explain.c
>> index 5f7d77d567..65b7cb4925 100644
>> --- a/src/backend/commands/explain.c
>> +++ b/src/backend/commands/explain.c
>> @@ -44,6 +44,8 @@
>>
>> +#include "time.h"
>> ...
>> @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>>          * we check all the LocalLock entries and when finding even one,
>>          * give up logging the plan.
>>          */
>> +       start = clock();
>>         hash_seq_init(&status, GetLockMethodLocalHash());
>>         while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
>>         {
>>                 if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>>                 {
>>                         ereport(LOG_SERVER_ONLY,
>>                                 errmsg("ignored request for logging query plan due to page lock conflicts"),
>>                                 errdetail("You can try again in a moment."));
>>                         hash_seq_term(&status);
>>
>>                         ProcessLogQueryPlanInterruptActive = false;
>>                         return;
>>                 }
>>         }
>> +       end = clock();
>> +       cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
>> +
>> +       ereport(LOG,
>> +               errmsg("all locallock entry search took: %f", cpu_time_used));
>> +
>>
>> There were about 3 million log lines which recorded the cpu time, and
>> the duration was quite short:
>>
>> =# -- Extracted cpu_time_used from the log and loaded it into cpu_time.d.
>> =# select max(d), min(d), avg(d) from cpu_time;
>>     max    | min |          avg
>>  ----------+-----+-----------------------
>>   0.000116 |   0 | 4.706274625332238e-07
>>
>> I'm not certain that this is valid for actual use cases, but these
>> results seem to suggest that it will not take that long.
>
> What load did you run? I don't think any query in make check would
> take, say, thousands of locks.

Sorry, I forgot to mention it, but I ran make check, as you imagined.

> The prologue refers to a very populated
> lock hash table. I think that will happen if thousands of tables are
> queried in a single query OR a query runs on a partitioned table with
> thousands of partitions. Maybe we want to try that scenario.

OK, I'll try such cases.
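To populate the local lock table, I'm thinking of something like the
following (an untested sketch; the table name pt and the number of
partitions are arbitrary):

  -- a partitioned table with 5000 partitions
  CREATE TABLE pt (i int) PARTITION BY RANGE (i);
  SELECT format('CREATE TABLE pt_%s PARTITION OF pt FOR VALUES FROM (%s) TO (%s)',
                n, n, n + 1)
  FROM generate_series(0, 4999) AS n \gexec

  -- take (and hold) a local lock on every partition, then keep the
  -- transaction open so the locks stay in LockMethodLocalHash while
  -- pg_log_query_plan() is fired from another session
  BEGIN;
  SELECT count(*) FROM pt;
  SELECT pg_sleep(60);
  COMMIT;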
>> > 2. What happens if auto_explain is enabled in the backend and
>> > pg_log_query_plan() is called on the same backend? Will they conflict?
>> > I think we should add a test for the same.
>>
>> Hmm, I think they don't conflict, since they just refer to the QueryDesc
>> and don't modify it, and they don't use the same objects for locking.
>> (I imagine 'conflict' here is something like a 'hard conflict' in
>> replication[2].)
>
> By conflict, I mean the two features behave weirdly when used together,
> e.g. give wrong results or crash etc.
>
>>
>> Actually, using both auto_explain and pg_log_query_plan() outputs each
>> log separately:
>>
>> (pid:62835)=# select pg_sleep(10);
>> (pid:70000)=# select pg_log_query_plan(62835);
>>
>> (pid:70000)=# \! cat data/log/postgres.log
>> ...
>> 2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan running
>> on backend with PID 62835 is:
>>         Query Text: select pg_sleep(10);
>>         Result  (cost=0.00..0.01 rows=1 width=4)
>>           Output: pg_sleep('10'::double precision)
>>         Query Identifier: 3506829283127886044
>> 2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:
>> ProcessLogQueryPlanInterrupt, explain.c:5336
>> 2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration:
>> 10000.868 ms  plan:
>>         Query Text: select pg_sleep(10);
>>         Result  (cost=0.00..0.01 rows=1 width=4) (actual
>> time=10000.802..10000.804 rows=1 loops=1)
>>
>> > Using injection point support we should be able to add tests for
>> > testing pg_log_query_plan behaviour when there are page locks held or
>> > when auto_explain (with instrumentation) and pg_log_query_plan() work
>> > on the same query plan. Use an injection point to make the backend
>> > running the query wait at a suitable point to delay its execution, and
>> > fire pg_log_query_plan() from another backend. Maybe the same test
>> > could examine the server log file to see if the plan is indeed output
>> > to the server log file.
>> >
>> > Given that the feature will be used when things have already gone
>> > wrong, it should not make things more serious. So more testing, and
>> > especially automated testing, would help.
>>
>> Thanks for the advice; it seems like a good idea.
>> I'm going to try to add tests using injection points.
>
> Your test with pg_sleep() is a good basic test. But more involved
> testing might need something like injection points.

It might be so; I will consider whether there are any subtle timing
issues, etc.
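As a rough outline of such a test (names here are assumptions:
'executor-run' is a hypothetical injection point, and I'm assuming the
injection_points test module provides a waiting action and a wakeup
function):

  -- session 1: attach a waiting injection point, then run the target
  -- query, which blocks at that point
  (pid:101)=# SELECT injection_points_attach('executor-run', 'wait');
  (pid:101)=# SELECT count(*) FROM pt;

  -- session 2: log the plan of session 1, check the server log, then
  -- release session 1
  (pid:102)=# SELECT pg_log_query_plan(101);
  (pid:102)=# \! grep 'query plan running on backend' data/log/postgres.log
  (pid:102)=# SELECT injection_points_wakeup('executor-run');

A TAP test could drive the same sequence and grep the log file
automatically, as you suggested.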
--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation