Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers
From: Ashutosh Bapat
Subject: Re: RFC: Logging plan of the running query
Msg-id: CAExHW5ukAdmnq2B8gVJ4ufWFUrQvFYtoMf_fGgpsDkTtEF4oZw@mail.gmail.com
In response to: Re: RFC: Logging plan of the running query (torikoshia <torikoshia@oss.nttdata.com>)
List: pgsql-hackers
On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi Ashutosh,
>
> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>
> > Thanks for the summary. It is helpful. I think the patch is also
> > getting better.
> >
> > I have a few questions and suggestions.
>
> Thanks for your comments.
>
> > 1. The prologue of GetLockMethodLocalHash() mentions
> >  * NOTE: When there are many entries in LockMethodLocalHash, calling this
> >  * function and looking into all of them can lead to performance
> >  * problems.
> >  */
> > How bad could this performance be? Let's assume that a query is taking
> > time and pg_log_query_plan() is invoked to examine the plan of this
> > query. Is it possible that looping over all the locks itself takes
> > a lot of time, delaying the query execution further?
>
> I think it depends on the number of local locks, but I've measured the
> cpu time of this page lock check by adding the code below together with
> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1], which
> calls ProcessLogQueryPlanInterrupt() in every CFI, on my laptop, just
> for your information:
>
> diff --git a/src/backend/commands/explain.c
> b/src/backend/commands/explain.c
> index 5f7d77d567..65b7cb4925 100644
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -44,6 +44,8 @@
>
> +#include "time.h"
> ...
> @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>           * we check all the LocalLock entries and when finding even one, give up
>           * logging the plan.
>           */
> +        start = clock();
>          hash_seq_init(&status, GetLockMethodLocalHash());
>          while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
>          {
>                  if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>                  {
>                          ereport(LOG_SERVER_ONLY,
>                                  errmsg("ignored request for logging query plan due to page lock conflicts"),
>                                  errdetail("You can try again in a moment."));
>                          hash_seq_term(&status);
>
>                          ProcessLogQueryPlanInterruptActive = false;
>                          return;
>                  }
>          }
> +        end = clock();
> +        cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
> +
> +        ereport(LOG,
> +                errmsg("all locallock entry search took: %f", cpu_time_used));
> +
>
> There were about 3 million log lines which recorded the cpu time, and
> the duration was quite short:
>
> =# -- Extracted cpu_time_used from the log and loaded it into cpu_time.d.
> =# select max(d), min(d), avg(d) from cpu_time;
>     max    | min |          avg
> ----------+-----+-----------------------
>  0.000116 |   0 | 4.706274625332238e-07
>
> I'm not certain that this is valid for actual use cases, but these
> results seem to suggest that it will not take that long.

What load did you run? I don't think any query in make check would take,
say, thousands of locks. The prologue refers to a heavily populated lock
hash table. I think that will happen if thousands of tables are queried
in a single query, or if a query runs on a partitioned table with
thousands of partitions. Maybe we want to try that scenario.
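Something along these lines is what I have in mind; a query touching a
few thousand partitions should populate LockMethodLocalHash with about
as many LOCALLOCK entries (an untested sketch, names and numbers are
arbitrary):

-- session 1: build a table with a few thousand partitions
CREATE TABLE t (i int) PARTITION BY RANGE (i);
SELECT format('CREATE TABLE t_%s PARTITION OF t FOR VALUES FROM (%s) TO (%s);',
              g, g, g + 1)
FROM generate_series(0, 4999) g \gexec

-- keep the query running for a while so that the plan can be requested
-- while the ~5000 partition locks are still held
SELECT count(*), pg_sleep(60) FROM t;

-- session 2: request the plan of session 1 while it holds those locks
SELECT pg_log_query_plan(<pid of session 1>);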
> > 2. What happens if auto_explain is enabled in the backend and
> > pg_log_query_plan() is called on the same backend? Will they
> > conflict? I think we should add a test for the same.
>
> Hmm, I think they don't conflict, since they just refer to the
> QueryDesc, don't modify it, and don't use the same objects for locking.
> (I imagine 'conflict' here is something like a 'hard conflict' in
> replication[2].)

By conflict, I mean that the two features behave weirdly when used
together, e.g. give wrong results or crash.

> Actually, using both auto_explain and pg_log_query_plan() outputs each
> log separately:
>
> (pid:62835)=# select pg_sleep(10);
> (pid:70000)=# select pg_log_query_plan(62835);
>
> (pid:70000)=# \! cat data/log/postgres.log
> ...
> 2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan running on backend with PID 62835 is:
>         Query Text: select pg_sleep(10);
>         Result  (cost=0.00..0.01 rows=1 width=4)
>           Output: pg_sleep('10'::double precision)
>         Query Identifier: 3506829283127886044
> 2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:  ProcessLogQueryPlanInterrupt, explain.c:5336
> 2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration: 10000.868 ms  plan:
>         Query Text: select pg_sleep(10);
>         Result  (cost=0.00..0.01 rows=1 width=4) (actual time=10000.802..10000.804 rows=1 loops=1)
>
> > Using injection point support we should be able to add tests for
> > testing pg_log_query_plan behaviour when there are page locks held or
> > when auto_explain (with instrumentation) and pg_log_query_plan() work
> > on the same query plan. Use an injection point to make the backend
> > running the query wait at a suitable point to delay its execution,
> > and fire pg_log_query_plan() from another backend. Maybe the same
> > test could examine the server log file to see if the plan is indeed
> > output to the server log file.
> >
> > Given that the feature will be used when things have already gone
> > wrong, it should not make things more serious. So more testing, and
> > especially automated testing, would help.
>
> Thanks for the advice, it seems a good idea.
> I'm going to try to add tests using injection points.

Your test with pg_sleep() is a good basic test, but more involved
testing might need something like injection points.
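For instance, with the injection_points test module, a test could look
roughly like this (a sketch only; 'executor-run-start' is a made-up
point name, and a real test would need an INJECTION_POINT() placed at a
suitable spot in the executor):

-- assumes the injection_points module from src/test/modules is installed
CREATE EXTENSION injection_points;

-- make a backend wait when it reaches the (hypothetical) point
SELECT injection_points_attach('executor-run-start', 'wait');

-- session 1: blocks at the injection point while executing the query
SELECT * FROM pg_class;

-- session 2: log the plan while session 1 is stalled, then release it
SELECT pg_log_query_plan(<pid of session 1>);
SELECT injection_points_wakeup('executor-run-start');
SELECT injection_points_detach('executor-run-start');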
--
Best Wishes,
Ashutosh Bapat