Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: RFC: Logging plan of the running query
Date
Msg-id CAExHW5ukAdmnq2B8gVJ4ufWFUrQvFYtoMf_fGgpsDkTtEF4oZw@mail.gmail.com
Whole thread Raw
In response to Re: RFC: Logging plan of the running query  (torikoshia <torikoshia@oss.nttdata.com>)
Responses Re: RFC: Logging plan of the running query
Re: RFC: Logging plan of the running query
List pgsql-hackers
On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi Ashutosh,
>
> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>
> > Thanks for the summary. It is helpful. I think patch is also getting
> > better.
> >
> > I have a few questions and suggestions
>
> Thanks for your comments.
>
> > 1. Prologue of GetLockMethodLocalHash() mentions
> >  * NOTE: When there are many entries in LockMethodLocalHash, calling
> > this
> >  * function and looking into all of them can lead to performance
> > problems.
> >  */
> > How bad this performance could be. Let's assume that a query is taking
> > time and pg_log_query_plan() is invoked to examine the plan of this
> > query. Is it possible that the looping over all the locks itself takes
> > a lot of time delaying the query execution further?
>
> I think it depends on the number of local locks, but I've measured cpu
> time for this page lock check by adding below codes and
> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1], which
> calls ProcessLogQueryPlanInterrupt() in every CFI on my laptop just for
> your information:
>
>    diff --git a/src/backend/commands/explain.c
> b/src/backend/commands/explain.c
>    index 5f7d77d567..65b7cb4925 100644
>    --- a/src/backend/commands/explain.c
>    +++ b/src/backend/commands/explain.c
>    @@ -44,6 +44,8 @@
>
>    +#include "time.h"
>    ...
>    @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>             * we check all the LocalLock entries and when finding even
> one, give up
>             * logging the plan.
>             */
>    +       start = clock();
>            hash_seq_init(&status, GetLockMethodLocalHash());
>            while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) !=
> NULL)
>            {
>            if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>                {
>                    ereport(LOG_SERVER_ONLY,
>                        errmsg("ignored request for logging query plan due
> to page lock conflicts"),
>                        errdetail("You can try again in a moment."));
>                    hash_seq_term(&status);
>
>                    ProcessLogQueryPlanInterruptActive = false;
>                    return;
>                }
>            }
>    +       end = clock();
>    +       cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
>    +
>    +       ereport(LOG,
>    +               errmsg("all locallock entry search took: %f",
> cpu_time_used));
>    +
>
> There were about 3 million log lines which recorded the cpu time, and
> the duration was quite short:
>
>    =# -- Extracted cpu_time_used from log and loaded it to cpu_time.d.
>    =# select max(d), min(d), avg(d) from cpu_time ;
>       max    | min |          avg
>    ----------+-----+-----------------------
>     0.000116 |   0 | 4.706274625332238e-07
>
> I'm not certain that this is valid for actual use cases, but these
> results seem to suggest that it will not take that long.

What load did you run? I don't think any query in make check would
take say thousands of locks. The prologue refers to a very populated
lock hash table. I think that will happen if thousands of tables are
queried in a single query OR a query runs on a partitioned table with
thousands of partitions. May be we want to try that scenario.

>
>
> > 2. What happens if auto_explain is enabled in the backend and
> > pg_log_query_plan() is called on the same backend? Will they conflict?
> > I think we should add a test for the same.
>
> Hmm, I think they don't conflict since they just refer QueryDesc and
> don't modify it and don't use same objects for locking.
> (I imagine 'conflict' here is something like 'hard conflict' in
> replication[2].)

By conflict, I mean the two features behave weird when used together
e.g give wrong results or crash etc.

>
> Actually using both auto_explain and pg_log_query_plan() output each
> logs separately:
>
>    (pid:62835)=# select pg_sleep(10);
>    (pid:70000)=# select pg_log_query_plan(62835);
>
>    (pid:70000)=# \! cat data/log/postgres.log
>    ...
>    2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan
> running on backend with PID 62835 is:
>          Query Text: select pg_sleep(10);
>          Result  (cost=0.00..0.01 rows=1 width=4)
>            Output: pg_sleep('10'::double precision)
>          Query Identifier: 3506829283127886044
>    2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:
> ProcessLogQueryPlanInterrupt, explain.c:5336
>    2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration:
> 10000.868 ms  plan:
>          Query Text: select pg_sleep(10);
>          Result  (cost=0.00..0.01 rows=1 width=4) (actual
> time=10000.802..10000.804 rows=1 loops=1)
>
> > Using injection point support we should be able to add tests for
> > testing pg_log_query_plan behaviour when there are page locks held or
> > when auto_explain (with instrumentation) and pg_log_query_plan() work
> > on the same query plan. Use injection point to make the backend
> > running query wait at a suitable point to delay its execution and fire
> > pg_log_query_plan() from other backend. May be the same test could
> > examine the server log file to see if the plan is indeed output to the
> > server log file.
> >
> > Given that the feature will be used when the things have already gone
> > wrong, it should not make things more serious. So more testing and
> > especially automated would help.
>
> Thanks for the advice, it seems a good idea.
> I'm going to try to add tests using injection point.

Your test with pg_sleep() is a good basic test. But more involved
testing might need something like injection points.

--
Best Wishes,
Ashutosh Bapat



pgsql-hackers by date:

Previous
From: "Euler Taveira"
Date:
Subject: Re: speed up a logical replica setup
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: speed up a logical replica setup