Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From torikoshia
Subject Re: RFC: Logging plan of the running query
Date
Msg-id a82b1a3f677d2a3aa95083c394e12843@oss.nttdata.com
Whole thread Raw
In response to Re: RFC: Logging plan of the running query  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses Re: RFC: Logging plan of the running query
List pgsql-hackers
On 2024-02-07 13:58, Ashutosh Bapat wrote:
> On Wed, Feb 7, 2024 at 9:38 AM torikoshia <torikoshia@oss.nttdata.com> 
> wrote:
>> 
>> Hi Ashutosh,
>> 
>> On 2024-02-06 19:51, Ashutosh Bapat wrote:
>> 
>> > Thanks for the summary. It is helpful. I think patch is also getting
>> > better.
>> >
>> > I have a few questions and suggestions
>> 
>> Thanks for your comments.
>> 
>> > 1. Prologue of GetLockMethodLocalHash() mentions
>> >  * NOTE: When there are many entries in LockMethodLocalHash, calling
>> > this
>> >  * function and looking into all of them can lead to performance
>> > problems.
>> >  */
>> > How bad this performance could be. Let's assume that a query is taking
>> > time and pg_log_query_plan() is invoked to examine the plan of this
>> > query. Is it possible that the looping over all the locks itself takes
>> > a lot of time delaying the query execution further?
>> 
>> I think it depends on the number of local locks, but I've measured cpu
>> time for this page lock check by adding below codes and
>> v27-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch[1], which
>> calls ProcessLogQueryPlanInterrupt() in every CFI on my laptop just 
>> for
>> your information:
>> 
>>    diff --git a/src/backend/commands/explain.c
>> b/src/backend/commands/explain.c
>>    index 5f7d77d567..65b7cb4925 100644
>>    --- a/src/backend/commands/explain.c
>>    +++ b/src/backend/commands/explain.c
>>    @@ -44,6 +44,8 @@
>> 
>>    +#include "time.h"
>>    ...
>>    @@ -5287,6 +5292,7 @@ ProcessLogQueryPlanInterrupt(void)
>>             * we check all the LocalLock entries and when finding even
>> one, give up
>>             * logging the plan.
>>             */
>>    +       start = clock();
>>            hash_seq_init(&status, GetLockMethodLocalHash());
>>            while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) 
>> !=
>> NULL)
>>            {
>>            if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)
>>                {
>>                    ereport(LOG_SERVER_ONLY,
>>                        errmsg("ignored request for logging query plan 
>> due
>> to page lock conflicts"),
>>                        errdetail("You can try again in a moment."));
>>                    hash_seq_term(&status);
>> 
>>                    ProcessLogQueryPlanInterruptActive = false;
>>                    return;
>>                }
>>            }
>>    +       end = clock();
>>    +       cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
>>    +
>>    +       ereport(LOG,
>>    +               errmsg("all locallock entry search took: %f",
>> cpu_time_used));
>>    +
>> 
>> There were about 3 million log lines which recorded the cpu time, and
>> the duration was quite short:
>> 
>>    =# -- Extracted cpu_time_used from log and loaded it to cpu_time.d.
>>    =# select max(d), min(d), avg(d) from cpu_time ;
>>       max    | min |          avg
>>    ----------+-----+-----------------------
>>     0.000116 |   0 | 4.706274625332238e-07
>> 
>> I'm not certain that this is valid for actual use cases, but these
>> results seem to suggest that it will not take that long.
> 
> What load did you run? I don't think any query in make check would
> take say thousands of locks.

Sorry, I forgot to write it but ran make check as you imagined.

> The prologue refers to a very populated
> lock hash table. I think that will happen if thousands of tables are
> queried in a single query OR a query runs on a partitioned table with
> thousands of partitions. May be we want to try that scenario.

OK, I'll try such cases.

>> > 2. What happens if auto_explain is enabled in the backend and
>> > pg_log_query_plan() is called on the same backend? Will they conflict?
>> > I think we should add a test for the same.
>> 
>> Hmm, I think they don't conflict since they just refer QueryDesc and
>> don't modify it and don't use same objects for locking.
>> (I imagine 'conflict' here is something like 'hard conflict' in
>> replication[2].)
> 
> By conflict, I mean the two features behave weird when used together
> e.g give wrong results or crash etc.
> 
>> 
>> Actually using both auto_explain and pg_log_query_plan() output each
>> logs separately:
>> 
>>    (pid:62835)=# select pg_sleep(10);
>>    (pid:70000)=# select pg_log_query_plan(62835);
>> 
>>    (pid:70000)=# \! cat data/log/postgres.log
>>    ...
>>    2024-02-06 21:44:17.837 JST [62835:4:0] LOG:  00000: query plan
>> running on backend with PID 62835 is:
>>          Query Text: select pg_sleep(10);
>>          Result  (cost=0.00..0.01 rows=1 width=4)
>>            Output: pg_sleep('10'::double precision)
>>          Query Identifier: 3506829283127886044
>>    2024-02-06 21:44:17.837 JST [62835:5:0] LOCATION:
>> ProcessLogQueryPlanInterrupt, explain.c:5336
>>    2024-02-06 21:44:26.974 JST [62835:6:0] LOG:  00000: duration:
>> 10000.868 ms  plan:
>>          Query Text: select pg_sleep(10);
>>          Result  (cost=0.00..0.01 rows=1 width=4) (actual
>> time=10000.802..10000.804 rows=1 loops=1)
>> 
>> > Using injection point support we should be able to add tests for
>> > testing pg_log_query_plan behaviour when there are page locks held or
>> > when auto_explain (with instrumentation) and pg_log_query_plan() work
>> > on the same query plan. Use injection point to make the backend
>> > running query wait at a suitable point to delay its execution and fire
>> > pg_log_query_plan() from other backend. May be the same test could
>> > examine the server log file to see if the plan is indeed output to the
>> > server log file.
>> >
>> > Given that the feature will be used when the things have already gone
>> > wrong, it should not make things more serious. So more testing and
>> > especially automated would help.
>> 
>> Thanks for the advice, it seems a good idea.
>> I'm going to try to add tests using injection point.
> 
> Your test with pg_sleep() is a good basic test. But more involved
> testing might need something like injection points.

It might be so, I will consider whether there are any subtle timing 
issues, etc.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: pgsql: Add EXPLAIN (MEMORY) to report planner memory consumption
Next
From: Peter Eisentraut
Date:
Subject: Re: Possibility to disable `ALTER SYSTEM`