Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From torikoshia
Subject Re: RFC: Logging plan of the running query
Date
Msg-id 3d345c49561f31a1d8ff60ea86f88840@oss.nttdata.com
Whole thread Raw
In response to Re: RFC: Logging plan of the running query  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: RFC: Logging plan of the running query
List pgsql-hackers
On 2021-08-20 01:12, Fujii Masao wrote:
> On 2021/08/11 21:14, torikoshia wrote:
>> As far as I looked into, pg_log_current_plan() can call InstrEndLoop() 
>> through ExplainNode().
>> I added a flag to ExplainState to avoid calling InstrEndLoop() when 
>> ExplainNode() is called from pg_log_current_plan().
> 
> Thanks for updating the patch!
> I tried to test the patch again and encountered two issues.

Thanks for finding these issues!

> 
> (1)
> The following WITH RECURSIVE query failed with the error
> "ERROR:  failed to find plan for CTE sg" when I ran
> pg_log_current_query_plan() against the backend executing that query.
> Is this a bug?
> 
>     create table graph0( f int, t int, label text );
>     insert into graph0 values (1, 2, 'arc 1 -> 2'),(1, 3, 'arc 1 ->
> 3'),(2, 3, 'arc 2 -> 3'),(1, 4, 'arc 1 -> 4'),(4, 5, 'arc 4 -> 5');
> 
>     with recursive search_graph(f, t, label, i) as (
>         select *, 1||pg_sleep(1)::text from graph0 g
>         union distinct
>         select g.*,1||pg_sleep(1)::text
>         from graph0 g, search_graph sg
>        where g.f = sg.t
>     ) search breadth first by f, t set seq
>     select * from search_graph order by seq;

This ERROR occurred without applying the patch and just calling 
EXPLAIN(VERBOSE) to CTE with SEARCH BREADTH FIRST.

I'm going to make another thread to discuss it.

> (2)
> When I ran pg_log_current_query_plan() while "make installcheck" test
> was running, I got the following assertion failure.
> 
> TRAP: FailedAssertion("!IsPageLockHeld || (locktag->locktag_type ==
> LOCKTAG_RELATION_EXTEND)", File: "lock.c", Line: 894, PID: 61512)
> 
> 0   postgres                            0x000000010ec23557
> ExceptionalCondition + 231
> 1   postgres                            0x000000010e9eff15
> LockAcquireExtended + 1461
> 2   postgres                            0x000000010e9ed14d 
> LockRelationOid + 61
> 3   postgres                            0x000000010e41251b 
> relation_open + 91
> 4   postgres                            0x000000010e509679 table_open + 
> 25
> 5   postgres                            0x000000010ebf9462
> SearchCatCacheMiss + 274
> 6   postgres                            0x000000010ebf5979
> SearchCatCacheInternal + 761
> 7   postgres                            0x000000010ebf566c 
> SearchCatCache + 60
> 8   postgres                            0x000000010ec1a9e0 
> SearchSysCache + 144
> 9   postgres                            0x000000010ec1ae03
> SearchSysCacheExists + 51
> 10  postgres                            0x000000010e58ce35 
> TypeIsVisible + 437
> 11  postgres                            0x000000010ea98e4c
> format_type_extended + 1964
> 12  postgres                            0x000000010ea9900e
> format_type_with_typemod + 30
> 13  postgres                            0x000000010eb78d76 
> get_const_expr + 742
> 14  postgres                            0x000000010eb79bc8 
> get_rule_expr + 232
> 15  postgres                            0x000000010eb8140f 
> get_func_expr + 1247
> 16  postgres                            0x000000010eb79dcd 
> get_rule_expr + 749
> 17  postgres                            0x000000010eb81688
> get_rule_expr_paren + 136
> 18  postgres                            0x000000010eb7bf38 
> get_rule_expr + 9304
> 19  postgres                            0x000000010eb72ad5
> deparse_expression_pretty + 149
> 20  postgres                            0x000000010eb73463
> deparse_expression + 83
> 21  postgres                            0x000000010e68eaf1 
> show_plan_tlist + 353
> 22  postgres                            0x000000010e68adaf ExplainNode 
> + 4991
> 23  postgres                            0x000000010e688b4b
> ExplainPrintPlan + 283
> 24  postgres                            0x000000010e68e1aa
> ProcessLogCurrentPlanInterrupt + 266
> 25  postgres                            0x000000010ea133bb
> ProcessInterrupts + 3435
> 26  postgres                            0x000000010e738c97
> vacuum_delay_point + 55
> 27  postgres                            0x000000010e42bb4b
> ginInsertCleanup + 1531
> 28  postgres                            0x000000010e42d418
> gin_clean_pending_list + 776
> 29  postgres                            0x000000010e74955a 
> ExecInterpExpr + 2522
> 30  postgres                            0x000000010e7487e2
> ExecInterpExprStillValid + 82
> 31  postgres                            0x000000010e7ae83b
> ExecEvalExprSwitchContext + 59
> 32  postgres                            0x000000010e7ae7be ExecProject 
> + 78
> 33  postgres                            0x000000010e7ae4e9 ExecResult + 
> 345
> 34  postgres                            0x000000010e764e42
> ExecProcNodeFirst + 82
> 35  postgres                            0x000000010e75ccb2 ExecProcNode 
> + 50
> 36  postgres                            0x000000010e758301 ExecutePlan 
> + 193
> 37  postgres                            0x000000010e7581d1
> standard_ExecutorRun + 609
> 38  auto_explain.so                     0x000000010f1df3a7
> explain_ExecutorRun + 247
> 39  postgres                            0x000000010e757f3b ExecutorRun 
> + 91
> 40  postgres                            0x000000010ea1cb49 
> PortalRunSelect + 313
> 41  postgres                            0x000000010ea1c4dd PortalRun + 
> 861
> 42  postgres                            0x000000010ea17474
> exec_simple_query + 1540
> 43  postgres                            0x000000010ea164d4 PostgresMain 
> + 2580
> 44  postgres                            0x000000010e91d159 BackendRun + 
> 89
> 45  postgres                            0x000000010e91c6a5 
> BackendStartup + 565
> 46  postgres                            0x000000010e91b3fe ServerLoop + 
> 638
> 47  postgres                            0x000000010e918b9d 
> PostmasterMain + 6717
> 48  postgres                            0x000000010e7efd43 main + 819
> 49  libdyld.dylib                       0x00007fff6a46e3d5 start + 1
> 50  ???                                 0x0000000000000003 0x0 + 3
> 
> LOG:  server process (PID 61512) was terminated by signal 6: Abort 
> trap: 6
> DETAIL:  Failed process was running: select
> gin_clean_pending_list('t_gin_test_tbl_i_j_idx') is not null;

As far as I understand, since explaining plans can acquire heavyweight 
lock for example to get column names, when page lock is held at the time 
of the interrupt, this assertion error occurs.

The attached patch tries to avoid this by checking each LocalLock entry 
and when finding even one, giving up logging the plan.

Thoughts?


Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)
Next
From: torikoshia
Date:
Subject: EXPLAIN(VERBOSE) to CTE with SEARCH BREADTH FIRST fails