Re: Proposal: Progressive explain - Mailing list pgsql-hackers

From torikoshia
Subject Re: Proposal: Progressive explain
Date
Msg-id 78cff2165abb881244376988893a87b8@oss.nttdata.com
Whole thread Raw
In response to Re: Proposal: Progressive explain  (Rafael Thofehrn Castro <rafaelthca@gmail.com>)
Responses Re: Proposal: Progressive explain
List pgsql-hackers
On 2025-04-01 15:23, Rafael Thofehrn Castro wrote:
> Hello again,
> 
>> ERROR:  could not attach to dynamic shared area
> 
> In addition to that refactoring issue, the current patch had a race
> condition in pg_stat_progress_explain to access the DSA of a process
> running a query that gets aborted.
> 
> While discussing with Robert we agreed that it would be wiser to take
> a step back and change the strategy used to share progressive explain
> data in shared memory.
> 
> Instead of using per backend's DSAs shared via a hash structure I now
> define a dsa_pointer and a LWLock in each backend's PGPROC.
> 
> A global DSA is created by the first backend that attempts to use
> the progressive explain feature. After the DSA is created, subsequent
> uses of the feature will just allocate memory there and reference
> via PGPROC's dsa_pointer.
> 
> This solves the race condition reported by Torikoshi and improves
> concurrency performance as now we don't have a global LWLock
> controlling shared memory access, but a per-backend LWLock.
> 
> Performed the same tests done by Torikoshi and it looks like we are
> good now. Even with more frequent inspects in pg_stat_progress_explain
> (\watch 0.01).

Thanks for updating the patch!
Have you tested enabling progressive_explain?

When I ran the 'make installcheck' test again setting 
progressive_explain to on, there was the same assertion failure:

   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 116832
   postgres: parallel worker for PID 116822 
(ExceptionalCondition+0x98)[0xb7311ea8bf80]
   postgres: parallel worker for PID 116822 (+0x89de6c)[0xb7311e9ede6c]
   postgres: parallel worker for PID 116822 (+0x89eb68)[0xb7311e9eeb68]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x8a1d10)[0xb7311e9f1d10]
   postgres: parallel worker for PID 116822 (+0x89ed80)[0xb7311e9eed80]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x89f174)[0xb7311e9ef174]
   postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
   postgres: parallel worker for PID 116822 (+0x89f0b8)[0xb7311e9ef0b8]
   postgres: parallel worker for PID 116822 (+0x8928dc)[0xb7311e9e28dc]
   postgres: parallel worker for PID 116822 
(deparse_expression+0x34)[0xb7311e9e2834]
   postgres: parallel worker for PID 116822 (+0x347870)[0xb7311e497870]
   postgres: parallel worker for PID 116822 (+0x3478e4)[0xb7311e4978e4]
   postgres: parallel worker for PID 116822 (+0x347970)[0xb7311e497970]
   ...
   TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File: 
"ruleutils.c", Line: 8802, PID: 116831
   [115650] LOG:  00000: background worker "parallel worker" (PID 116831) 
was terminated by signal 6: Aborted
   [115650] DETAIL:  Failed process was running: explain (analyze, costs 
off, summary off, timing off, buffers off) select count(*) from ab where 
(a = (select 1) or a = (select 3)) and b = 2
   [115650] LOCATION:  LogChildExit, postmaster.c:2846


We can reproduce it as follows:

   show progressive_explain;
   progressive_explain
  ---------------------
   on

   create table ab (a int not null, b int not null) partition by list 
(a);
   create table ab_a2 partition of ab for values in(2) partition by list 
(b);
   create table ab_a2_b1 partition of ab_a2 for values in (1);
   create table ab_a2_b2 partition of ab_a2 for values in (2);
   create table ab_a2_b3 partition of ab_a2 for values in (3);
   create table ab_a1 partition of ab for values in(1) partition by list 
(b);
   create table ab_a1_b1 partition of ab_a1 for values in (1);
   create table ab_a1_b2 partition of ab_a1 for values in (2);
   create table ab_a1_b3 partition of ab_a1 for values in (3);
   create table ab_a3 partition of ab for values in(3) partition by list 
(b);
   create table ab_a3_b1 partition of ab_a3 for values in (1);
   create table ab_a3_b2 partition of ab_a3 for values in (2);
   create table ab_a3_b3 partition of ab_a3 for values in (3);

   set parallel_setup_cost = 0;
   set parallel_tuple_cost = 0;
   set min_parallel_table_scan_size = 0;
   set max_parallel_workers_per_gather = 2;

   explain (analyze, costs off, summary off, timing off, buffers off) 
select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 
2;
   WARNING:  terminating connection because of crash of another server 
process
   DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
   HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
   server closed the connection unexpectedly
           This probably means the server terminated abnormally
           before or while processing the request.
   The connection to the server was lost. Attempting reset: Failed.


Note that there is no need to access pg_stat_progress_explain.

Could you please check if you can reproduce this?



-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Statistics Import and Export
Next
From: Andres Freund
Date:
Subject: Re: AIO v2.5