Re: crashes due to setting max_parallel_workers=0 - Mailing list pgsql-hackers

From: Rushabh Lathia
Subject: Re: crashes due to setting max_parallel_workers=0
Msg-id: CAGPqQf03BB_Tu=8TaQeT_5tK32JFcxbHoFPvjLkD=xkS+hhRCw@mail.gmail.com
In response to: Re: crashes due to setting max_parallel_workers=0 (Rushabh Lathia <rushabh.lathia@gmail.com>)
Responses: Re: crashes due to setting max_parallel_workers=0 (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers


On Mon, Mar 27, 2017 at 10:59 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:


On Mon, Mar 27, 2017 at 3:43 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/25/2017 05:18 PM, Rushabh Lathia wrote:


On Sat, Mar 25, 2017 at 7:01 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

    On 3/25/17 09:01, David Rowley wrote:
    > On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
    >> Also another point which I think we should fix is: when someone sets
    >> max_parallel_workers = 0, we should also set
    >> max_parallel_workers_per_gather to zero. That way we can avoid
    >> generating the gather path with max_parallel_workers = 0.
    > I see that it was actually quite useful that it works the way it does.
    > If it had worked the same as max_parallel_workers_per_gather, then
    > likely Tomas would never have found this bug.

    Another problem is that the GUC system doesn't really support cases
    where the validity of one setting depends on the current value of
    another setting.  So each individual setting needs to be robust against
    cases of related settings being nonsensical.


Okay.

About the original issue reported by Tomas, I did more debugging and
found that the problem was that gather_merge_clear_slots() was not
returning the cleared slot when nreaders is zero (i.e. nworkers_launched
= 0). Because of that, the scan continued even though all the tuples
were exhausted, and ended up with a server crash in
gather_merge_getnext(). In the patch I also added an Assert in
gather_merge_getnext() that the index should be less than nreaders + 1
(the leader).

PFA simple patch to fix the problem.
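For illustration, the check could look something like this (a sketch
only, assuming the GatherMergeState fields of the time; the attached
patch is authoritative):

    /* In gather_merge_getnext(): valid slot indexes cover the workers
     * plus the leader, so i must stay below nreaders + 1. */
    Assert(i < gm_state->nreaders + 1);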


I think there are two issues at play here: the first is that we still produce parallel plans even with max_parallel_workers=0, and the second is the crash in GatherMerge when nworkers=0.

Your patch fixes the latter (thanks for looking into it), which is obviously a good thing - getting 0 workers on a busy system is quite possible, because all the parallel workers may already be chewing on some other query.


Thanks.
 

While doing more testing with the patch, I found one more server crash
in the same area: when gather merge is forced for a scan that returns
zero rows.

create table dept (deptno numeric, dname varchar(20));
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
set min_parallel_table_scan_size = 0;
set min_parallel_index_scan_size = 0;
set force_parallel_mode = regress;
explain analyze select * from dept order by deptno;

This is because we don't initialize a slot in gm_slots for the leader.
So in the case where zero workers are launched and the table has zero
rows, we end up with a NULL slot in the gm_slots array.
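The relevant setup looks roughly like this (a sketch only, with the slot
indexing simplified and estate/tupDesc taken from the surrounding
function in nodeGatherMerge.c): slots are created only for the workers'
queues, so the leader's entry in gm_slots is never filled in:

    /* Sketch of the gather-merge slot setup: one slot per worker
     * queue; nothing is ever stored for the leader, so its gm_slots
     * entry stays NULL (the array is palloc0'd). */
    for (i = 0; i < gm_state->nreaders; i++)
    {
        gm_state->gm_slots[i] = ExecInitExtraTupleSlot(estate);
        ExecSetSlotDescriptor(gm_state->gm_slots[i], tupDesc);
    }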

Currently gather_merge_clear_slots() clears out the tuple table slot for
each gather merge input and returns a cleared slot. In the patch I
modified gather_merge_clear_slots() to just clear out the tuple table
slots, and to always return NULL once all the queues and the heap are
exhausted.
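A minimal sketch of that change (again assuming the field layout in
nodeGatherMerge.c; the attached patch is authoritative):

    /*
     * Sketch: clear each input's tuple table slot, but no longer hand
     * back a "cleared" slot; once all the queues and the heap are
     * exhausted, the caller simply receives NULL.
     */
    static TupleTableSlot *
    gather_merge_clear_slots(GatherMergeState *gm_state)
    {
        int     i;

        for (i = 0; i < gm_state->nreaders; i++)
            ExecClearTuple(gm_state->gm_slots[i]);

        return NULL;
    }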

 
But it seems a bit futile to produce the parallel plan in the first place, because with max_parallel_workers=0 we can't possibly get any parallel workers ever. I wonder why compute_parallel_worker() only looks at max_parallel_workers_per_gather, i.e. why shouldn't it do:

   parallel_workers = Min(parallel_workers, max_parallel_workers);


I agree with you here. Producing a parallel plan when
max_parallel_workers = 0 is wrong. But rather than your suggested fix, I
think we should do something like:

    /*
     * In no case use more than max_parallel_workers_per_gather or
     * max_parallel_workers.
     */
    parallel_workers = Min(parallel_workers,
                           Min(max_parallel_workers,
                               max_parallel_workers_per_gather));
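For example, with max_parallel_workers = 0 and
max_parallel_workers_per_gather = 2, this gives
Min(parallel_workers, Min(0, 2)) = 0, so no Gather or Gather Merge path
is generated in the first place.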

 
Perhaps this was discussed and is actually intentional, though.


Yes, I am not quite sure about this.

Regarding handling this at the GUC level - I agree with Peter that it's not a good idea. I suppose we could deal with checking the values in the GUC check/assign hooks, but what we don't have is a way to undo the changes to all the related GUCs. That is, if I do

   SET max_parallel_workers = 0;
   SET max_parallel_workers = 16;

I expect to end up with just the max_parallel_workers GUC changed and nothing else.
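A hypothetical assign hook (a sketch only, not a proposal; the hook name
and body are made up for illustration) shows why: forcing the related
GUC from the hook leaves no record of its previous value, so the second
SET cannot restore it:

    /*
     * Hypothetical assign hook for max_parallel_workers. If setting it
     * to 0 forced max_parallel_workers_per_gather to 0 as well, setting
     * it back to 16 afterwards could not restore the user's original
     * per-gather value - it has been clobbered.
     */
    static void
    assign_max_parallel_workers(int newval, void *extra)
    {
        if (newval == 0)
            max_parallel_workers_per_gather = 0;
    }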

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Rushabh Lathia



Regards,
Rushabh Lathia
