Re: crashes due to setting max_parallel_workers=0 - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: crashes due to setting max_parallel_workers=0
Date
Msg-id CAGPqQf0CcH+cyU=WK+bUvwgFEVFdGPFaBC2RMRz_hPCEKEFhwg@mail.gmail.com
Whole thread Raw
In response to crashes due to setting max_parallel_workers=0  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: crashes due to setting max_parallel_workers=0  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers


On Tue, Mar 28, 2017 at 10:00 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:


On 03/27/2017 01:40 PM, Rushabh Lathia wrote:

...
I was doing more testing with the patch and I found one more server
crash with the patch around same area, when we forced the gather
merge for the scan having zero rows.

create table dept ( deptno numeric, dname varchar(20);
set parallel_tuple_cost =0;
set parallel_setup_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set force_parallel_mode=regress;
explain analyze select * from dept order by deptno;

This is because for leader we don't initialize the slot into gm_slots. So
in case where launched worker is zero and table having zero rows, we
end up having NULL slot into gm_slots array.

Currently gather_merge_clear_slots() clear out the tuple table slots for each
gather merge input and returns clear slot. In the patch I modified function
gather_merge_clear_slots() to just clear out the tuple table slots and
always return NULL when All the queues and heap us exhausted.


Isn't that just another sign the code might be a bit too confusing? I see two main issues in the code:

1) allocating 'slots' as 'nreaders+1' elements, which seems like a good way to cause off-by-one errors

2) mixing objects with different life spans (slots for readers vs. slot for the leader) seems like a bad idea too

I wonder how much we gain by reusing the slot from the leader (I'd be surprised if it was at all measurable). David posted a patch reworking this, and significantly simplifying the GatherMerge node. Why not to accept that?



I think we all agree that we should get rid of nreaders from the GatherMergeState
and need to do some code re-factor. But if I understood correctly that Robert's
concern was to do that re-factor as separate commit.

I picked David's patch and started reviewing the changes. I applied that patch
on top of my v2 patch (which does the re-factor of gather_merge_clear_slots).

In David's patch, into gather_merge_init(), a loop where tuple array is getting
allocate, that loop need to only up to nworkers_launched. Because we don't
hold the tuple array for leader. I changed that and did some other simple
changes based on mine v2 patch. I also performed manual testing with
the changes.

Please find attached re-factor patch, which is v2 patch submitted for the
server crash fix. (Attaching both the patch here again, for easy of access).

Thanks,


--
Rushabh Lathia
Attachment

pgsql-hackers by date:

Previous
From: Mithun Cy
Date:
Subject: Re: [POC] A better way to expand hash indexes.
Next
From: Pavel Stehule
Date:
Subject: xmltable doc fix and example for XMLNAMESPACES