... I was doing more testing with the patch and I found one more server crash with the patch around same area, when we forced the gather merge for the scan having zero rows.
create table dept ( deptno numeric, dname varchar(20); set parallel_tuple_cost =0; set parallel_setup_cost =0; set min_parallel_table_scan_size =0; set min_parallel_index_scan_size =0; set force_parallel_mode=regress; explain analyze select * from dept order by deptno;
This is because for leader we don't initialize the slot into gm_slots. So in case where launched worker is zero and table having zero rows, we end up having NULL slot into gm_slots array.
Currently gather_merge_clear_slots() clear out the tuple table slots for each gather merge input and returns clear slot. In the patch I modified function gather_merge_clear_slots() to just clear out the tuple table slots and always return NULL when All the queues and heap us exhausted.
Isn't that just another sign the code might be a bit too confusing? I see two main issues in the code:
1) allocating 'slots' as 'nreaders+1' elements, which seems like a good way to cause off-by-one errors
2) mixing objects with different life spans (slots for readers vs. slot for the leader) seems like a bad idea too
I wonder how much we gain by reusing the slot from the leader (I'd be surprised if it was at all measurable). David posted a patch reworking this, and significantly simplifying the GatherMerge node. Why not to accept that?
I think we all agree that we should get rid of nreaders from the GatherMergeState
and need to do some code re-factor. But if I understood correctly that Robert's
concern was to do that re-factor as separate commit.
I picked David's patch and started reviewing the changes. I applied that patch
on top of my v2 patch (which does the re-factor of gather_merge_clear_slots).
In David's patch, into gather_merge_init(), a loop where tuple array is getting
allocate, that loop need to only up to nworkers_launched. Because we don't
hold the tuple array for leader. I changed that and did some other simple
changes based on mine v2 patch. I also performed manual testing with the changes.
Please find attached re-factor patch, which is v2 patch submitted for the
server crash fix. (Attaching both the patch here again, for easy of access).