Re: crashes due to setting max_parallel_workers=0 - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: crashes due to setting max_parallel_workers=0
Date
Msg-id CAGPqQf2=s2i578Lw5fao+EaN_FJXTXQ5MGdvJq8ECZofGOxveA@mail.gmail.com
In response to Re: crashes due to setting max_parallel_workers=0  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers


On Mon, Mar 27, 2017 at 9:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 27, 2017 at 12:11 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 28 March 2017 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 25, 2017 at 12:18 PM, Rushabh Lathia
>> <rushabh.lathia@gmail.com> wrote:
>>> About the original issue reported by Tomas, I did more debugging and
>>> found that the problem was that gather_merge_clear_slots() was not
>>> returning a cleared slot when nreaders is zero (i.e. nworkers_launched = 0).
>>> Because of that, the scan continued even after all the tuples were
>>> exhausted, and we ended up with a server crash in gather_merge_getnext().
>>> In the patch I also added an Assert into gather_merge_getnext() that the
>>> index should be less than nreaders + 1 (the leader).
>>
>> Well, you and David Rowley seem to disagree on what the fix is here.
>> His patches posted upthread do A, and yours do B, and from a quick
>> look those things are not just different ways of spelling the same
>> underlying fix, but actually directly conflicting ideas about what the
>> fix should be.  Any chance you can review his patches, and maybe he
>> can review yours, and we could try to agree on a consensus position?
>> :-)
>
> When comparing Rushabh's to my minimal patch, both change
> gather_merge_clear_slots() to clear the leader's slot. My fix
> mistakenly changes things so it does ExecInitExtraTupleSlot() on the
> leader's slot, but it seems that's not required since
> gather_merge_readnext() sets the leader's slot to the output of
> ExecProcNode(outerPlan). I'd ignore my minimal fix because of that
> mistake. Rushabh's patch sidesteps this by adding a conditional
> pfree() so as not to free a slot that we didn't allocate in the first place.
>
> I do think the code could be improved a bit. I don't like the way
> GatherMergeState's nreaders and nworkers_launched are always the same.
> I think this all threw me off a bit and may have been the reason for
> the bug in the first place.

Yeah, if those fields are always the same, then I think that they
should be merged.  That seems undeniably a good idea. 

Hmm, I agree that it's a good idea, and I will work on that as a separate patch.
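For illustration only (the struct name follows the PostgreSQL source, but this is a hedged sketch of the suggested simplification, not an actual patch): merging the two always-equal fields would leave GatherMergeState with a single count, from which any loop over all participants derives its bound:

```c
#include <assert.h>

/* Illustrative sketch: a trimmed-down GatherMergeState in which the
 * redundant nworkers_launched field is folded into nreaders, so there
 * is one source of truth for how many worker readers exist. */
typedef struct GatherMergeState
{
    int nreaders;   /* workers actually launched; the leader is extra */
    /* ... readers, gm_slots[nreaders + 1], binary heap, etc. ... */
} GatherMergeState;

/* With a single field, every loop over all participants has one
 * obvious bound instead of two interchangeable ones. */
static int total_participants(const GatherMergeState *gm_state)
{
    return gm_state->nreaders + 1;   /* workers plus the leader */
}
```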
 
Possibly we
should fix the crash bug first, though, and then do that afterwards.
What bugs me a little about Rushabh's fix is that it looks like magic.
You have to know that we're looping over two things and freeing them
up, but there's one more of one thing than the other thing.  I think
that at least needs some comments or something.


So in my second version of the patch I changed gather_merge_clear_slots() to
clear only the worker slots, plus some other cleanup. gather_merge_getnext()
now also returns NULL when all the queues and the heap are exhausted, where
earlier gather_merge_clear_slots() returned a cleared slot for that purpose.
This way we make sure that we don't run past the worker slots and free the
slot for the leader, and gather_merge_getnext() doesn't need to depend on
that cleared slot.
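To make the mechanics concrete, here is a minimal, self-contained C sketch of the shape of that second version. The types are mocks and the bodies are illustrative, not the actual executor code, though the function names match the real ones: clear only the worker slots, and have gather_merge_getnext() return NULL itself once everything is exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for PostgreSQL's TupleTableSlot and GatherMergeState;
 * this sketches the shape of the fix, not the real server code. */
typedef struct TupleTableSlot { int empty; } TupleTableSlot;

typedef struct GatherMergeState
{
    int nreaders;                 /* worker readers; may be 0 */
    TupleTableSlot *gm_slots[8];  /* [0] is the leader, [1..nreaders] workers */
} GatherMergeState;

static void ExecClearTuple(TupleTableSlot *slot) { slot->empty = 1; }

/* Clear only the worker slots: the leader's slot (gm_slots[0]) comes
 * from ExecProcNode(outerPlan) and is not freed or cleared here. */
static void gather_merge_clear_slots(GatherMergeState *gm_state)
{
    for (int i = 1; i <= gm_state->nreaders; i++)
        ExecClearTuple(gm_state->gm_slots[i]);
}

/* When the queues and heap are exhausted, return NULL directly instead
 * of depending on gather_merge_clear_slots() handing back a cleared
 * slot. This also behaves sanely when nreaders == 0, the
 * nworkers_launched = 0 case from the original report. */
static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state,
                                            int heap_empty)
{
    if (heap_empty)
    {
        gather_merge_clear_slots(gm_state);
        return NULL;              /* end of scan */
    }
    /* ... normal path: pop the binary heap and return that tuple ... */
    return gm_state->gm_slots[0];
}
```

With this shape, the end-of-scan signal is the NULL return itself, so nothing downstream has to know how many slots were allocated versus how many participants there were.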


Thanks,
Rushabh Lathia
